


Extraction of Disease Events for a Real-time
Monitoring System
Minh-Tien Nguyen

Tri-Thanh Nguyen

Hung Yen University of Technology and
Education (UTEHY).
Knowledge Technology Laboratory (KT-Lab).

Vietnam National University, Hanoi (VNUH),
University of Engineering and Technology (UET).
Knowledge Technology Laboratory (KT-Lab).





ABSTRACT
In this paper, we propose a method that uses both semantic rules and machine learning to extract infectious disease events from Vietnamese electronic news, which can be used in a real-time system for monitoring the spread of diseases. Our method contains two important steps: detecting disease events in unstructured data and extracting the information of the disease events. The event detection step uses semantic rules and machine learning to detect a disease event; in the latter step, Named Entity Recognition (NER), rules, and dictionaries are used to capture the event's information. The performance of the detection step is ≈77.33% (F-score) and the F-score of the extraction step is ≈91.89%. These results are better than those of the experiments in which rules were not used. This indicates that our method is suitable for extracting disease events from Vietnamese text.

Categories and Subject Descriptors


H.2.8 [Database Applications]: Data Mining

General Terms
Data Mining; Information Extraction

Keywords
Data Mining; Information Extraction; Event Extraction; Disease Event Extraction; Monitoring Systems

1.

INTRODUCTION

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
SoICT'13, December 05-06, 2013, Danang, Viet Nam
Copyright 2013 ACM 978-1-4503-2454-0/13/12 $15.00.

Information from electronic newspapers provides valuable input for public health surveillance, early outbreak detection, and disease monitoring systems. When the presence of a disease is announced by the government and published on a webpage, it is typically called a disease event or an infectious disease outbreak. Unfortunately, electronic resources on infectious diseases are multidimensional, chaotic, and not well organized, so extracting useful patterns from these sources is challenging. This paper focuses on two questions: "How can an infectious disease event be detected?" and "How can the information of an infectious disease event be extracted?"
Disease detection and the monitoring of disease spread and outbreaks are important issues for society, especially when the diseases are dangerous and highly infectious. Because an infectious disease normally breaks out in a short time and spreads very quickly over a large area, it can create emergency circumstances not only for citizens, but also for the government and the economy. Therefore, monitoring infectious disease outbreaks is crucial for preventing and handling diseases and for helping the authorities make suitable decisions.
In this paper, we propose a model to automatically detect and extract information about human infectious disease events from Vietnamese webpages based on semantic rules and machine learning. The model includes two important components: disease event detection and disease event extraction. In the first component, an infectious disease event is detected in free text; after that, the information of the event (time, disease name, and locations) is extracted in the second component. Subsequently, we combine the extracted information to form an infectious disease event. These infectious disease events can be the input for the visualization in our monitoring system.
Our paper is organized as follows: related work is presented in Section 2; our method is discussed in Section 3, in which event detection is described in Section 3.3 and event extraction in Section 3.4. Section 4 presents the experiments and results and explains the source of some errors appearing in our research. The last section is the conclusion.

2.

RELATED WORK

Event extraction was first introduced as an important topic in 1987 at the Message Understanding Conference (MUC) [11]. In MUC, an event is defined as follows: "an event must have actor, time, place and impact on the surrounding environment". Later, in the Automatic Content Extraction (ACE) program, Doddington George R., et al. gave an event definition: "an event is an activity that was created by participants" and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel and Justice [7]. Moreover, as Allan J., et al. stated, an event includes four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic), and tense (Past, Present, Future, Unspecified) [10]. Grishman R., et al. gave the definition of a disease event as a template: Disease Name, Date, Location, Victim Number, Victim Descriptor, Victim Status, Victim Type, Parent Event [9].
Hogenboom F., et al. provided a general guideline on how
to select a suitable method for event extraction purpose [2].
The guideline indicated that event extraction approaches
can be listed as data-driven, knowledge-driven, and hybrid.
Each approach has both advantages and disadvantages. Hogenboom F., et al. compared the benefits and drawbacks among
these methods. Finally, the authors pointed out the hybrid
approach prevails.
Event extraction from unstructured text can be applied in many fields, especially in the disease domain. Grishman R., et al. used 120 linguistic event patterns to analyze sentences and capture the information of a disease event [9]. These linguistic patterns were built on word classes and the relations among them. For example, the pattern "np (DISEASE) vg (KILL) np (VICTIM)" will match a clause like "Cholera killed 23 inhabitants". An event is recognized based on the trigger of two noun phrases: "outbreak of..." and "people died from...". These patterns were applied to extract disease events and achieved an F-score of ≈53.98%. Normally, linguistic patterns can achieve good results if they cover the whole dataset, but preparing them is time-consuming and requires domain experts. Moreover, the patterns must be changed when the data fluctuate. Finally, because the patterns were built on word classes, the authors had to identify word classes (e.g., noun phrase, verb phrase, etc.), which is more challenging in some other languages (e.g., Vietnamese or Chinese). Because of these drawbacks, we do not follow this approach.
Volkova S., et al. combined entity recognition and sentence classification to extract animal disease events [4]. The event recognition consists of three main steps: first, entities are recognized in unstructured text; second, sentences are classified based on these entities; finally, the entities within an event sentence are combined into a structured tuple. In the event recognition, true events should contain a disease name and a disease-related verb. The authors obtained precisions of 75% and 65% in event tuple recognition and sentence classification, respectively, with features from WordNet and the Google-Set corpus. However, using a list of verbs to confirm an event can badly affect event extraction in Vietnamese, because Natural Language Processing (NLP) resources (such as a Vietnamese WordNet or a Google-Set-like corpus for Vietnamese) are lacking and the performance of parsing utilities is not high enough. Thus, we do not use this method.
Doan S., et al. built the Global Health Monitor system, which shows the state of disease spread around the world [5]. The system includes three main steps: topic classification, Named Entity Recognition (NER), and disease/location detection. A Naïve Bayes classifier is used in topic classification with a precision of ≈88.10%; the entity recognition step, which uses a Support Vector Machine (SVM), achieved an F-score of ≈76.97%; and the final step achieved a precision of ≈93.40% with the BioCaster Ontology. However, this system has some limitations. The first is location ambiguity: because some locations are not mentioned clearly in the input data (only a province or city is given, without the country name), the system cannot recognize the location exactly. Furthermore, the BioCaster system cannot detect new diseases or locations that are not in the ontology.

Our approach uses the advantages of both the semantic rule-based method and machine learning in two main components: event detection and event extraction. In the event detection, the semantic rules play the role of a data filter, while the classification model decides whether a news article contains an event or not. Because our rules are used as a filter, they are simpler than those in the research of Grishman R., et al. [9]. A rule in our study is a short phrase composed of a noun phrase and a verb phrase instead of a complete sentence. Moreover, we do not use a list of verbs to confirm events as Volkova S., et al. did [4], because this method typically depends on the coverage of the verbs, and building the verb list takes much time. In the event extraction, our approach is similar to the method of Doan S., et al. [5]. We use rules, a disease dictionary, NER, and a location dictionary to extract the information of a disease event.
In addition, there are several systems that extract events from online news. Grishman R., et al. built the Proteus-BIO system, in which users can follow infectious diseases [8]. Data in this system are collected from webpages and from disease reports of the World Health Organization and ProMed. Collier N., et al. built the BioCaster system, which follows several event types, especially disease events, around the world. Similarly, HealthMap was built by Freifeld Clark C., et al.; with it, users can monitor diseases all over the world [6].

3.

INFECTIOUS DISEASE EVENT DETECTION AND EXTRACTION

3.1


Infectious Disease Event Characteristics

An investigation of our data domain indicates that an infectious disease event may contain a disease name, time, locations, and victims. In some cases, it may have additional information such as the methods or the environment of infection. Although Grishman R., et al. [9] used a disease name, the time and the location of the outbreak, the number of affected victims, and the type of victims as the information of a disease event, we focus on only three basic pieces of information: the time and locations of the outbreak and the name of the infectious disease. We ignore the method and environment information because we collect data from webpages instead of medical reports, so such information is not clearly mentioned in most cases. Moreover, an event in MUC must include an actor [11]; in our study, the actor is equivalent to a disease, and therefore we use the disease name instead of the actor.
In addition, a closer examination of disease news articles showed that a disease name is sometimes the same as a symptom name, which is one source of confusion in the event extraction. For example, 'pneumonia' is a symptom of 'bird flu' (A/H5N1), but it was recognized as a disease in some cases.

3.2

Problem Definition

The infectious disease event extraction problem can be
defined as follows:
Input: a news article.
Output: whether the news article contains an infectious
disease event or not? If yes, extract information of the event.
Figure 1: Steps of disease event extraction

Figure 2: Event detector components

In our research, an infectious disease event E is defined as a tuple that has three elements:

E = <name, time, place>    (1)

where name is the name of the infectious disease mentioned in the disease news article; time is the time when the disease breaks out; and place is the set of locations where the disease appears.
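As a rough illustration only, the tuple in Formula (1) could be held in a small immutable record. The class name `DiseaseEvent` and the choice of `date` and `frozenset` for the fields are assumptions made for this sketch; the paper does not specify a data representation.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class DiseaseEvent:
    """Hypothetical container for the tuple E = <name, time, place>."""

    name: str   # disease name mentioned in the article
    time: date  # date on which the disease breaks out
    # Set of locations where the disease appears (assumed to be plain strings).
    place: frozenset = field(default_factory=frozenset)


# Values taken from Example 3 later in the paper.
event = DiseaseEvent(
    name="cúm A/H5N1",
    time=date(2012, 3, 12),
    place=frozenset({"Quảng Ngãi"}),
)
```

Using a set for `place` mirrors the definition above, in which one event may cover several locations.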
We propose a process to extract the information of a disease event, as illustrated in Figure 1. The extraction process includes five components: the crawler retrieves data from the Internet; the pre-processing component extracts the main content from the web pages returned by the crawler (the detail of this module is described in Section and Table 3); the event detector decides whether a news article contains a disease event or not; the event extractor captures the information of the event in a given news article (if any); finally, the visualization component plots the disease events on an online Geographic Information System (GIS) map.
In this paper, we focus on two key components, the event detector and the event extractor, which are described in detail in Section 3.3 and Section 3.4.

3.3

Event Detector

The goal of the Event Detector is to judge whether a given news article contains a disease event. When a news article is given, the detector labels it as containing a disease event (EVENT) or not (NOT_EVENT) by using rules (for title filtering) and machine learning (for classification). The process of the event detector is illustrated in Figure 2.
The event detector component consists of two modules: a data filter and a classifier. The filter module receives data from the pre-processing component, where HTML tags are removed to get the main content. This module then filters disease news articles by checking their titles. Subsequently, the data is passed to the classifier, which decides whether a news article contains an event or not.

3.3.1

Filtering Rules

As mentioned above, the event detection component has two modules, a data filter and a classifier, in which the filter uses semantic rules to reduce the number of news articles for later classification. We examined the domain data carefully and found that most news titles express the main content of the article. This means that the title of a news article has enough evidence to trigger the existence of a disease event. Therefore, we use rules on titles to filter disease-related news articles.
We collected statistics on a large dataset of news articles from the "Sức khỏe" (Health) category of the "Báo mới" news website to find a set of frequent words (and phrases). The number of frequent words is 34, and some of the most frequent words are given in Table 1, where the third column counts the number of articles containing the corresponding word in the second column. We denote this set as the Frequent-words set.
We observed that most news articles relating to a disease event contain words in the Frequent-words set. Therefore, our idea is to build semantic rules by combining words in the Frequent-words set in order to filter the input data. As a result, we propose two patterns, named Pattern 1 and Pattern 2, that represent all our semantic rules. These patterns are shown below:

Pattern 1 = noun phrase # verb phrase    (2)

where noun phrase and verb phrase are in the Frequent-words set.
Pattern 1 is illustrated in Example 1.
Example 1:
bệnh nhân tử vong # nhiễm (patient died # infected)
dịch tả # bùng phát (cholera # broke out)

Table 1: List of frequent-words

No.  Word                           Articles
1    Nhiễm (infect)                 10005
2    Dịch (disease)                 10000
3    Dương tính (is positive)       5269
4    Lây lan (spread)               4133
5    Bùng phát (outbreak)           4039
6    Tái phát (recurrence)          2514
7    Ổ bệnh (source of infection)   2340
8    Ổ dịch (disease source)        1900
9    Dịch tả (cholera)              1853
10   Khử trùng (disinfection)       1143

Pattern 2 = disease name # verb phrases    (3)

where:
• disease name is retrieved from the BioCaster Ontology [3] and the circular of the Ministry of Health of Vietnam, dated June 24th, 2011;
• verb phrases are in the Frequent-words set.
Examples of rules following Pattern 2 are given in Example 2.
Example 2:
tiêu chảy cấp # nhiễm (acute diarrhea # infected)
tiêu chảy cấp # phát hiện (acute diarrhea # discovered)
tiêu chảy cấp # lây lan (acute diarrhea # spread)
tiêu chảy cấp # bùng phát (acute diarrhea # broke out)
tiêu chảy cấp # chết (tử vong) (acute diarrhea # died)
tiêu chảy cấp # dương tính (acute diarrhea # is positive)
Both patterns have two elements, which are separated by the character "#". We built 43 rules from Pattern 1 by mixing 52 noun phrases and 10 verb phrases. Both these noun phrases and verb phrases are in the Frequent-words set. Similarly, we used a disease name and a verb phrase to create a rule following Pattern 2. With 186 disease names from the disease dictionary and 6 verb phrases in the Frequent-words set, the number of rules conforming to Pattern 2 is 186. Some verb phrases in Pattern 1 and Pattern 2 are the same.
After building the rule set, we had 229 rules in total. The related articles are retrieved by these rules and passed to the classifier.

Table 2: List of features

No.  Feature
1    Dịch tay chân miệng (hand, foot and mouth disease)
2    Tiêu chảy (diarrhea)
3    Trẻ tử vong (the child died)
4    Ổ dịch (disease source)
5    Dương tính (is positive)
6    Dịch cúm gia cầm (bird flu)
7    Ca tử vong (deaths)
8    Bùng phát dịch (outbreak)
9    Dịch cúm (flu)
10   Bệnh nhân tử vong (the patient died)

Figure 3: Event extractor component
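The title filter of Section 3.3.1 can be sketched as follows. The paper does not state how a title is matched against a rule, so this sketch assumes that a rule fires when both parts of a "left # right" rule occur anywhere in the title, ignoring case; the function names `matches_rule` and `filter_titles` and the sample titles are hypothetical.

```python
def matches_rule(title: str, rule: str) -> bool:
    """Return True if both halves of a 'left # right' rule occur in the title.

    Assumption: plain case-insensitive substring matching, in any order.
    """
    left, right = (part.strip().lower() for part in rule.split("#"))
    text = title.lower()
    return left in text and right in text


def filter_titles(titles, rules):
    """Keep only the titles that satisfy at least one rule."""
    return [t for t in titles if any(matches_rule(t, r) for r in rules)]


# Two rules in the "#" notation of Pattern 1 and Pattern 2.
RULES = ["dịch tả # bùng phát", "tiêu chảy cấp # lây lan"]

# Made-up titles for illustration only.
TITLES = [
    "Dịch tả bùng phát tại Hà Nội",
    "Giá vàng hôm nay tăng mạnh",
]

kept = filter_titles(TITLES, RULES)
```

Under these assumptions only the first title is kept. Substring matching favors recall over precision, which is in line with the role of the filter as a coarse first stage before classification.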

3.3.2

Machine Learning Application

The classification model assigns a news article either the EVENT or the NOT_EVENT label. Our investigation of the input data suggests that the title and abstract of a disease news article carry enough information to represent its content; therefore, these elements are used to create the feature vector. In the data preparation step, articles are manually tagged with the label EVENT or NOT_EVENT. After that, features are generated by using 2-grams, 3-grams, and 4-grams. As a result, we obtain 4,552 features, which are used for classification. Some features are shown in Table 2.
We used the Maximum Entropy Model as the classifier. The news articles that are labeled EVENT become the input for the Event Extractor component.
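The n-gram feature generation can be illustrated with a minimal sketch. It assumes that n-grams are taken over whitespace-separated tokens of the title and abstract; the paper word-segments Vietnamese text first, so the real tokens may span several syllables. The function name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, sizes=(2, 3, 4)):
    """Return all contiguous n-grams of the given sizes as space-joined strings."""
    grams = []
    for n in sizes:
        # A sequence of k tokens yields k - n + 1 n-grams (none if k < n).
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Whitespace tokenization is an assumption made for this illustration.
tokens = "dịch cúm gia cầm bùng phát".split()
features = word_ngrams(tokens)
```

For the six tokens above this produces 5 + 4 + 3 = 12 candidate features, including "dịch cúm" and "dịch cúm gia cầm", which appear in Table 2.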

3.4

Event Extractor

Event Extractor is one of two important components where
the information of a disease event is extracted. The event
extraction component is illustrated in Figure 3.

Event extraction includes three modules: time extraction,
disease extraction, and location extraction. The first module uses rules to extract the time information; the second
module utilizes a disease dictionary extracting the disease
information; and the final module combines NER and a location dictionary to capture place information. Finally, we
combine the extracted information to form a disease event
and store it in an event database.

3.4.1

Time Extraction

Our investigation of the dataset suggests that time information can be captured by rules and that it is either absolute or relative. In the absolute case, the time has the format DD/MM/YYYY, so we use a Regular Expression (RE) to extract it. In the relative case, it always contains two elements: a prefix and the time. The prefix is a set of words that indicates relative time, and the time is usually in the Vietnamese date form DD/MM/YYYY. Therefore, we use a rule [1] to calculate the absolute time. The time rule is shown in Formula (4).

TIME = <RELATIVE TIME> + <DATE TIME>    (4)

where:
• RELATIVE TIME = vào (on), ngày (date), sáng (morning), hôm nay (today), sáng hôm nay (this morning), chiều (afternoon), hôm qua (yesterday), tối qua (yesterday evening), rạng sáng (early morning), tháng (month).
• DATE TIME has the format DD/MM/YYYY, which is either the date expressed in the article content or the published date.
Example 3 and Example 4 illustrate the use of the Regular Expression and the time rule to extract the time information.
Example 3:
"Ngày 12/03/2012, Bộ Y tế công bố dịch cúm A H5N1 đã tái phát tại Quảng Ngãi." (On March 12th, 2012, the Ministry of Health announced that the A H5N1 flu had recurred in Quang Ngai.)
Example 4:
"Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo bệnh nhân đầu tiên nhiễm cúm A/H5N1 đã tử vong" (In the morning of January 15th, 2012, the Hanoi Health Department announced that the first patient infected with A/H5N1 flu had died.)
The time information in Example 3 is captured by the Regular Expression, while in Example 4 it is extracted by Formula (4). As a result, the time information in Example 3 is March 12th, 2012, whereas in Example 4 it is the morning of January 15th, 2012.
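A minimal sketch of this time extraction is given below. It assumes that the relative-time prefix, when present, sits directly before the DD/MM/YYYY date; the regular expressions and the function name `extract_time` are illustrative and are not the authors' actual rules. Resolving a prefix such as "hôm qua" against the published date is not attempted here.

```python
import re
from datetime import date

# Prefix words from the RELATIVE TIME list; longer phrases come first so that
# "sáng hôm nay" is tried before "sáng".
PREFIXES = [
    "sáng hôm nay", "rạng sáng", "hôm nay", "hôm qua", "tối qua",
    "tháng", "chiều", "sáng", "ngày", "vào",
]
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
# One or more prefix words immediately before the date.
PREFIX_RE = re.compile(
    r"((?:(?:" + "|".join(PREFIXES) + r")\s+)+)$", re.IGNORECASE
)


def extract_time(sentence):
    """Return (prefix, date) for the first DD/MM/YYYY date, or None.

    Note: date() raises ValueError for impossible dates such as 31/02/2012.
    """
    m = DATE_RE.search(sentence)
    if m is None:
        return None
    day, month, year = (int(g) for g in m.groups())
    p = PREFIX_RE.search(sentence[:m.start()])
    prefix = p.group(1).strip().lower() if p else ""
    return prefix, date(year, month, day)


ex3 = extract_time("Ngày 12/03/2012, Bộ Y tế công bố dịch cúm A H5N1 đã tái phát tại Quảng Ngãi.")
ex4 = extract_time("Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo bệnh nhân đầu tiên nhiễm cúm A/H5N1 đã tử vong")
```

With these inputs, `ex3` carries the prefix "ngày" and `ex4` the prefix "sáng ngày", each paired with the parsed date.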

3.4.2

Disease Extraction

Disease extraction is the second module, which captures the disease name. In the process of Figure 1, the pre-processing component tokenizes and word-segments the content of articles. As a result, each article has a list of words, and these words are the input for this module. The disease extraction module uses a disease dictionary of 186 disease names for the extraction.
The extraction process can be described in two steps: finding the longest phrase that can be a name candidate, and matching the candidate against the original article to check whether it is a correct name. The finding step uses the longest matching method to match a word (in an article) with a disease name (from the disease name dictionary). If a disease name contains a given word, then it is a disease name candidate. In the matching step, the candidate is checked to see whether it appears in the original article; only a candidate that appears in the article is accepted as correct. The disease extraction process is illustrated in Example 5.
Example 5:
"Dịch cúm A/H5N1 bùng phát tại Bến Tre" (A/H5N1 flu breaks out in Ben Tre).
After tokenizing and word-segmenting, we retrieve two words related to disease: cúm (flu) and A/H5N1. The finding step matches these words against the disease dictionary to find the longest name. As a result, for the word cúm (flu) we retrieve three names: cúm (flu), cúm A/H5N1 (A/H5N1 flu), and cúm gia cầm (bird flu), while for the word A/H5N1 we have only one name: cúm A/H5N1 (A/H5N1 flu). In the next step, the matching process checks these names against the original article to find the correct result. In this example, the longest item is cúm gia cầm (bird flu), but it does not appear in Example 5, so this disease is ignored. The second longest name is cúm A/H5N1 (A/H5N1 flu), and the matching process recognizes that it is in the original article. Therefore, it is the correct disease name, and the value of the disease information is cúm A/H5N1 (A/H5N1 flu).
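The two steps of Example 5 can be sketched as below. The three-entry dictionary stands in for the 186-name disease dictionary, "longest" is assumed to mean the largest number of characters, and the function name `extract_disease` is hypothetical.

```python
# Tiny stand-in for the disease dictionary (the real one has 186 names).
DISEASES = ["cúm", "cúm A/H5N1", "cúm gia cầm"]


def extract_disease(article, words, dictionary):
    """Return the longest dictionary name that occurs in the article, or None."""
    # Finding step: every dictionary name that contains one of the words
    # becomes a candidate.
    candidates = {
        name
        for name in dictionary
        for w in words
        if w.lower() in name.lower()
    }
    # Matching step: try candidates from longest to shortest and accept the
    # first one that literally appears in the original article.
    text = article.lower()
    for name in sorted(candidates, key=len, reverse=True):
        if name.lower() in text:
            return name
    return None


article = "Dịch cúm A/H5N1 bùng phát tại Bến Tre"
words = ["cúm", "A/H5N1"]  # disease-related words after segmentation
result = extract_disease(article, words, DISEASES)
```

As in Example 5, the longest candidate "cúm gia cầm" is rejected because it does not occur in the article, and "cúm A/H5N1" is returned.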

3.4.3


Location Extraction

Building the final module is more challenging than building the two previous ones because of the ambiguity among locations. In fact, several places can have the same proper name (e.g., "Dong Hai" town is a location in both "Tra Vinh" and "Quang Ninh" provinces). Therefore, in some cases, if a news article does not mention locations clearly, the place information can be confused. To deal with this issue, we combined NER and a location dictionary to improve the performance of location extraction.
The location extraction process can be described in three steps: NER, location extraction, and normalization. First, the NER is applied to detect location entities in a given news article. As a result, locations in the article are labeled by a pair of <LOC> and </LOC> tags. Second, we extract the locations based on these tags. In the final step, each location is normalized by looking it up in the location dictionary, which is described in detail below.
We used a location dictionary that is organized as a taxonomy, shown in Figure 4, where:
• T is the abbreviation of the town
• C is the abbreviation of the commune

Figure 4: The location dictionary taxonomy

In this taxonomy, the highest level is the root node; level 1 represents 63 provinces; 692 districts are in level 2; and 11,101 towns and communes are represented by nodes in level 3. If a phrase inside the <LOC> and </LOC> tags matches the value of a node, then that node is marked and the complete location is the path from this node to the root node. This organization makes it efficient to identify the relations between communes, towns, and provinces and helps to avoid geo-ambiguity. The efficiency of the taxonomy is shown in Example 6.
Example 6:
"Ngày 12/04/2013, Sở y tế Quảng Ngãi thông báo dịch cúm A H5N1 đã bùng phát tại thị trấn Sông Vệ" (On April 12th, 2013, the Department of Health of Quang Ngai announced an A H5N1 flu outbreak in Song Ve town).
This example mentions only the town where the A H5N1 flu breaks out ("Song Ve" town), while the district and the province are absent. In the process of location extraction, this sentence is parsed by the NER, and "Song Ve" is labeled by the <LOC> and </LOC> tags, while "Quang Ngai" is recognized as an organization entity (<ORG>). Ordinarily, after retrieving the location (inside the <LOC> and </LOC> tags), "Song Ve" would be the location information. However, "Song Ve" alone does not carry enough information to identify a real location on a GIS map, since it is not complete. To solve this problem, we look up "Song Ve" in the location taxonomy. When the node having this value is found, we traverse from this node to the root node of the taxonomy to extract the complete information (i.e., "Song Ve" town, "Tu Nghia" district, and "Quang Ngai" province). This step is called location normalization.
Finally, the extracted time, disease name, and locations from the article are combined to create an infectious disease event, in which the set of locations found in this module forms the place component of the event. The event is stored in an event database, which is used by the visualization component in a real-time monitoring system.
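The tag extraction and normalization steps of Example 6 can be sketched as follows. The nested-dictionary layout (province, then district, then a list of towns and communes), the single taxonomy entry, and the function names are assumptions made for illustration; the actual dictionary holds 63 provinces, 692 districts, and 11,101 towns and communes.

```python
import re

# Hypothetical fragment of the taxonomy: province -> district -> towns/communes.
TAXONOMY = {"Quảng Ngãi": {"Tư Nghĩa": ["Sông Vệ"]}}

LOC_RE = re.compile(r"<LOC>(.*?)</LOC>")


def extract_locations(tagged_text):
    """Return the phrases enclosed in <LOC>...</LOC> tags."""
    return [m.strip() for m in LOC_RE.findall(tagged_text)]


def normalize(name, taxonomy):
    """Return every path (node up to province) whose first element equals name.

    A name shared by several places yields several paths, so the caller can
    see the ambiguity instead of silently picking one.
    """
    paths = []
    for province, districts in taxonomy.items():
        if province == name:
            paths.append((province,))
        for district, towns in districts.items():
            if district == name:
                paths.append((district, province))
            for town in towns:
                if town == name:
                    paths.append((town, district, province))
    return paths


# NER output for the relevant part of Example 6 (tagging assumed as described).
tagged = "dịch cúm A H5N1 đã bùng phát tại thị trấn <LOC>Sông Vệ</LOC>"
complete = [normalize(loc, TAXONOMY) for loc in extract_locations(tagged)]
```

For this input the single location "Sông Vệ" is expanded to the path town, district, province, which is the complete location needed for plotting on a map.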


4.

EXPERIMENTS AND RESULTS

4.1

Data Preparation

Our data is retrieved from the "Báo mới" news website because "Báo mới" automatically crawls a large number of news articles per day from most of the well-known Vietnamese websites; hence, it is a good data source. After crawling, we had a dataset (denoted as the raw dataset) of 3,842,137 news articles. The elements of a news article (after the pre-processing step) are shown in Table 3.

Table 3: News article's elements

Element         Description
Title           The title of the article.
Abstract        The short paragraph that summarizes the article's content.
Published time  Time when the news is published. It supports the time extraction process.
Link            The URL of the article.
Content         The content of the article.

After crawling the data, we used Pattern 1 (2) and Pattern 2 (3) to filter it and got a set of 1,668 disease-related news articles. We denote this set of 1,668 articles as the Filtered dataset for later reference.
In our study, experiments are conducted on two important components, the Event Detector and the Event Extractor, which are described in detail in Section 4.2 and Section 4.3.

Table 4: The error rate of the data filter module

Incorrect articles  Total  Error Rate (%)
175                 486    36

4.2

Event Detection Evaluation

As mentioned above, the Event Detector has two modules, the data filter and the classifier. Therefore, we evaluate the performance of this component based on these two modules.

4.2.1

Data Filter Evaluation

The data filter is the first module in the event detection; it filters articles from the data crawler component. As mentioned above, this module uses Pattern 1 (2) and Pattern 2 (3) to filter articles, so its performance depends on the coverage of the rules of the two patterns. Ideally, we would evaluate the precision of Pattern 1 and Pattern 2 on the whole dataset (3,842,137 news articles), but this approach is very costly because the articles have to be labeled manually.
To evaluate the performance of this module, we randomly selected 486 articles from raw dataset to manually check the error rate. The error rate was calculated using Formula (5), and the results are shown in Table 4. The results show that the error rate is high (i.e., the accuracy is low) because the filter kept all articles related to diseases, a large number of which did not present disease events (this issue is discussed in detail in Section 4.4). We accept this in order to gain high recall; the overall performance is improved by the subsequent phases.

\[ \mathrm{ErrorRate} = \frac{\#\mathit{incorrect}}{\mathit{total}} \tag{5} \]

where:
• #incorrect is the number of articles which are not related to disease.
• total is the total number of articles.

4.2.2

Classification Evaluation

We carried out two experiments to evaluate the performance of the classification, namely, Experiment a, which combines rules and machine learning, and Experiment b, which uses only machine learning. The measures used to evaluate the performance of this module are precision, recall, and F-score based on 10-fold cross validation.
In Experiment a, we randomly selected 686 articles from the Filtered dataset and tagged them as EVENT or NOT_EVENT. We denote this set as the Experiment a dataset. In Experiment b, we selected 50 more articles from the raw dataset and added them to the Experiment a dataset to form the Experiment b dataset.
After preparing the training datasets, we compare the performance of the two experiments. The comparison of the two classifiers is shown in Table 5, where the results of Experiment a are in the three columns on the left and the results of Experiment b are in the three columns on the right. The average F-scores indicate that the classifier in Experiment a is better than that in Experiment b by ≈2.36 percentage points. The difference between the two classifiers is not large because we added only 50 articles to the Experiment a dataset. We expect the difference to grow if more raw articles are added to the Experiment b dataset.

Table 5: The comparison of Experiment a and Experiment b

        Experiment a            Experiment b
Fold    P      R      F-1      P      R      F-1
1       80.56  87.88  84.06    72.22  76.47  74.29
2       72.13  75.86  73.95    73.97  79.41  76.59
3       81.90  84.31  83.09    80.00  83.81  81.86
4       79.73  84.29  81.95    72.92  78.36  75.54
5       73.94  81.88  77.71    75.14  78.98  77.01
6       69.95  73.34  71.60    70.89  76.65  73.66
7       73.58  75.73  74.64    71.76  75.20  73.44
8       71.33  80.24  75.52    70.00  75.51  72.65
9       72.37  76.92  74.58    67.27  80.57  73.32
10      75.26  77.15  76.19    69.37  73.48  71.36
Avg     75.07  79.76  77.33    72.35  77.84  74.97
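The reported averages can be checked with a few lines of arithmetic. The sketch below uses only the per-fold F-1 values of Table 5 and the counts of Table 4; the variable names are chosen for this illustration.

```python
# Per-fold F-1 scores (in percent) as reported in Table 5.
F1_A = [84.06, 73.95, 83.09, 81.95, 77.71, 71.60, 74.64, 75.52, 74.58, 76.19]
F1_B = [74.29, 76.59, 81.86, 75.54, 77.01, 73.66, 73.44, 72.65, 73.32, 71.36]

# Averages over the 10 folds.
avg_a = round(sum(F1_A) / len(F1_A), 2)
avg_b = round(sum(F1_B) / len(F1_B), 2)

# Gap between the two experiments, in percentage points.
gap = round(avg_a - avg_b, 2)

# Formula (5) applied to the counts in Table 4: 175 incorrect out of 486.
error_rate = 100 * 175 / 486
```

The fold averages come out at 77.33 and 74.97, a gap of 2.36 points, and the error rate is about 36%, in agreement with Table 5 and Table 4.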

4.3


Event Extraction Evaluation

Because an infectious disease event E is defined as a tuple that includes name, time, and place, as given in Formula (1), a correct event should contain all three elements. When the time of an event is not clearly mentioned in the text, we use the published date of the article as the time of the event. If a disease event lacks either a disease name or locations, it is considered to be a false event.
To evaluate the event extraction step, we carried out two experiments, namely, Experiment c (abbreviated as Expr c), which uses rules, and Experiment d (abbreviated as Expr d), which uses both rules and NER. The dataset used in both experiments consists of 152 news articles, which were selected from the set of articles returned by the event detector.
We use three measures Precision (P), Recall (R), and Fscore (F) to compare the performance of the two experiments. These measures are denoted by Formula (6), (7), and
(8) as following:
$$P = \frac{\#\text{correct}}{\#\text{correct} + \#\text{incorrect}} \tag{6}$$

where:
• #correct is the number of correct disease events
• #incorrect is the number of incorrect disease events

Table 6: The comparison of Experiment c and Experiment d

Name   | Correct | Incorrect | P (%) | R (%) | F (%)
Expr c | 127     | 25        | 83,55 | 92,02 | 87,58
Expr d | 136     | 16        | 89,47 | 94,44 | 91,89

$$R = \frac{\#\text{correct}}{\#\text{correct} + \#\text{not\_found}} \tag{7}$$

where:
• #correct is the number of correct disease events
• #not_found is the number of disease events that the model did not recognize

$$F = \frac{2 \times P \times R}{P + R} \tag{8}$$


Based on Formulas (6), (7), and (8), we compare the performance of Experiment c and Experiment d. The comparison is shown in Table 6, where the second row is the result of Experiment c and the third row is the result of Experiment d. In Experiment c, the F-score is ≈87,58%, while it is ≈91,89% in Experiment d. That is, the F-score of Experiment d is ≈4,31 percentage points higher than that of Experiment c (the corresponding gain in precision is ≈5,92 points). The cause of the difference between the two experiments is explained in the next section.
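As a quick check, Formulas (6) to (8) can be applied to the counts reported in Table 6. The sketch below is illustrative only. The function names are ours, and because the text does not give the #not_found counts, the recall values are taken directly from Table 6 instead of being recomputed.

```python
def precision(correct: int, incorrect: int) -> float:
    """Formula (6): share of extracted events that are correct."""
    return correct / (correct + incorrect)


def recall(correct: int, not_found: int) -> float:
    """Formula (7): share of true events that the model recognized."""
    return correct / (correct + not_found)


def f_score(p: float, r: float) -> float:
    """Formula (8): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Correct / incorrect counts from Table 6, expressed in percent.
p_c = 100 * precision(127, 25)   # Experiment c
p_d = 100 * precision(136, 16)   # Experiment d

# Recall as reported in Table 6 (the #not_found counts are not given).
r_c, r_d = 92.02, 94.44

f_c = f_score(p_c, r_c)
f_d = f_score(p_d, r_d)

print(f"Expr c: P={p_c:.2f} F={f_c:.2f}")
print(f"Expr d: P={p_d:.2f} F={f_d:.2f}")
print(f"Gain in P: {p_d - p_c:.2f}, gain in F: {f_d - f_c:.2f}")
```

The two gains differ because precision and F-score are separate measures: the gap of about 5,92 points is in precision, while the gap of about 4,31 points is in F-score.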

4.4 Error Analysis and Discussion

In the Event Detector component, the results in Table 4 suggest that there is confusion in the data filter module. To find the cause, we manually checked articles selected from the dataset used in Section 4.2.1. The analysis indicates that, in the error cases, some rules of Pattern 1 (2) and Pattern 2 (3) do not filter articles reliably. The reason is that several topics can share a verb. For instance, the verb phrase "tử vong" (die) may belong to either the disease topic or the treatment topic. If this verb phrase appears in an article, the data filter module considers the article to be related to a disease event even when it is in fact about treatment, as illustrated in Example 7.
Example 7:
Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong (The patient died 30 minutes after taking fever medication).
This sentence is captured by the Pattern 1 (2) rule "bệnh nhân # tử vong" (patient # died), but the cause of death is in fact related to the medication rather than to a disease.
Moreover, some rules of Pattern 2 (3), which combine a disease name with a verb phrase, confuse a disease event with a topic that is merely related to a disease, as shown in Example 8.
Example 8:
"Phát hiện chủng virus mới gây bệnh tay chân miệng" (A new strain of virus causing hand, foot, and mouth disease has been discovered).
The Pattern 2 (3) rule "tay chân miệng # phát hiện" (hand, foot, and mouth # detect) captures this sentence, but the sentence reports the discovery of a new virus strain rather than a disease event.
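The two false positives above can be reproduced with a very small matcher. This is a sketch under an explicit assumption: we read "#" as separating terms that must all occur in the sentence, in any order (Example 8 matches even though "phát hiện" precedes the disease name). The actual rule semantics in the system may be stricter, and `rule_matches` is a hypothetical name.

```python
import unicodedata


def _norm(text: str) -> str:
    # Normalize Unicode composition and case so Vietnamese diacritics compare equal.
    return unicodedata.normalize("NFC", text).lower()


def rule_matches(rule: str, sentence: str) -> bool:
    """Assumed semantics: every '#'-separated term occurs somewhere in the sentence."""
    s = _norm(sentence)
    terms = [_norm(t).strip() for t in rule.split("#")]
    return all(t in s for t in terms if t)


example_7 = "Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong"
example_8 = "Phát hiện chủng virus mới gây bệnh tay chân miệng"

# Both sentences satisfy a rule although neither reports a disease event.
print(rule_matches("bệnh nhân # tử vong", example_7))
print(rule_matches("tay chân miệng # phát hiện", example_8))
```

Because a purely lexical rule of this kind sees only the shared terms, it cannot tell a death caused by medication from a death caused by a disease, which is consistent with the confusion described above.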
For the Event Extractor component, the results in Table 6 indicate that the precision of Experiment d is ≈5,92% higher than that of Experiment c. At first, we were surprised by this result, because Experiment c uses a rule-based method to capture the information of an event, and rules (a knowledge-driven method) normally achieve high accuracy.
To find the source of the errors in event extraction, we manually checked the incorrectly processed articles in the two experiments (mentioned in Section 4.3). The results are shown in Table 7 and Table 8, respectively. The statistics in Table 7 and Table 8 indicate that the errors in both experiments originated from location extraction and, in some cases, from disease extraction. In Experiment c, we found that the rules used to extract locations did not cover all cases. In a few cases, when the location information is abbreviated, the rules cannot recognize it, as illustrated in Example 9.
Example 9:
“Phát hiện một trường hợp bệnh nhân nhiễm cúm A H5N1 tại P.7, Q.8, TP. HCM.” (A patient infected with A H5N1 flu was detected in Ward 7, District 8, Ho Chi Minh City).

In this example, Ward 7, District 8, and Ho Chi Minh City are abbreviated (P.7, Q.8, TP. HCM); therefore, the rules cannot recognize the location information.
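One possible way to handle such abbreviations, which we have not evaluated, is to expand them before the location rules are applied. The sketch below is hypothetical: the three patterns cover only the abbreviations seen in Example 9, and the name `expand_abbreviations` is ours.

```python
import re

# Hypothetical expansion table covering only the abbreviations in Example 9.
ABBREVIATIONS = [
    (re.compile(r"\bTP\.\s*HCM\b"), "thành phố Hồ Chí Minh"),
    (re.compile(r"\bP\.\s*(\d+)"), r"phường \1"),   # ward number
    (re.compile(r"\bQ\.\s*(\d+)"), r"quận \1"),     # district number
]


def expand_abbreviations(text: str) -> str:
    """Rewrite abbreviated administrative units into their full forms."""
    for pattern, replacement in ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return text


print(expand_abbreviations("Phát hiện một trường hợp bệnh nhân nhiễm cúm A H5N1 tại P.7, Q.8, TP. HCM."))
```

A real table would need many more entries (named districts, provinces, spacing variants), so this should be read as an illustration of the preprocessing idea only.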
In Experiment d, the main factor that reduced the precision of location extraction is the performance of the NER tool. In a few cases, the tool did not detect locations exactly because of the abbreviation of places in articles (as with the rule-based method). In some other cases, it misrecognized a location as an organization, as shown in Example 10.
Example 10:
“Ngày 12/03/2012, dịch tiêu chảy cấp đã bùng phát tại Hà Nội, Hải Phòng, Quảng Ninh, Bến Tre, Cần Thơ.” (On March 12th, 2012, an acute diarrhea epidemic broke out in Hanoi, Hai Phong, Quang Ninh, Ben Tre, and Can Tho).
In this example, "Hanoi", "Hai Phong", "Quang Ninh", "Ben Tre", and "Can Tho" are recognized as organizations (tagged with <ORG> and </ORG> pairs), which are ignored during processing.
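The effect of such a mislabel can be illustrated as follows. The sketch assumes that the NER output is inline-tagged text and that locations carry a `<LOC>` tag. Only the `<ORG>` tag is mentioned above, so the `<LOC>` tag name, the sample string, and the function names are assumptions.

```python
import re
from typing import List, Tuple

# Matches <LOC>...</LOC> and <ORG>...</ORG> spans (the LOC tag name is assumed).
TAG_RE = re.compile(r"<(LOC|ORG)>(.*?)</\1>")


def tagged_entities(tagged: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(tagged)]


def locations(tagged: str) -> List[str]:
    """Keep only entities tagged as locations; ORG spans are dropped."""
    return [text for tag, text in tagged_entities(tagged) if tag == "LOC"]


# Hypothetical NER output in which one place name was mislabeled as ORG.
sample = "dịch tiêu chảy cấp đã bùng phát tại <ORG>Hà Nội</ORG>, <LOC>Bến Tre</LOC>"
print(tagged_entities(sample))
print(locations(sample))
```

Since only location-tagged spans are kept, every place that the tool labels as an organization is lost to the event tuple.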
In both Experiment c and Experiment d, some extracted disease names were incorrect because the correct names are not in the disease dictionary. Moreover, the disease dictionary contains some names that are also symptoms of other diseases, which confuses the disease extraction module. For instance, in Table 7, the disease name A/H1N1 flu in the article with Doc ID 89 is detected as pneumonia, while pneumonia is a symptom of A/H1N1 flu.
In addition, several factors adversely affect event extraction. Firstly, typos in location names reduce the performance of location extraction. For instance, "Đắk Lắk" is sometimes written as "Đắc Lắc", but "Đắc Lắc" does not appear in the location dictionary, so the location information can be missed. Secondly, if locations are described vaguely, such as “các huyện phía Tây của tỉnh Bến Tre” (the western districts of Ben Tre province), the NER utility cannot recognize them.
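The spelling problem can be made concrete with an exact-match dictionary lookup. In the sketch below the dictionary contents, the variant table, and the function name are all hypothetical. The variant table is shown only as one conceivable remedy, not as part of our system.

```python
from typing import Optional

# Toy location dictionary containing only the standard spelling.
LOCATION_DICTIONARY = {"Đắk Lắk", "Bến Tre"}

# Hypothetical table mapping known spelling variants to dictionary entries.
SPELLING_VARIANTS = {"Đắc Lắc": "Đắk Lắk"}


def lookup_location(name: str, use_variants: bool = False) -> Optional[str]:
    """Return the dictionary entry for a name, or None if it is not found."""
    if use_variants:
        name = SPELLING_VARIANTS.get(name, name)
    return name if name in LOCATION_DICTIONARY else None


print(lookup_location("Đắc Lắc"))                     # exact match fails
print(lookup_location("Đắc Lắc", use_variants=True))  # variant is mapped first
```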
Table 7: The errors in Experiment c (15 of 25 errors)

No. | Doc ID | Error Detail: Correct Information | Error Detail: Extracted Information
1 | 4 | Congo | NULL
2 | 7 | Kon-Plong Ly District, Pray Veng Province | NULL
3 | 13 | Ward 6, District 8, Ward 14, District 5, District 8, Ward 7, Binh Thanh District, Hoc Mon District | Ho Chi Minh City
4 | 17 | Zone 1, Ngo Dong town, Giao Thuy District, Nam Đinh | Nam Đinh
5 | 24 | Hand, foot and mouth | Dengue
6 | 26 | Ward 8, District 5, Ho Chi Minh City | Long Bien District
7 | 32 | Ward 7, District 8, HCM City | NULL
8 | 64 | Typhus | Dengue
9 | 65 | group 3, Tran Hung Dao Ward, Kon Tum City | Da Nang
10 | 79 | Hanoi | NULL
11 | 89 | A/H1N1 flu | Pneumonia (Symptom)
12 | 92 | A/H1N1 flu | Tuberculosis
13 | 96 | Ea T’ling towns and communes: Nam Dong, Tam Thang, D’Dak Rong | NULL
14 | 105 | Cholera | Acute diarrhea
15 | 108 | Tam Quan commune, Tam Dao province, Quan Noi, Quan Ngoai, Lang Chanh village, Lang Mau village, and Nhan Ly | Tam Quan commune

Finally, another important cause is geo-ambiguity, which reduces the precision of the event extraction component. One proper name can refer to several places; if a disease news article does not specify the place clearly, the location information can be confused. Geo-ambiguity is illustrated in Example 11.
Example 11:
“Ngày 05/10/2012, Sở Y tế Quảng Ninh thông báo đã phát hiện vi khuẩn tả tại thị trấn Đông Hải” (On October 5th, 2012, the Quang Ninh Department of Health announced the detection of cholera bacteria in Dong Hai town).
In this example, Dong Hai town is a location in both Tra Vinh and Quang Ninh provinces, but the article only mentions Dong Hai town, so the module failed to decide whether the disease outbreak was in Quang Ninh or Tra Vinh.
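The ambiguity in Example 11 can be sketched with a small gazetteer that maps a place name to its candidate provinces. The gazetteer entry comes from the example, but the resolution heuristic (prefer a candidate province that is named in the same sentence) is purely hypothetical and is not part of our system. The province of the announcing body need not be the outbreak location.

```python
from typing import Dict, List, Optional

# Candidate provinces for an ambiguous place name (taken from Example 11).
GAZETTEER: Dict[str, List[str]] = {"Đông Hải": ["Trà Vinh", "Quảng Ninh"]}


def candidate_provinces(place: str) -> List[str]:
    return GAZETTEER.get(place, [])


def resolve_province(place: str, sentence: str) -> Optional[str]:
    """Hypothetical heuristic: pick the only candidate named in the sentence."""
    candidates = candidate_provinces(place)
    if len(candidates) == 1:
        return candidates[0]
    mentioned = [p for p in candidates if p in sentence]
    return mentioned[0] if len(mentioned) == 1 else None


sentence = ("Ngày 05/10/2012, Sở Y tế Quảng Ninh thông báo đã phát hiện "
            "vi khuẩn tả tại thị trấn Đông Hải")
print(candidate_provinces("Đông Hải"))
print(resolve_province("Đông Hải", sentence))
```

Without any contextual cue, `resolve_province` returns `None`, which corresponds to the undecided case described above.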
Another source of errors is the incomplete recognition of locations, i.e., only some parts of a location were detected, as shown in row 4 of Table 7 (where only Nam Dinh province was detected) and row 11 of Table 8 (where only Binh Duong province was recognized).
The last source of errors is the case in which a location mentioned in the text is not the place of the outbreak. This misleads the location module, which then extracts incorrect information, as in row 9 of Table 7 and row 8 of Table 8.

5. CONCLUSION

In this paper, we introduced a method that combines semantic rules and machine learning to extract disease events from Vietnamese webpages. The experimental results show that our method is suitable for extracting disease events from Vietnamese text. Furthermore, we briefly described our system's process, emphasizing its two key components: the Event Detector and the Event Extractor. We plan to integrate the event database into the Vn-Loc system, where users can follow several event types: FIRE, CRIMINAL, and TRANSPORT ACCIDENT.
However, our method needs some improvements to enhance its performance in the future. Firstly, the coverage of the semantic rules and the performance of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, the precision of event extraction can be increased by improving the performance of the NER tool. In addition, geo-ambiguity and the confusion between diseases and symptoms should be resolved. Finally, relations between disease events should be considered to enhance the quality of the monitoring system.

Table 8: The errors in Experiment d

No. | Doc ID | Error Detail: Correct Information | Error Detail: Extracted Information
1 | 16 | Thanh Long village, Phuoc My commune, Quy Nhon City | Binh Dinh
2 | 17 | Giao Thuy district, Nam Dinh, A (H5N1) flu | Nam Đinh, Flu
3 | 21 | Me So, Van Giang, Hung Yen | Hung Yen
4 | 23 | Ba Ria - Vung Tau | NULL
5 | 25 | 4 village, Hoa An commune, Krong Pac district, Dak Lak | Hoa An Commune, Chiem Hoa district, Tuyen Quang
6 | 26 | Ward 8, District 5, Ho Chi Minh City (P.8, Q.5, TP. HCM) | NULL
7 | 32 | Ward 7, District 8, HCM City (P.7, Q.8, TP. HCM) | NULL
8 | 39 | Mo Cay Nam, Mo Cay Bac, Giong Trom, Thanh Phu, Chau Thanh Ba Tri, Cho Lach | Ben Tre
9 | 40 | Ward 6, District 8 (P.6, Q.8) | TP. HCM
10 | 45 | Hung Yen, Yen Dinh, Thanh Hoa, Vinh Phuc, Ba Dinh, Hanoi | Hanoi, Vinh Phuc
11 | 46 | Thuan An, Di An, Ben Cat District, Thu Dau Mot Town, Binh Duong | Binh Duong
12 | 47 | Kim Long and Huong Long Ward, Hue City | NULL
13 | 69 | Tan An Hoi Village, Cu Chi District, HCM City | NULL
14 | 84 | Thanh Binh Ward, Hai Chau district, Da Nang city, Dak Lak | Ward Thanh Binh, Ninh Binh City, Ninh Binh City, Da Nang, Hai Chau District
15 | 106 | District 7, Tan Binh District | 1 District
16 | 109 | Hoang Mai District, Hai Ba Trung, Thanh Xuan, Hoan Kiem District, Thanh Tri, Dong Da, Quang Ninh, Bac Giang, Nam Dinh, Thai Binh, Ha Nam, Hung Yen | Hanoi