DSpace at VNU: VnLoc: A real-time news event extraction framework for Vietnamese

2012 Fourth International Conference on Knowledge and Systems Engineering

VnLoc: A Real–time News Event Extraction Framework for Vietnamese
Mai-Vu Tran∗, Minh-Hoang Nguyen∗, Minh-Tien Nguyen∗∗, Sy-Quan Nguyen∗, Xuan-Hieu Phan∗

∗ KTLab, Faculty of Information Technology, College of Technology
Vietnam National University, Hanoi (VNU)
Hanoi, Vietnam

∗∗ Faculty of Information Technology,
Hung Yen University of Technology and Education, Hungyen, Vietnam

978-0-7695-4760-2/12 $26.00 © 2012 IEEE
DOI 10.1109/KSE.2012.34

Abstract

Event extraction is a complex and interesting topic in Information Extraction that covers methods for extracting events from free text or web data. The results of event extraction systems can be used in several fields, such as risk analysis systems, online monitoring systems and decision support tools [4]. In this paper, we introduce a method that combines lexico–semantic rules and machine learning to extract events from Vietnamese news. We also describe VnLoc, an online event monitoring system built on this method. In the experiment phase, we evaluate the method in terms of precision, recall and F1 measure. At the time of the experiment, we investigated three types of event: FIRE, CRIME and TRANSPORT ACCIDENT.

1. Introduction

The information explosion and the development of information and communication technology make it easy for people to reach information. As a result, information is becoming richer and more diverse. Because it comes from many different sources (newspapers, blogs, social networks, ...), it is also increasingly chaotic. Extracting the information that readers are interested in from the daily news is therefore necessary. One of the biggest problems we face is how to get that information as quickly as possible, which is a real challenge in Web Mining. Event extraction can address this problem because news usually contains event content. Through event extraction, readers can learn the details of a news item without reading its entire content. In addition, the results of event extraction can be used in online monitoring systems, where users can take in information easily.

Recently, event extraction has received growing attention from researchers in Natural Language Processing and Data Mining around the world. In 1987, event extraction became a main topic of the Message Understanding Conference (MUC) [5]. In this conference, an event was defined as follows: an event must have an actor, a time, a place and an impact on the surrounding environment. In addition, the Automatic Content Extraction (ACE) program defined an event as an activity created by participants and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel and Justice. According to Allan's definition, an event has four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic) and tense (Past, Present, Future, Unspecified) [1].

Based on our investigation and analysis of event extraction, we propose an event extraction method for the Vietnamese language and build an online event monitoring system named VnLoc. The proposed method combines lexico–semantic rules and machine learning. The system gathers its data from news delivered through RSS feeds. We then apply the proposed method to classify each news item as EVENT or NON-EVENT based on its title. After that, we extract event attributes from the items classified as events, and the final result is visualized on an online map. In the experiment phase, we evaluated our approach by cross-validation in terms of precision (≈92.85%), recall (≈90.39%) and F1 measure (≈91.61%).

Section 2 reviews related research, Section 3 describes our method and the VnLoc online event monitoring system in more detail, and Section 4 presents our experiments and evaluates the results on real data. The last section is the conclusion.

Table 1: Features Description
Feature                  χ2 weight     Freq    Semantic label
chém (cutting)           0.70329136    240     CRIME
giết (kill)              0.6890592     530     CRIME
cháy (fire)              0.5872597     201     FIRE
gây tai nạn (crashed)    0.5106312     374     TRANSPORT ACCIDENT
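
The χ2 weights in Table 1 come from scoring candidate n-gram features against the EVENT/NON-EVENT split. The paper gives neither its exact formula nor its tokenizer, so the following is only a sketch under assumptions: whitespace tokenization, the standard χ2 statistic for a 2×2 contingency table, and four toy titles invented for illustration. The function names are hypothetical, and the raw statistics it produces are not expected to reproduce the weights in Table 1.

```python
from collections import Counter


def ngrams(title, max_n=3):
    """Return all uni-, bi- and tri-grams of a whitespace-tokenized title."""
    tokens = title.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: EVENT titles containing the feature, b: NON-EVENT titles containing it,
    c: EVENT titles without it, d: NON-EVENT titles without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score_features(labelled_titles):
    """Map each n-gram to its chi-square score over (title, is_event) pairs."""
    n_event = sum(1 for _, is_event in labelled_titles if is_event)
    n_other = len(labelled_titles) - n_event
    in_event, in_other = Counter(), Counter()
    for title, is_event in labelled_titles:
        target = in_event if is_event else in_other
        target.update(set(ngrams(title)))  # count each n-gram once per title
    scores = {}
    for gram in set(in_event) | set(in_other):
        a, b = in_event[gram], in_other[gram]
        scores[gram] = chi_square(a, b, n_event - a, n_other - b)
    return scores


# Toy data, invented for illustration only.
titles = [
    ("cháy lớn tại chợ", True),
    ("cháy rụi kho hàng", True),
    ("giá vàng tăng mạnh", False),
    ("đội tuyển thắng lớn", False),
]
scores = score_features(titles)
top = max(scores, key=scores.get)
```

On this toy data the uni-gram cháy appears in both EVENT titles and in no NON-EVENT title, so it receives the highest score, while lớn, which occurs once in each class, scores zero.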

2. Related Work
In [7], Ralph Grishman et al. used Maximum Entropy models to detect events. They used three classifiers for the individual tasks: an argument classifier, a role classifier and an event classifier. Event coreference is resolved by a further Maximum Entropy classifier that uses features such as the event type, the event subtype, the anaphor anchor and the distance between anaphor and anchor. In another study, Heng Ji and Ralph Grishman used Maximum Entropy to identify events of a particular type [6]. Theirs is a sentence-level classifier that processes each sentence in a document and attempts to determine its event type.
Lexico–semantic patterns can be used for various purposes in many domains. Cohen, Verspoor et al. [3] applied semantic rules as patterns to extract events in the biological domain. They divided biological events into six types: binding, gene expression, localization, phosphorylation, protein catabolism and transcription. Biological events are extracted through patterns, each of which is a set of semantic words. In another domain, Jethro Borsje et al. [2] proposed using lexico–semantic patterns to detect financial events from RSS news feeds. These patterns are organized in a financial ontology expressed in OWL. Each pattern is a triple consisting of a subject, a relation and an optional object.
In addition, several systems extract events from online news in other domains. Collier et al. built the BioCaster system, in which several event types can be followed around the world. The HealthMap system, built by Freifeld and Brownstein, lets users monitor disease types worldwide. The Frontex system was developed by Atkinson, Piskoski et al. to support monitoring for a European agency.

3. Implemented System
3.1. System Architecture

VnLoc is an event monitoring system that is horizontally scalable and distributed. Its architecture is illustrated in Figure 1. VnLoc consists of six components: a scalable data repository, a news crawler, an event detector, an event extractor, a plugin engine and a web-based visualizer.

Figure 1: VnLoc's Architecture

We organize the data repository into three parts: a news database, an event database and a data corpus. The first stores the details of the news that is gathered. The second stores the event information produced by the event extractor. Both are kept in MongoDB to achieve a key requirement: high scalability. The last comprises several corpora that support both the machine learning process and the extraction process. Next, the news crawler fetches news through RSS feeds supplied by many websites, such as VnExpress1, VietNamNet2 and DanTri3. Thanks to the structure of the RSS format, useful information can be extracted from each feed with an XML parser and saved in the news database of the data repository. The visualizer is a map on the web interface that shows events. Data is pulled from the event database, pushed to the Google Map API with some modifications and then displayed. The two main components, the event detector and the event extractor, are explained below to make the VnLoc system clearer.

1 www.vnexpress.net
2 www.vietnamnet.vn
3 www.dantri.com

3.2. Event Detector

When a news item is gathered, it is passed to the event detector, which decides whether the item contains an event. We treat this as binary classification and use the Maximum Entropy method. We examined the domain carefully and found that most news titles express their content clearly, so our problem becomes sentence-level classification. First, a set of features is chosen by χ2 weight on offline data gathered beforehand. At the same time, the N-gram method is used to select phrases as features. In this paper, we use uni-grams, bi-grams and tri-grams as the three phrase types. Moreover, each feature is tagged with a semantic label to enrich its meaning. Table 1 shows some examples: chém (cutting) and giết (kill) are labelled CRIME, cháy (fire) is labelled FIRE and gây tai nạn (crashed) is labelled TRANSPORT ACCIDENT. After that, the Maximum Entropy classifier divides the set of titles into two categories: EVENT and NON-EVENT. This step is a precondition for the event extractor in the next phase.

3.3. Event Extractor

In the second main stage, the event and its attributes, such as time, place and participants, are extracted from news that the event detector predicts to contain an event. Our approach is transparent and knowledge driven. A set of rules is generated and applied to the news items passed on from the previous phase. In this paper, we use 7 types of rules for this purpose.
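
To make the extractor's output concrete, the record it produces for one news item can be pictured as a small container holding the event type plus the time, place and participants that the rules fill in. The class below is a hypothetical sketch for illustration; the field names and the completeness check are assumptions, not the schema VnLoc stores in its event database.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class EventRecord:
    """Hypothetical container for one extracted event (not the paper's schema)."""
    event_type: str                      # FIRE, CRIME or TRANSPORT ACCIDENT
    title: str                           # news title the event was found in
    when: Optional[date] = None          # filled in by the time rule
    where: Optional[str] = None          # filled in by the location rule
    participants: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True once both time and place are known (assumed map requirement)."""
        return self.when is not None and self.where is not None


# Invented example: a title flagged as EVENT, with only the place known so far.
record = EventRecord(event_type="FIRE", title="Cháy lớn tại chợ")
record.where = "Hà Nội"
```

Each later rule then only has to set one field, which keeps the rule types independent of one another.
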
To extract the event itself, rules 1, 2, 3 and 4 are applied:

• FIRE
<PRE> <FIRE> <POST>    (1)

• CRIME
<PRE> <CRIME> <POST>    (2)

• TRANSPORT ACCIDENT
<PRE> <ACCIDENT> <POST>    (3)
<PRE> <DAMAGE> <POST>    (4)

CRIME := ẩu đả (brawl)
         băng cướp (bandits)
         bị đâm chết (stabbed)
         . . . (90 rules)

ACCIDENT := đâm xe (car crash)
            cán chết (crashed)
            lật tàu (boat capsized)
            . . . (27 rules)

DAMAGE := thiệt mạng (die)
          chết thảm (pitiful death)
          chấn thương sọ não (brain injury)
          . . . (22 rules)
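
Rules 1 to 4 amount to finding a trigger keyword in the text and treating whatever surrounds it as PRE and POST. The sketch below assumes plain substring matching on lower-cased text and uses only the example keywords that this section lists for FIRE, CRIME, ACCIDENT and DAMAGE; the paper's full keyword lists are longer, and the function name, the matching strategy and the sample title are assumptions made for illustration.

```python
# Example trigger keywords from this section; the paper's full lists are longer.
KEYWORDS = {
    "FIRE": ["vụ cháy", "bùng cháy", "cháy rụi"],
    "CRIME": ["ẩu đả", "băng cướp", "bị đâm chết"],
    "ACCIDENT": ["đâm xe", "cán chết", "lật tàu"],
    "DAMAGE": ["thiệt mạng", "chết thảm", "chấn thương sọ não"],
}


def match_event(title):
    """Return (label, pre, keyword, post) for the earliest trigger, or None."""
    text = title.lower()
    best = None
    for label, words in KEYWORDS.items():
        for word in words:
            pos = text.find(word)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, label, word)
    if best is None:
        return None
    pos, label, word = best
    pre = text[:pos].strip()                # text before the keyword (PRE)
    post = text[pos + len(word):].strip()   # text after the keyword (POST)
    return label, pre, word, post


# Invented title for illustration.
result = match_event("Xe tải đâm xe máy, hai người thiệt mạng")
```

Here the ACCIDENT keyword đâm xe is the earliest trigger, so it is reported with "xe tải" as PRE and the remainder of the title as POST; a title without any trigger returns None.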

Here PRE and POST are phrases or words surrounding the keywords.

FIRE := vụ cháy (fire)
        bùng cháy (burning)
        cháy rụi (burned)
        . . . (18 rules)

To extract the time at which the event happened, we use two methods: direct and indirect. The direct method applies when the time is stated completely in the content of the news; we use regular expressions for this case. The indirect method applies when the time is not concrete. For instance, "Hôm nay, hai vụ tai nạn giao thông đã xảy ra trên đường Khuất Duy Tiến." ("Today, two transport accidents happened on Khuat Duy Tien Street."). In this example, Hôm nay (Today) is a relative adverb that does not state exactly when the event occurred. We solve this problem by matching against a dictionary that pairs each relative time word with a relative value, its offset in days (BIAS). Rule 5 then yields the date.
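
Both time methods can be sketched briefly. The direct case below assumes dates written as dd/mm/yyyy, which is an assumption rather than the paper's actual regular expression. The indirect case uses the relative words and offsets of the RELATIVE_TIME set given in this section and adds the offset to the publication date, as rule 5 does. Function and variable names are hypothetical.

```python
import re
from datetime import date, timedelta

# Relative words and day offsets (BIAS) as listed in the RELATIVE_TIME set.
RELATIVE_TIME = {
    "hôm nay": 0,          # today
    "rạng sáng nay": 0,    # this morning
    "hôm qua": -1,         # yesterday
    "hai ngày trước": -2,  # two days ago
}

# Assumed explicit-date format dd/mm/yyyy; the paper does not give its pattern.
EXPLICIT_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def extract_date(text, publish_date):
    """Direct method first, then the relative-word lookup of rule 5."""
    found = EXPLICIT_DATE.search(text)
    if found:
        day, month, year = (int(part) for part in found.groups())
        return date(year, month, day)  # raises ValueError for impossible dates
    lowered = text.lower()
    for word, bias in RELATIVE_TIME.items():
        if word in lowered:
            return publish_date + timedelta(days=bias)
    return None  # no time expression recognized


published = date(2012, 4, 13)
sentence = "Hôm nay, hai vụ tai nạn giao thông đã xảy ra trên đường Khuất Duy Tiến."
```

For the example sentence the relative word hôm nay has offset 0, so the extracted date equals the publication date.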

RELATIVE_TIME = (RELATIVE_WORD, BIAS)
              = {(hôm nay (today), 0),
                 (rạng sáng nay (this morning), 0),
                 (hôm qua (yesterday), −1),
                 (hai ngày trước (two days ago), −2)}

<DATE> = <PUBLISH_TIME> + <BIAS>    (5)

Next, we extract the location where the event occurred. As stated in rule 6, we use two constituents to find the proper location. The first is LOCPREP, a set of prepositions that precede a place name; the second is LOCPREFIX, a set of prefixes that follow those prepositions and come directly before the place name. We then apply rule 6 to perform this task.

LOCPREP = {ở (in), tại (at), trên (on), gần (nearby, near), trong (into)}
LOCPREFIX = {thành phố (city), tỉnh (province), quận (district), thị xã (town), xã (village), phố (street)}
LOCATION = {loc_i | loc_i ∈ location dictionary}

<LOCPREP> <LOCPREFIX> <LOCATION>    (6)

Finally, the participants are considered. As with the other event attributes, we use a rule, shown as rule 7.

<PRE> <PERSON>    (7)

PRE := ông (Mr)
       bà (Mrs/Ms)
       gia đình (family)
       nghi can (suspect)
       bị cáo (defendant)

4. Experiment and Result

Our experiments were conducted on a data set of 18,400 titles extracted from 3,842,137 news titles of BAOMOI gathered through RSS. The components of a news item are listed in Table 2. We evaluated event detection by evaluating the event classification process with 10-fold cross-validation. The data set is split into 10 parts at a ratio of 9:1: in each fold, 9 parts are used as training data and 1 part as testing data. The classification results are shown in Table 3 and Figure 2.

Table 2: News' elements
Element         Description
Title           News' headline.
Abstract        A short paragraph that summarizes the news content.
Publish time    Time when the news is published. May support the time extraction process.
Link            Link to the original news.

After the system went online (Figure 3), we evaluated the results of the event extraction process manually by checking each event shown on the system from 13/04/2012 to 22/04/2012. The precision of event extraction is reported in Table 4. For the articles detected as containing events, the statistics in Table 4 indicate that an event extraction strategy combining lexico–semantic rules and machine learning is appropriate for Vietnamese news. In some cases, event extraction fails because of ambiguous place names: many locations have similar names, and the article does not state the position in full.

Table 3: Result of classification (all values in %)
           Precision    Recall    F1
Fold 1     92.70        89.23     90.93
Fold 2     93.08        91.39     92.23
Fold 3     93.32        91.54     92.42
Fold 4     93.32        91.54     92.42
Fold 5     93.68        91.78     92.72
Fold 6     93.50        91.60     92.54
Fold 7     92.95        90.81     91.87
Fold 8     92.39        89.01     90.67
Fold 9     91.81        88.65     90.20
Fold 10    91.68        88.51     90.07
Average    92.85        90.39     91.61
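
The evaluation protocol behind Table 3 can be illustrated as follows. This is a sketch, not the authors' code: it partitions item indices into 10 contiguous folds without shuffling (an assumption) and computes F1 as the harmonic mean of precision and recall, which is the standard definition.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


folds = list(k_fold_indices(18400, k=10))
fold1_f1 = f1_score(92.70, 89.23)  # precision and recall of Fold 1 in Table 3
```

With 18,400 titles, each fold holds 1,840 test titles and 16,560 training titles, and the Fold 1 precision and recall from Table 3 give an F1 of about 90.93, matching the table.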


Figure 3: VnLoc at


Table 4: Event extraction result (in quantity)
Date          Extracted    Correct    Precision (%)
13/04/2012    47           43         91.49
14/04/2012    61           58         95.08
15/04/2012    65           59         90.77
16/04/2012    59           54         91.52
17/04/2012    48           43         89.58
18/04/2012    55           49         89.09
19/04/2012    71           64         90.14
20/04/2012    56           50         89.28
21/04/2012    60           54         90.00
22/04/2012    63           57         90.47
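
Each precision value in Table 4 is the number of correct events divided by the number of extracted events. The short sketch below recomputes that column from the counts. The pooled figure over all ten days is a derived value, not one reported in the paper, and recomputed daily values may differ from the reported ones in the last digit.

```python
# (date, extracted, correct) rows copied from Table 4.
TABLE_4 = [
    ("13/04/2012", 47, 43), ("14/04/2012", 61, 58), ("15/04/2012", 65, 59),
    ("16/04/2012", 59, 54), ("17/04/2012", 48, 43), ("18/04/2012", 55, 49),
    ("19/04/2012", 71, 64), ("20/04/2012", 56, 50), ("21/04/2012", 60, 54),
    ("22/04/2012", 63, 57),
]


def precision_percent(correct, extracted):
    """Precision as a percentage; 0.0 when nothing was extracted."""
    return 100.0 * correct / extracted if extracted else 0.0


daily = {day: precision_percent(ok, total) for day, total, ok in TABLE_4}
total_extracted = sum(total for _, total, _ in TABLE_4)
total_correct = sum(ok for _, _, ok in TABLE_4)
pooled = precision_percent(total_correct, total_extracted)
```

Over the ten days, 531 of 585 extracted events were judged correct, a pooled precision of about 90.77%.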

Figure 2: Result of classification

5. Conclusion

In this paper, we have presented a method that combines lexico–semantic rules and machine learning (Maximum Entropy) for event extraction on Vietnamese data, and we have described the VnLoc system.
The experimental results demonstrate that combining lexico–semantic rules and machine learning achieves good results on Vietnamese data. The Maximum Entropy method is used for binary classification and keeps only the events that match the features in the training data set. Lexico–semantic rules are applied to extract useful information about each event. Finally, the event's attributes are extracted by rules and visualized on an online map.
Furthermore, we have described the system architecture in detail. In particular, we concentrated on describing the Event Detector component, which uses the proposed method to recognize an event, and the Event Extractor, which uses lexico–semantic rules to extract the event's attributes.
Although we have achieved good results, the system needs some improvements in the future. First, the precision of the Maximum Entropy classifier must be improved by adding useful information. Second, we aim to expand to other areas such as disasters (disease, earthquake, tsunami), culture and finance. To that end, an ontology is being built so that plugin modules for these purposes can be integrated more easily.

6. Acknowledgement

This research work was partly supported by The
National Major Research Program KC.01/11-15 (code
KC.01.TN04/11-15) under project "Analyzing opinion’s
trend based on social network and its application in tourism
and technology products".

References

[1] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development
in information retrieval, SIGIR ’98, pages 37–45, New York,
NY, USA, 1998. ACM.
[2] J. Borsje, F. Hogenboom, and F. Frasincar. Semi–automatic
financial events discovery based on lexico–semantic patterns.
Int. J. Web Eng. Technol., 6(2):115–140, Jan. 2010.
[3] K. B. Cohen, K. Verspoor, H. L. Johnson, C. Roeder, P. V.
Ogren, W. A. Baumgartner, Jr., E. White, H. Tipney, and
L. Hunter. High-precision biological event extraction with
a concept recognizer. In Proceedings of the Workshop on
Current Trends in Biomedical Natural Language Processing:
Shared Task, BioNLP ’09, pages 50–58, Stroudsburg, PA,
USA, 2009. Association for Computational Linguistics.
[4] F. Hogenboom, F. Frasincar, U. Kaymak, and F. de Jong. An
overview of event extraction from text. Workshop on Detection,
Representation, and Exploitation of Events in the Semantic
Web, 2011.
[5] R. Grishman and B. Sundheim. Message understanding
conference-6: a brief history. In Proceedings of the 16th conference on Computational linguistics - Volume 1, COLING
’96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
[6] H. Ji and R. Grishman. Refining event extraction through
cross-document inference. In Proc, 2008.
[7] R. Grishman, D. Westbrook, and A. Meyers. NYU's English ACE
2005 system description. ACE Program, 2005.
