
2011 International Conference on Asian Language Processing

An Integrated Approach Using Conditional Random Fields
for Named Entity Recognition and Person Property Extraction in Vietnamese Text

Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha
KTLab, College of Technology
Vietnam National University, Hanoi (VNU)
Hanoi, Viet Nam
E-mail: {lhquynh, vutranmai, nambn_52, cuongpn_52, hqthuy}@gmail.com
The remainder of this paper is organized as follows. Section II presents related work. Section III describes the CRF machine learning method and its application to our problem. Section IV explains our proposed model and the rich feature set designed from various kinds of knowledge resources. Section V presents experimental results and discussion. Finally, Section VI concludes the paper.

Abstract—Personal names are among the most frequently searched items in web search engines, and a person entity is always associated with numerous properties. In this paper, we propose an integrated model that simultaneously recognizes person entities and extracts the relevant values of a pre-defined set of properties related to each person in Vietnamese text. We also design a rich feature set using various kinds of knowledge resources and apply the well-known machine learning method CRFs to improve the results. The obtained results show that our method is suitable for Vietnamese, with average results of 84% precision, 82.56% recall and 83.39% F-measure. Moreover, processing time is good, and the results also show the effectiveness of our feature set.

II. RELATED WORKS
In the past few years, this research topic has received considerable interest from the NLP community.

From 2007 to 2010, the Web People Search (WePS) campaigns [4, 10] aimed at searching for people on the web. This campaign series contributed many important studies on property extraction. The first WePS introduced a name disambiguation task and found that properties such as date of birth, nationality, affiliation, occupation, etc. are particularly useful as features to identify namesakes [10]. Consequently, a property extraction subtask was introduced in the second WePS [10] and continued to be considered in the third WePS [4]. In WePS-2, this subtask is to extract 18 "attribute values" of target individuals whose names appear on the provided Web pages. The problem was addressed by a combination of many technologies, such as named entity recognition and classification, text mining, pattern matching, relation discovery, information extraction and more. However, the results on the test set of 2,883 documents were quite low; the highest, achieved by the PolyUHK system, was an F-measure of 12.2 [10]. The WePS-3 attribute extraction task differs from WePS-2 in that systems are requested to relate each attribute to a person (a cluster of documents). The system with the best results had an F-measure of 0.18, a precision of 0.22 and a recall of 0.24 [4].

The results of WePS-2 also show that some properties have a higher frequency than others, such as work, occupation and affiliation [10]. Based on the most frequent properties of WePS-2, we use 10 types of properties in our experiments: other name, date of birth, date of death, birth place, death place, sex, occupation, nationality, affiliation and relatives.

Keywords- person named entity; property relation; property extraction; person property extraction; conditional random fields

I. INTRODUCTION

Nowadays, personal names are among the most frequently searched items in web search engines, and a person entity is always associated with numerous properties (also called attributes) [4, 10]. A property is a characteristic or quality of an entity, and property extraction is the extraction of the properties corresponding to an entity from text [3]. In person property extraction, we predefine a fixed set of property types and try to extract their values for a person mentioned in text. Extracting the properties of a particular person is important for uniquely identifying that person on the web. Consequently, extracting various properties has been shown to be useful for personal name disambiguation [10]. Property relation extraction is also used in object/entity analysis in text and plays an important role in expanding databases and ontologies.

A system that attempts to extract person properties from text must solve several sub-problems: named entity recognition, language ambiguity, grammatical complexity, etc. Named entity recognition (of person names, locations, dates, etc.) is a mandatory pre-processing step for property extraction. Performing the two tasks in turn would require much effort; moreover, because the two problems share many similar features, a pipeline model may repeat some steps twice.

In this paper, we focus on recognizing person entities and extracting the properties related to each person in Vietnamese text. Our model integrates named entity recognition and person property extraction based on CRFs, using a rich Vietnamese feature set that we propose.

We use CRFs to resolve our sequential labeling problem as follows: assume X = (x1 ... xT) is an input sentence consisting of T words; we must determine the corresponding tag sequence Y = (y1 ... yT). Our tag set includes 43 tags for 21 labels (the key person entity, 10 property types and 10 property values). In each tag, B denotes the beginning of a label and I denotes the inside of a label.
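To make this tag scheme concrete, the following Python sketch (our own illustration; "KeyPer" is a placeholder name for the key person label, while R_WhenBorn and VBornTime are taken from Table II) converts annotated spans into the B/I/O sequence that the tagger learns to predict.

```python
# Illustrative only: "KeyPer" is a placeholder label, not necessarily the paper's name for it.
def spans_to_tags(tokens, spans):
    """Convert annotated (start, end, label) spans into per-token B-/I- tags; end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["Nguyễn", "Văn", "A", "sinh", "ngày", "2", "tháng", "5"]
spans = [(0, 3, "KeyPer"), (3, 4, "R_WhenBorn"), (5, 8, "VBornTime")]
print(list(zip(tokens, spans_to_tags(tokens, spans))))
```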

In 2008, Banko and Etzioni built a system called O-CRF [6] that used the text between two entities to discover their relation, with a precision of 88.3% and a recall of 45.2%. The effectiveness of CRFs for relation extraction in this system is one of the reasons why we chose to apply this method in our model.

The integration of two NLP sub-problems has also received some interest from the NLP community. Word segmentation and POS tagging have been integrated in many works, most of which achieved positive results (e.g., the work of Tran Thi Oanh et al. [11] in 2010 for Vietnamese). Because both named entity recognition and person property extraction can be solved as sequential labeling problems, we propose a model that integrates them.

There is quite a lot of research on semantic relations in Vietnamese, but few of these works study the property relation. In 2010, Rathany Chan Sam et al. [8] developed a relation extraction system for Vietnamese person names and other entities based on CRFs; the average F-measure was 82.10% for Person-Organization, 86.91% for Person-Position and 87.71% for Person-Location. Among published Vietnamese NLP studies, many researchers have used CRFs and obtained good results. In 2006, Cam Tu Nguyen et al. applied CRFs to Vietnamese word segmentation [1], with average recall, precision and F1 of 93.76%, 94.28% and 94.05%, respectively. Our previous research [7] presented an experimental study on Vietnamese POS tagging using three machine learning methods (2009); the results showed that CRFs gave the best result (90.17% average precision). Because of these good results when applying CRFs to Vietnamese NLP problems, we decided to use CRFs to resolve our sequential labeling problem.
IV. OUR PROPOSED MODEL

A. Analyzing the proposed model
We propose an integrated model in which named entity recognition and person property extraction are processed simultaneously, for three main reasons. First, common pipeline approaches that recognize named entities and then extract relations in turn have some limitations. Second, both named entity recognition and person property extraction can be solved as sequential labeling problems. Third, after surveying the data, we observed that the tags for the key person entity, the property types and the property values are not very ambiguous, so they can share the same tag set. Our model consists of three main phases, as illustrated in Fig. 1.

1) Phase 1 - Sentence tagger training: This phase takes a training set of sentences as input and generates the tagger model.
We used a tagged training set to train a tagger model using CRFs. Pre-processing includes tokenization, word segmentation, chunking, etc. We manually annotated the training set with named entities and person properties using the 43 tags of 21 labels. Note that some of these properties may themselves be entities such as dates, locations or organizations. Unlike conventional entity recognition, we used tags that make it possible to determine which type of property, if any, an entity belongs to. In this phase, we extracted and selected a rich feature set obtained from various kinds of knowledge; these features are described in Section IV.B. To improve the tagging results, we used the Freebase English person name dictionary and our three Vietnamese supporting dictionaries (a Vietnamese person name dictionary, a Vietnamese location dictionary, and a dictionary of prefixes for people, locations and organizations).

2) Phase 2 - Sentence tagging: The input of this phase is the test set and the output is the set of tagged sentences.
In this phase, we used the tagger model obtained in phase 1 to tag the test set. The test data was also manually annotated with named entities and person properties to evaluate the results; these gold tags are not used during tagging.

3) Phase 3 - Sentence filtering: Using the tagged data obtained in phase 2, phase 3 retains only the appropriate sentences. A person property relation always includes the following three elements: the key person entity, the property type (such as other name, date of birth, etc.) and the property value (a specific value of the property expressed in words; for example, May 2nd might be a value of the date of birth property). The property type may be expressed explicitly in words or left implicit (hidden), but the two other elements (the key person entity and the property value) must always appear in the sentence. A small sketch of this relation structure is given below.

III. CONDITIONAL RANDOM FIELDS

Conditional Random Fields (CRFs) were first introduced in 2001 by Lafferty, McCallum and Pereira [5]; they form a statistical sequence modeling framework for labeling and segmenting sequential data.

Several advanced convex optimization techniques can be used to train CRFs. Because L-BFGS and Newton-type methods have been found to converge much faster [9], we chose the L-BFGS method for CRF optimization in our system.

For smoothing, a Gaussian prior is a well-known method that has been used by many researchers (such as Chen and Rosenfeld (1999) and Sha and Pereira (2003)). In the work of C. Sutton and A. McCallum (2006) [2], the best results were obtained when the CRF parameter GaussianPriorVariance was set to 10. In our proposed model, we used a Gaussian prior for smoothing and set GaussianPriorVariance = 10.

To train the CRFs on the given training data, we used multi-threaded CRF training to speed up the process, with the number of threads set to 4.
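As a minimal sketch of these settings (using the sklearn-crfsuite package, which is an assumption on our part and not the toolkit named in the paper; multi-threaded training is toolkit-specific and omitted), L-BFGS optimization with a Gaussian prior of variance 10 corresponds to an L2 coefficient c2 = 1/(2*10) = 0.05:

```python
import sklearn_crfsuite

# Toy data: each sentence is a list of per-token feature dicts (cf. Table I),
# each label sequence uses the B-/I-/O scheme described in Section III.
X_train = [
    [{'w0': 'nguyễn', 'is_init_cap': True}, {'w0': 'văn', 'is_init_cap': True},
     {'w0': 'a', 'is_init_cap': True}, {'w0': 'sinh', 'is_init_cap': False}],
]
y_train = [['B-KeyPer', 'I-KeyPer', 'I-KeyPer', 'O']]   # "KeyPer" is a placeholder label

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',   # L-BFGS optimization, as chosen in the paper
    c1=0.0,              # no L1 penalty
    c2=0.05,             # L2 penalty = 1/(2*variance); GaussianPriorVariance = 10 -> c2 = 0.05
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```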
Because of the good results obtained when applying CRFs to the Vietnamese NLP problems described in Section II, we decided to use them to resolve our sequential labeling problem.


We used features of syllable conjunction, regular expressions and Vietnamese syllable detection based on this work.

In addition, we used the Freebase person name dictionary (1,397,865 words) and our three supporting dictionaries to extract more useful features:
- The Vietnamese person name dictionary has 20,669 words.
- The Vietnamese location dictionary has 18,331 words.
- The prefix dictionary includes person prefixes (like "ngài" (Mr.), "PGS." (Assoc. Prof.), etc.), location prefixes (like "quận" (district), "thành phố" (city), etc.) and organization prefixes (like "trường đại học" (university), "công ty" (company), etc.). This dictionary has 790 Vietnamese words (see the feature sketch below).
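The following sketch shows one way these prefix dictionaries can be turned into binary features (feature 9 of Table I). The entries shown are only the small samples quoted above, not the full dictionaries, and the lookup logic is our own assumption:

```python
# Tiny sample entries; the real dictionaries contain 20,669 / 18,331 / 790 entries.
PERSON_PREFIXES = {"ngài", "pgs."}
LOCATION_PREFIXES = {"quận", "thành phố"}
ORG_PREFIXES = {"trường đại học", "công ty"}

def prefix_features(tokens, i):
    """Feature 9 of Table I: is the text just before token i a known prefix?"""
    feats = {"prefix:per": False, "prefix:loc": False, "prefix:org": False}
    for n in (1, 2, 3):                      # prefixes may span up to three syllables
        if i - n < 0:
            continue
        prev = " ".join(tokens[i - n:i]).lower()
        feats["prefix:per"] |= prev in PERSON_PREFIXES
        feats["prefix:loc"] |= prev in LOCATION_PREFIXES
        feats["prefix:org"] |= prev in ORG_PREFIXES
    return feats

print(prefix_features(["ngài", "Nguyễn", "Văn", "A"], 1))  # prefix:per is True here
```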

Recall from the description of phase 3 that the key person entity and the property value must always appear in the sentence; because of this rule, in this phase we removed every sentence that does not contain a key person entity or a property value.
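A minimal sketch of this filtering rule (our own illustration; "KeyPer" is a placeholder for the key person label, and the test for value labels assumes the V-prefixed tag names of Table II):

```python
def keep_sentence(tags):
    """Phase 3 rule: keep a sentence only if it contains both a key person entity
    and at least one property value tag (property types may be implicit)."""
    has_key_person = any(t.endswith("KeyPer") for t in tags)           # e.g. B-KeyPer / I-KeyPer
    has_value = any(t[2:].startswith("V") for t in tags if t != "O")   # e.g. B-VBornTime
    return has_key_person and has_value

print(keep_sentence(["B-KeyPer", "I-KeyPer", "O", "B-VBornTime", "I-VBornTime"]))  # True
print(keep_sentence(["O", "B-VBornTime"]))                                         # False
```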
[Figure 1. The proposed model: diagram showing Phase 1 (training set, preprocessing, feature extraction, feature selection, CRF training, CRF model, dictionaries), Phase 2 (test set, preprocessing, feature extraction, feature selection, CRF tagging) and Phase 3 (filter, tagged data, results).]

TABLE I. THE PROPOSED FEATURE SET

No | Feature type                                                              | Notation
1  | Current word                                                              | W0
2  | POS tag of the current word                                               | POS(W0)
3  | Is the current word lowercase, initial-capitalized or all-capitalized?    | Is_Lower(0,0), Is_Initial_Cap(0,0), Is_All_Cap(0,0)
4  | Context words                                                             | Wi (i = -2, -1, 1, 2)
5  | Syllable conjunction                                                      | Syllable_Conj(-2,2)
6  | Regular expressions capturing date/time expressions, numbers, marks, etc. | Regex(0,0)
7  | Vietnamese syllable detection                                             | Is_Valid_Vietnamese_Syllable(0,0)
8  | Is this word a valid entry in the name dictionaries?                      | dict:name, dict:first_name, dict:vname, dict:vfirst_name
9  | Is the previous word a valid entry in the prefix dictionaries?            | prefix:per, prefix:loc, prefix:org

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Experiment setup
• The data set consists of 2,700 sentences crawled from Vietnamese Wikipedia; these sentences were tagged manually.
• We used 10-fold cross validation in the experiments.
• We report recall, precision and F-measure for the overall results and for each property (see the evaluation sketch below).
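For completeness, here is a sketch of this evaluation protocol (10-fold cross validation and precision/recall/F-measure from span-level counts). It uses generic counts rather than the authors' exact scoring script, which the paper does not describe in detail; the numbers in the final call are arbitrary examples:

```python
from sklearn.model_selection import KFold

def prf(tp, fp, fn):
    """Precision, recall and F-measure from counts, usable per property or overall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

sentences = list(range(2700))        # stand-in for the 2,700 annotated sentences
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(sentences):
    # train a CRF on the training fold, tag the test fold, accumulate tp/fp/fn per tag ...
    pass

print(prf(tp=84, fp=16, fn=18))      # example counts only, not the paper's data
```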


B. Experimental results and discussion
1) Experimental results of the whole system: The average results were 84% precision, 82.56% recall and 83.39% F-measure. The best results were 92.19%, 90.22% and 91.19%, respectively.
This is a fairly good preliminary result, but since other similar studies do not use the same corpus as ours, we cannot compare our results with theirs directly. Although our task and the WePS-2 attribute extraction task [10] both extract person properties, there is also no basis for a direct comparison: the WePS-2 subtask extracts 18 attribute values of target individuals at document level. The differences lie not only in the set of property types but also in the complexity of the task. At document level, there are many sub-problems of ambiguity that have to be solved (some of them are mentioned in [10]).


B. Designing the feature set for the proposed model
Our previous work in 2010 [11] showed that using various kinds of knowledge resources can help improve the results of NLP problems. In this paper, we designed a rich feature set, listed in Table I, by integrating the kinds of features described below. A sketch of how several of these features can be computed per token is given below.
Features of the context and of the current word itself are used; they are quite similar to the feature set used in [7, 11].
The research of Cam Tu Nguyen et al. [1] summarized the general structure of Vietnamese word formation (including the structure of syllables, words and Vietnamese new words). Based on this summary, they proposed several types of context predicate templates from which various features are generated correspondingly; several of our features build on this work.
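The following illustrative feature function shows how the per-token features of Table I can combine (a sketch only; the regular expression, the window handling and the POS input are simplifications of what the paper describes, and the POS tag values are hypothetical):

```python
import re

DATE_RE = re.compile(r"^\d{1,2}(/\d{1,2})?(/\d{2,4})?$")   # crude date/number pattern

def token_features(tokens, pos_tags, i):
    """Per-token features roughly mirroring rows 1-6 of Table I."""
    w = tokens[i]
    feats = {
        "w0": w.lower(),                           # 1. current word
        "pos0": pos_tags[i],                       # 2. POS tag of the current word
        "is_lower": w.islower(),                   # 3. capitalization shape
        "is_init_cap": w.istitle(),
        "is_all_cap": w.isupper(),
        "regex_date_num": bool(DATE_RE.match(w)),  # 6. regular-expression feature
    }
    for offset in (-2, -1, 1, 2):                  # 4. context words in a [-2, 2] window
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"w{offset}"] = tokens[j].lower()
    return feats

print(token_features(["Nguyễn", "Văn", "A", "sinh", "ngày", "2/5/1980"],
                     ["Np", "Np", "Np", "V", "N", "M"], 5))
```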

3

117

/>

The machine learning method CRFs was applied to resolve this problem as a sequential labeling problem. The proposed model consists of three phases: CRF model training, CRF tagging and filtering. We also exploited a rich feature set built from various kinds of knowledge resources.
Experiments were conducted on 2,700 manually annotated sentences, with 10 frequent property types chosen for extraction. The obtained results show that our method is suitable for Vietnamese, with a best result of 92.19% precision, 90.22% recall and 91.19% F-measure. Moreover, processing time is good enough to apply the method to realistic problems. In addition, the per-tag evaluation shows that some tags achieve better results because they take advantage of the useful features we proposed.

Our work is conducted at sentence level, where there are fewer ambiguities and they are easier to resolve. Thus, the low results of WePS show that there are still many complex problems to be solved, and our research must be improved further. However, the results we achieved are satisfactory, and there is great potential for development. Solving this problem well at sentence level is a precondition for solving it well at document level.
2) Experimental results for each tag: The average experimental results for each tag are shown in Table II.
TABLE II. EXPERIMENT RESULTS FOR WHOLE SYSTEM

No | Tag            | P (%) | R (%) | F (%)
1  | OPer           | 91.35 | 90.33 | 90.84
2  | NickPer        | 89.88 | 90.44 | 90.16
3  | RPer           | 80.46 | 78.65 | 79.54
4  | VBornLoc       | 83.45 | 87.91 | 85.62
5  | VDeadLoc       | 80.35 | 80.09 | 80.22
6  | VHomeLoc       | 93.39 | 91.77 | 92.57
7  | VJobOrg        | 78.25 | 83.69 | 80.88
8  | VJob           | 81.49 | 78.22 | 79.82
9  | VSex           | 90.45 | 87.56 | 88.98
10 | VBornTime      | 83.77 | 90.39 | 86.95
11 | VDeadTime      | 80.40 | 87.28 | 83.70
12 | R_OtherName    | 91.67 | 85.19 | 88.31
13 | R_Relationship | 81.98 | 83.30 | 82.63
14 | R_WhereBorn    | 80.89 | 81.74 | 81.31
15 | R_WhereDead    | 80.23 | 85.36 | 82.72
16 | R_WhenDead     | 85.65 | 85.99 | 85.82
17 | R_Job          | 77.35 | 75.64 | 76.49
18 | R_WhereJob     | 75.92 | 73.21 | 74.54
19 | R_Sex          | 73.29 | 65.30 | 69.06
20 | R_WhenBorn     | 85.75 | 83.22 | 84.47
21 | R_WhenDead     | 76.10 | 72.77 | 74.40

ACKNOWLEDGEMENTS
This work was partly supported by Vietnam National
University Hanoi research project QG.10.38 and TRIG-B.

Generally, these are positive results. The results for property values are often better than those for property types because property types are sometimes hidden (not expressed by any word) and are influenced by language complexity more than other tags. Moreover, among both property values and property types the results are uneven across tags, because some tags tend to appear in more complex grammatical structures and are therefore harder to find; in addition, tags that take advantage of useful features such as dictionaries and Vietnamese-specific characteristics (e.g., OPer, NickPer, VHomeLoc) tend to achieve better results.
3) Performance evaluation: Because phase 1 (model training) can be done offline, we only measured processing time for phase 2 (tagging) and phase 3 (filtering). On a personal computer with an Intel(R) Core 2 Duo T7700 CPU @ 2.4 GHz, 2.00 GB of RAM and Microsoft Windows 7, with the system implemented in Java (Eclipse SDK), the average processing time for an input sentence was 0.173 seconds. Most previous works did not report processing time, so there is no basis for comparison, but 0.173 s per sentence on an average personal computer is a good result and is sufficient for realistic applications.
VI. CONCLUSIONS

In this paper, we proposed an integrated model that simultaneously recognizes person entities and extracts the relevant values of a pre-defined set of properties related to each person in Vietnamese text.


REFERENCES
[1] Cam Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le Minh Nguyen, and Quang Thuy Ha, "Vietnamese Word Segmentation with CRFs and SVMs: An Investigation", The 20th Pacific Asia Conference on Language, Information, and Computation (PACLIC 20), 1-3 November 2006, Wuhan, China.
[2] Charles Sutton and Andrew McCallum, "An Introduction to Conditional Random Fields for Relational Learning", in Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar, MIT Press, 2006.
[3] Roxana Girju, "Semantic relation extraction and its applications", ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.
[4] Javier Artiles, Andrew Borthwick, Julio Gonzalo, Satoshi Sekine, and Enrique Amigó, "WePS-3 Evaluation Campaign: Overview of the Web People Search Clustering and Attribute Extraction Tasks", in the 3rd Web People Search Evaluation Workshop (WePS 2010).
[5] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", ICML 2001: 282-289.
[6] Michele Banko and Oren Etzioni, "The Tradeoffs Between Open and Traditional Relation Extraction", ACL 2008: 28-36.
[7] Oanh Thi Tran, Cuong Anh Le, Quang-Thuy Ha, and Quynh Hoang Le, "An Experimental Study on Vietnamese POS Tagging", International Conference on Asian Language Processing (IALP 2009): 23-27, Dec 7-9, 2009, Singapore.
[8] Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, and The Minh Trinh, "Relation Extraction in Vietnamese Text Using Conditional Random Fields", AIRS 2010: 330-339.
[9] Robert Malouf, "A comparison of algorithms for maximum entropy parameter estimation", in Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49-55.
[10] Satoshi Sekine and Javier Artiles, "WePS2 Attribute Extraction Task", in the 2nd Web People Search Evaluation Workshop (WePS 2, 2009).
[11] Tran Thi Oanh, Le Anh Cuong, and Ha Quang Thuy, "Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources", Journal of Natural Language Processing, 17(3): 41-60, 2010.


