
2012 International Conference on Asian Language Processing

Named Entity Recognition for Vietnamese documents using semi-supervised
learning method of CRFs with Generalized Expectation Criteria
Thi-Ngan Pham1,2,a, Le Minh Nguyen3,b, Quang-Thuy Ha2,c

1 The Vietnamese People's Police Academy, Hanoi, Vietnam
2 KTLab, College of Technology (VNU-UET), Vietnam National University, Hanoi (VNU), Hanoi, Vietnam
3 Japan Advanced Institute of Science and Technology (JAIST), Japan

Abstract—Named Entity Recognition (NER) is an important and useful task in many natural language processing applications, and much previous work on NER has been done in other languages such as English, Japanese and Chinese. However, the Vietnamese NER task is still relatively new and challenging, due to the characteristics of Vietnamese and the lack of a large annotated corpus. This paper presents a new approach to Vietnamese NER: a semi-supervised training method for Conditional Random Fields (CRFs) models that uses generalized expectation criteria to express a preference for parameter settings. We perform several experiments using different feature settings and different training data to show the high performance of this method and to compare it with other methods.

Keywords—Generalized Expectation criteria, CRFs, semi-supervised learning

978-0-7695-4886-9/12 $26.00 © 2012 IEEE
DOI 10.1109/IALP.2012.54

I. INTRODUCTION

NER aims to identify and classify certain proper nouns into predefined target entity classes such as Person, Organization, Location, Numeral expressions, Temporal expressions, Monetary values and Percentages. Many previous studies on NER have been carried out in languages such as English, Japanese and Chinese, and achieve high performance.

Supervised learning methods require labeled instances, which are often costly to obtain, while semi-supervised learning methods are an appealing solution for reducing the labeling effort. In recent years, several approaches have been proposed for semi-supervised learning, among which the approach using generalized expectation criteria has received much attention. Generalized expectation (GE) criteria [1] are terms in a training objective function that assign scores to values of a model expectation. GE provides a method for incorporating prior knowledge into model training, so that humans can directly express preferences to the parameter estimator naturally and easily using the language of expectations, rather than the often complex and counter-intuitive language of the parameters.

In this paper, the expectations are model-predicted class distributions conditioned on the presence of selected features, and the score function is the Kullback-Leibler (KL) divergence from reference distributions that are estimated using existing resources. We apply a semi-supervised training method for conditional random fields (CRFs) with generalized expectation criteria that incorporates both labeled and unlabeled sequence data to estimate a discriminative structured predictor. The CRF [9] is a flexible and powerful model for structured prediction, based on undirected graphical models that are globally conditioned on a set of input covariates. Here, unlabeled data is combined with limited supervision, provided by the human trainer in the form of expected prior label distributions or class associations for features. We find that GE performs better than both the supervised method and several alternative semi-supervised methods, and provides better accuracy given the same amount of labeling effort. We perform several experiments using various feature configurations. We also investigate the effect of using different sizes of training data.

The rest of this paper is organized as follows. Recent NER studies related to our work are introduced in Section II. Section III briefly introduces GE criteria and how they can be applied to CRFs given conditional probability distributions of labels given features. Our proposed model is described in Section IV, which considers how to design the feature set, create the set of constraints and preprocess the data for the model. Experimental results and related remarks are presented in Section V. Conclusions are given in the last section.
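To make the KL-based scoring described above concrete, the following toy computation compares two hypothetical model-predicted label distributions against a reference distribution; the distributions and the word chosen are invented for illustration and do not come from the paper's data:

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two discrete distributions given as dicts label -> prob."""
    return sum(p[l] * math.log(p[l] / q[l]) for l in p if p[l] > 0)

# Assumed reference (target) label distribution for one feature, e.g. a word.
reference = {"B-LOC": 0.75, "B-ORG": 0.15, "O": 0.10}

# Two hypothetical model-predicted distributions for the same feature.
close_fit = {"B-LOC": 0.70, "B-ORG": 0.20, "O": 0.10}
poor_fit = {"B-LOC": 0.30, "B-ORG": 0.30, "O": 0.40}

# A GE criterion built on KL-divergence prefers parameters whose predictions
# lie closer to the reference: smaller divergence, smaller penalty.
print(kl_divergence(reference, close_fit) < kl_divergence(reference, poor_fit))
# prints True
```

The divergence is zero exactly when the predicted distribution matches the reference, which is the preference the criterion encodes.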


II. RELATED WORK

In the past few years, there have been some studies on NER tasks for Vietnamese that achieved certain results. T.Q. Tran et al. [14] presented a Support Vector Machine (SVM) based NER model which obtained an overall F-measure of 87.77%. Hoang-Quynh Le et al. [8] presented an integrated model for Vietnamese that simultaneously recognizes person entities and extracts the values of a pre-defined set of properties related to each person. This model used various kinds of knowledge resources, applied the well-known CRF machine learning method, and obtained an F-measure of 83.39%. These two models achieved acceptable results, but they used supervised learning methods, which require a large amount of labeled training data to reach high performance.

We also mention some recent semi-supervised learning methods for the NER task. B. Mohit and R. Hwa [11] used the Expectation Maximization (EM) algorithm along with their Naïve Bayes classifier to form a semi-supervised learning framework. Y. Grandvalet and Y. Bengio [15] initially proposed a method for semi-supervised CRF training using entropy regularization, and F. Jiao et al. [2] later extended this model to linear-chain CRFs. In general, entropy regularization is fragile, and accuracy gains come only with precise settings. Rathany Chan Sam et al. [12] introduced a semi-supervised learning method for the NER task in Vietnamese by combining proper-name coreference and named-ambiguity heuristics with a CRF model. The F-measures of this model for extracting "Person", "Location" and "Organization" entities are 93.13%, 88.15% and 79.35%, respectively.


There have been a few different approaches that try to learn from alternative forms of labeled resources. R. Schapire et al. [13] presented a method in which features are annotated with their associated majority labels, and used this information to bootstrap a parameterized text classification model; however, this method still required some labeled data to train the model. A. McCallum et al. [1] introduced a special case of GE, label regularization, and demonstrated its effectiveness for training maximum entropy classifiers. G. Druck et al. [3] also used GE with full distributions for semi-supervised learning of maximum entropy models, except that there the distributions are over labels conditioned on features. G.S. Mann and A. McCallum [4] presented GE criteria for linear-chain conditional random fields, a new semi-supervised training method that makes use of labeled features rather than labeled instances. This model uses conditional probability distributions of labels given features and can dramatically reduce annotation time.

Our research is based on the semi-supervised learning method of CRFs using GE criteria [4], applied to Vietnamese, and gains better results than [8], [12], [14]. L. Yao et al. [10] also built on this model, applying it to biomedical NER with high performance. This approach is discussed in detail in Section III.

III. GENERALIZED EXPECTATION CRITERIA FOR CONDITIONAL RANDOM FIELDS

Conditional Random Fields (CRFs) were first introduced in 2001 by J. Lafferty et al. [9]; the CRF is a statistical sequence modeling framework for labeling and segmenting sequential data. This model overcomes weaknesses of the HMM and the MEMM, so it has been used in many approaches to labeling tasks. In this paper, we focus on GE criteria and on how to use GE for CRFs.

Let X be some set of variables, with assignments denoted x ∈ 𝒳. Let θ be the parameters of some model that define a probability distribution over X, p_θ(X). The expectation of some function f(X) according to the model is

    E_\theta[f(X)] = \sum_{x \in \mathcal{X}} p_\theta(x) f(x)

where f(x) is an arbitrary function of the variables x producing some scalar or vector value. This function may, of course, depend only on a subset of the variables in x. A set of assignments to input variables (training data instances) may be provided, and the conditional expectation is then taken with respect to those instances.

A generalized expectation (GE) criterion [1] is a term in a parameter estimation objective function that expresses some preference about the model's expectation of variable values. That is, a generalized expectation criterion is a function G that takes as an argument the model's expectation of f(X) and returns a scalar, which is added as a term to the parameter estimation objective function:

    G(E_\theta[f(X)])

In some cases G might be defined based on a distance to some "target value" for E_θ[f(X)]. Let f̂ be the target value and let Δ be some distance function. In this case, G might be defined as:

    G(E_\theta[f(X)]) = -\Delta(\hat{f}, E_\theta[f(X)])

Thus a GE criterion may be specified independently of the parameterization, and independently of choices of any conditioning data sets. A GE criterion may operate on some arbitrary subset of the variables in x. The scoring function G and the distance function Δ may be based on information theory, or be arbitrary functions.

For the purposes of this paper, we take the functions in the GE criterion term to be conditional probability distributions and set Δ(p, q) = D(p‖q), the KL-divergence between the two distributions. For semi-supervised training of CRFs, we augment the objective function with the regularization term

    -\lambda \, D(\hat{p} \,\|\, \tilde{p}_\theta)

where p̂ is given as a target distribution and p̃_θ is obtained by normalizing the unnormalized potential

    \tilde{p}'_\theta(l) = \sum_{x \in U_m} \sum_{j \in j^*} p_\theta(y_j = l \mid x)

where f_m(x, j) is a feature that depends only on the observation sequence x, j* is defined as {j : f_m(x, j) = 1}, and U_m is the set of sequences where f_m(x, j) is present for some j.

IV. OUR PROPOSED MODEL

A. Analyzing the proposed model

Our general NER system includes two main phases, as illustrated in Fig. 1:
- Training
- Testing

Figure 1. The semi-supervised CRF model with Generalized Expectation. (In Phase 1, the training data and the GE constraints are fed to CRF learning to produce the NER model; in Phase 2, the testing data is decoded by the NER model to produce the output.)

Both the training and testing processes are conducted using the Mallet toolkit.

1) Phase 1 – Training phase
In this phase, the input includes training data with correct labels and the GE criteria, where the GE criteria are expressed in terms of a set of GE constraints that take advantage of conditional probability distributions of labels given features. We extract the features of words in the training data, combine them with the probability distributions over labels in the constraints, and then use CRF training to obtain a NER model.

2) Phase 2 – Testing phase
In the testing phase, the testing data is recognized by the NER model constructed in the training phase.
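The regularization term of Section III can be illustrated with a small numeric sketch. The marginals below are invented values standing in for CRF outputs p_θ(y_j = l | x), and the function name is ours, not the paper's; the sketch only shows how the per-position marginals at positions where a feature fires are pooled, normalized and scored against the target distribution:

```python
import numpy as np

def ge_penalty(target, marginals, lam=1.0):
    """GE regularization term lam * D(p_hat || p_tilde) for one feature f_m.

    target:    reference label distribution p_hat, shape (L,).
    marginals: one row per position j (over the sequences in U_m) where
               f_m(x, j) = 1, holding the marginals p_theta(y_j = l | x).
    """
    marginals = np.asarray(marginals)
    potential = marginals.sum(axis=0)        # unnormalized potential
    p_tilde = potential / potential.sum()    # predicted label distribution
    mask = target > 0                        # skip zero-probability labels
    return lam * np.sum(target[mask] * np.log(target[mask] / p_tilde[mask]))

# Three labels (say B-LOC, I-LOC, O) and two positions where f_m fires.
p_hat = np.array([0.7, 0.2, 0.1])
marg = [[0.6, 0.3, 0.1],
        [0.5, 0.3, 0.2]]
print(round(ge_penalty(p_hat, marg), 4))     # prints 0.0472
```

In real training this term and its gradient are computed with dynamic programming over the CRF lattice, as described in [4]; the sketch above only mirrors the definition.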

B. Preprocessing data, feature set and constraints for the proposed model

1) Preprocessing data
We recognize entities according to the CoNLL-2003 shared task, concentrating on three types of named entities: Person, Location and Organization. We also add a part-of-speech (POS) tag feature, so each token in the data has the following format:

<Word> <POS tag> <NER tag>

The process of making the training and testing data can be described as follows. First, we collect hundreds of articles from popular Vietnamese newspapers such as VnExpress, Vietnamnet and Tienphongonline. Then vnTagger (by Le Hong Phuong) is used for word segmentation and POS label assignment. After that, the NER labels are assigned manually.

TABLE I. THE ORTHOGRAPHIC FEATURES

Feature       Meaning                                             Examples
INITCAPS      The initial letter of the current word is           Thành_phố
              capitalized
CAPSMIX       Some letters in the current word are capitalized    Hà_Nội
ALLCAPS       All the letters of the current word are capitalized TPHCM
HASDIGIT      The current word includes numbers                   PC14
SINGLEDIGIT   The current word is a number between 0 and 9        5
DOUBLEDIGIT   The current word is a number between 10 and 99      50

TABLE II. THE FEATURES OF POS TAG AND LEXICAL INFORMATION

Lexical       L-2, L-1, L0, L1, L2
              L-2L-1, L-1L0, L0L1, L1L2
              L-2L-1L0, L-1L0L1, L0L1L2
POS           P-2, P-1, P0, P1, P2
              P-2P-1, P-1P0, P0P1, P1P2
              P-2P-1P0, P-1P0P1, P0P1P2
Combination   L-2P-2, L-1P-1, L0P0, L1P1, L2P2

2) Feature set
Apart from the orthographic features of the current word (see Table I), we also utilize the POS tag and lexical information as features. Denote by Pos0 and Lex0 the POS tag and the lexical information at the current position, and by Posn and Lexn the POS tag and lexical information within a window of size n. In our experiments, we set the window size to 5 (see Table II).

3) Constraints
In order to use GE criteria in the model, we build a set of GE constraints which expresses the conditional probability distributions of labels given features. For example, the feature "Hồ_Chí_Minh"/"Ho_Chi_Minh" (in this case, the feature is a word) may be a person's name, appear in the name of an organization such as "Đoàn Thanh_niên Cộng_sản Hồ_Chí_Minh"/"The Communist Youth Union of Ho_Chi_Minh", or appear in the name of a location such as "Thành_phố Hồ_Chí_Minh"/"Ho_Chi_Minh City". We calculate the probability that the feature "Hồ_Chí_Minh" belongs to each group based on the context (the relation with the previous and next positions) and on the overall frequency of this feature in the documents. A constraint has the following general format:

Feature_name label_name1 = probability1 label_name2 = probability2 ...

Constraints can be obtained manually or automatically. Automated methods include using user-provided labeled features, and using machine-provided candidate features with a Latent Dirichlet Allocation (LDA) based method [6], [10]. In our experiments, we use the second, LDA-based method. LDA is a generative probabilistic model for extracting latent topics from collections of discrete data such as text corpora. To build the corpus for LDA, we extract the features of each word within the unlabeled data set to obtain the LDA samples. We use LDA to cluster the unlabeled data into latent topics, sort the frequencies of the features for each topic, and select the most prominent features in each cluster. After this statistical analysis of feature labeling, we obtain a set of constraints. An example of one constraint:

WORD=Việt_Nam B-LOC:0.33333331111111303 B-MISC:1.3333332177777873E-8 B-ORG:1.3333332177777873E-8 B-PER:1.3333332177777873E-8 I-LOC:1.3333332177777873E-8 I-ORG:0.33333331111111303 I-PER:0.33333331111111303 O:1.3333332177777873E-8

The constraints have a great influence on the system results: we tried several sets of constraints and obtained different results. How to build an effective set of constraints is still a subject of active research; the constraints should be balanced among the labels and should cover many documents.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Experimental setup
- We use three training data sets: data1 with 500 tokens, data2 with 1000 tokens, and data3 with 1500 tokens.
- The testing data contains about 500 tokens.
- Three sets of constraints are used, with sizes of 614, 669 and 914 constraints.

The training and testing data sets are picked randomly from the processed data of 5000 tokens. We also randomly create three sets from 1300 objects of unlabeled data to build the three constraint sets mentioned above. In order to evaluate the effect of the training data and of the constraint set on the entity recognition results, we run experiments using the same testing data with the 3 training data sets and the 3 sets of constraints. In each experiment, we assign NER labels using 2 models: the supervised CRF model and the semi-supervised CRF model using GE criteria. We use precision, recall and F-measure as the evaluation measures.
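The evaluation measures above can be computed from counts of predicted and gold entities. This is a standard sketch, not the paper's evaluation script; the example counts are chosen so that the output matches the ORG row for 500 tokens of training data in Table III (PR 90.00, RE 75.00, F1 81.82):

```python
def prf(num_correct, num_predicted, num_gold):
    """Precision, recall and F1 (in percent) from entity counts."""
    precision = 100.0 * num_correct / num_predicted if num_predicted else 0.0
    recall = 100.0 * num_correct / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 9 of 10 predicted ORG entities are correct, out of 12 gold ORGs.
p, r, f = prf(9, 10, 12)
print(f"{p:.2f} {r:.2f} {f:.2f}")   # prints 90.00 75.00 81.82
```

The F-measure is the harmonic mean of precision and recall, so it rewards models that balance the two rather than maximizing either alone.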


B. Experimental results and discussions

Table III shows the results of the experiments with the three training data sets and the set of 914 constraints. Overall, the results are positive. The CRF model using GE performs better than the plain CRF model in the two experiments using training data 1 (500 tokens) and training data 3 (1500 tokens); only in the experiment using training data 2 (1000 tokens) do the two models obtain the same F-measure. The best result of the CRF model using GE is 88.89% precision, 91.43% recall and 90.14% F-measure, showing the superiority of this model over the supervised CRF model.

TABLE III. EXPERIMENTS WITH THE THREE SAMPLE SETS

              CRF                        CRF-GE
ER     PR%     RE%     F1%       PR%     RE%     F1%
500 tokens of training data
ORG    90.00   75.00   81.82     90.00   100.00  94.74
PER    100.00  66.67   80.00     100.00  66.67   80.00
LOC    12.50   100.00  22.22     25.00   100.00  40.00
ALL    58.33   72.41   64.62     63.89   82.14   71.88
1000 tokens of training data
ORG    90.00   100.00  94.74     90.00   100.00  94.74
PER    100.00  83.33   90.91     100.00  90.91   95.24
LOC    56.25   81.82   66.67     56.25   75.00   64.29
ALL    77.78   87.50   82.35     77.78   87.50   82.35
1500 tokens of training data
ORG    100.00  71.43   83.33     100.00  83.33   90.91
PER    100.00  100.00  100.00    100.00  90.91   95.24
LOC    75.00   100.00  85.71     75.00   100.00  85.71
ALL    88.89   88.89   88.89     88.89   91.43   90.14

Fig. 2 shows the experimental results of the semi-supervised CRF model using GE with the three different sets of constraints. The performance of the model with constraint sets 2 and 3 is better than that of the model with constraint set 1 on all three training data sets. The performance of the model with constraint set 2 is better than that of the model with constraint set 3 on training data 1 and training data 2, but not on training data 3. These experiments show the influence of the constraints on the performance of the system.

Figure 2. The F-measure of the semi-supervised CRF model using GE with three different sets of constraints

VI. CONCLUSIONS

In this paper, we have proposed a named entity recognition model for Vietnamese documents based on semi-supervised CRF learning using GE criteria. By applying "feature" labeling instead of "instance" labeling to obtain training data, this approach saves a great deal of annotation work and incorporates unlabeled data. One of the most important components in our NER system is building the set of constraints used to calculate the generalized expectation criteria. A set of constraints has been determined; since the better the set of constraints is, the better the NER result becomes, we are going to build more effective sets of constraints in future work.

ACKNOWLEDGEMENTS

This work was partly supported by MOET project B2012-01-24 and VNU-UET project CN-12.01.

REFERENCES

[1] A. McCallum, G. Mann and G. Druck (2007). Generalized Expectation Criteria. Technical Report UM-CS-2007-60, University of Massachusetts Amherst, August 2007.
[2] F. Jiao, S. Wang, C.-H. Lee, R. Greiner and D. Schuurmans (2006). Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL.
[3] G. Druck, G. Mann and A. McCallum (2007). Leveraging Existing Resources using Generalized Expectation Criteria. NIPS Workshop, 2007.
[4] G. S. Mann and A. McCallum (2008). Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields. ACL-08 (HLT): 870–878.
[5] G. S. Mann and A. McCallum (2010). Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. Journal of Machine Learning Research 11: 955–984.
[6] G. Druck, G. Mann and A. McCallum (2008). Learning from Labeled Features using Generalized Expectation Criteria. SIGIR '08.
[7] G. Druck, G. Mann and A. McCallum (2009). Semi-supervised Learning of Dependency Parsers using Generalized Expectation Criteria. The 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP: 360–368.
[8] H.-Q. Le, M.-V. Tran, N.-N. Bui, N.-C. Phan and Q.-T. Ha (2011). An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. IALP 2011: 115–118, Penang, Malaysia.
[9] J. Lafferty, A. McCallum and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. of the Eighteenth International Conference on Machine Learning (ICML-2001).
[10] L. Yao, C. Sun, Y. Wu, X. Wang and X. Wang (2011). Biomedical named entity recognition using generalized expectation criteria. Int. J. Mach. Learn. & Cyber. 2: 235–243.
[11] B. Mohit and R. Hwa (2005). Syntax-based Semi-supervised Named Entity Tagging. In Proceedings of the ACL Interactive Poster and Demonstration Sessions: 57–60, Michigan.
[12] R. C. Sam, H. T. Le, T. T. Nguyen and T. H. Nguyen (2011). Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text. The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD): 512–525.
[13] R. Schapire, M. Rochery, M. Rahim and N. Gupta (2002). Incorporating prior knowledge into boosting. In ICML.
[14] T. Q. Tran, T. X. T. Pham, H. Q. Ngo, D. Dinh and C. Nigel (2007). Named Entity Recognition in Vietnamese Documents. Progress in Informatics, NII (National Institute of Informatics), Tokyo, Japan, No. 4: 1–9.
[15] Y. Grandvalet and Y. Bengio (2004). Semi-supervised learning by entropy minimization. In NIPS.