
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI

LE NGOC ANH

NAMED ENTITY RECOGNITION FOR
VIETNAMESE DOCUMENTS

MASTER’S THESIS OF INFORMATION TECHNOLOGY

Ha Noi - 2015


UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI

LE NGOC ANH

NAMED ENTITY RECOGNITION FOR
VIETNAMESE DOCUMENTS

Major: Computer Science
Code: 60.48.0101

MASTER’S THESIS OF INFORMATION TECHNOLOGY

Supervisor: Assoc. Prof. Dr. Le Anh Cuong

Ha Noi - 2015




Originality statement
“I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due references or acknowledgements are made.”
Signature:…………………………………………



Supervisor’s approval
“I hereby approve that the thesis in its current form is ready for committee
examination as a requirement for the Master of Computer Science degree at the
University of Engineering and Technology.”
Signature:………………………………………………



Abstract
Named Entity Recognition (NER) aims to extract and classify words in documents into
pre-defined entity classes. It is fundamental for many natural language processing
tasks such as machine translation, information extraction and question answering.
NER has been extensively studied for other languages such as English, Japanese and Chinese. However, NER for Vietnamese is still challenging due to the characteristics of the language and the lack of Vietnamese corpora.
In this thesis, we study approaches to NER including handcrafted rules, machine learning and hybrid methods. We present challenges in NER for Vietnamese, such as the lack of a standard evaluation corpus and of standard methods for constructing data sets. Especially, we focus on labeling entities in Vietnamese, since most studies have not presented the details of hand-annotating entities in Vietnamese. We apply supervised machine learning methods for Vietnamese NER based on Conditional Random Fields and Support Vector Machines, with changes in feature selection suited to Vietnamese. The evaluation shows that these methods outperform other traditional NER methods, such as Hidden Markov Models and rule-based methods.



Acknowledgement
First, I would like to thank my supervisor, Assoc. Prof. Dr. Le Anh Cuong, for his advice and support. This thesis would not have been possible without him and without the freedom and encouragement he has given me over the last two years I spent at the Faculty of Technology of the University of Engineering and Technology, Vietnam National University (VNU), Ha Noi.
I have been working with amazing friends in the K19CS class, and I dedicate my gratitude to each of them: Tai Pham Dinh, Tuan Dinh Vu, and Nam Thanh Pham. I would especially like to thank the teachers at the University of Engineering and Technology, VNU, for their collaboration, great ideas and feedback during my dissertation.
Finally, I thank my parents and my brother, Hoang Le, for their encouragement, advice and support. I especially thank my wife, Linh Thi Nguyen, and my lovely daughter, Ngoc Khanh Le, for their endless love and sacrifice over the last two years. They gave me the strength and encouragement to complete this thesis.

Ha Noi, September, 2015

Le Ngoc Anh



Contents
Supervisor's approval ..................................................................................................... ii
Abstract .......................................................................................................................... iii
Acknowledgement ..........................................................................................................iv
List of Figures ............................................................................................................... vii
List of Tables ............................................................................................................... viii
List of Abbreviations ......................................................................................................ix
Chapter 1 Introduction .....................................................................................................1
1.1 Information Extraction ..........................................................................................1
1.2 Named entity recognition ......................................................................................3
1.3 Evaluation for NER ...............................................................................................4
1.4 Our work ................................................................................................................4
Chapter 2 Approaches to Named Entity Recognition .....................................................6
2.1 Rule-based methods ..............................................................................................6
2.2 Machine learning methods ....................................................................................7
2.3 Hybrid methods ...................................................................................................17
Chapter 3 Feature Extraction .........................................................................................18
3.1 Characteristics of Vietnamese language ..............................................................18
3.1.1 Lexical Resource ..........................................................................................18
3.1.2 Word Formation ...........................................................................................18
3.1.3 Spelling Variation .........................................................................................18
3.2 Feature selection for NER ...................................................................................19

3.2.1 Feature selection methods ............................................................................20
3.2.2 Mask methods ...............................................................................................21
3.2.3 Taxonomy of features ...................................................................................21
3.3 Feature selection for Vietnamese NER ...............................................................23
4.1 Data preparation ..................................................................................................26
4.2 Machine learning methods for Vietnamese NER ................................................29
4.2.1 SVM method ................................................................................................ 29


4.2.2 CRF method..................................................................................................30
4.3 Experimental results ............................................................................................31
4.4 An example of experimental results and error analysis ......................................32
Chapter 5 Conclusion ....................................................................................................37
References .....................................................................................................................38



List of Figures
Figure 1.1: Example of automatically extracted information from a news article on a
terrorist attack. Source [18] .............................................................................................1
Figure 2.1: Directed graph represents HMM ..................................................................7
Figure 2.2: How to compute transition probabilities ....................................................10
Figure 2.3: A two dimensional SVM, with the hyperplanes representing the margin
drawn as dashed lines. ...................................................................................................12
Figure 2.4: The mapping of input data from the input space into an infinitely
dimensional Hilbert Space in a non-linear SVM classifier. Source [17]. .....................14
Figure 3.1: A taxonomy of feature selection methods. Source [21]. ............................20
Figure 4.1: Generating training data stages ..................................................................27

Figure 4.2: Vietnamese NER based on SVM ...............................................................30
Figure 4.3: Vietnamese NER based on CRF ................................................................ 31



List of Tables
Table 3.1: Word-level features ......................................................................................22
Table 3.2: Gazetteer features .........................................................................................23
Table 3.3: Document and corpus features .....................................................................23
Table 3.4: Orthographic features for Vietnamese .........................................................24
Table 3.5: Lexical and POS features for Vietnamese ...................................................24
Table 4.1: An example of a sentence in training data ...................................................28
Table 4.2: Statistics of training data in entity level .......................................................28
Table 4.3: The number of label types in training data and test data .............................29
Table 4.4: Results on testing data of SVM Learner ......................................................31
Table 4.5: Results on testing data of NER using CRF method .....................................32
Table 4.6: Annotating table ...........................................................................................32



List of Abbreviations
Abbreviation   Stands for
CRF            Conditional Random Field
HMM            Hidden Markov Model
IE             Information Extraction
MEMM           Maximum Entropy Markov Model
NER            Named Entity Recognition
SVM            Support Vector Machine
ML             Machine Learning
MUC            Message Understanding Conferences
CoNLL          Conferences on Natural Language Learning
MET            Multilingual Entity Tasks
SigNLL         Special Interest Group on Natural Language Learning
POS            Part of Speech
IIS            Improved Iterative Scaling



Chapter 1 Introduction
1.1 Information Extraction
Information Extraction (IE) is a research area in Natural Language Processing (NLP). It focuses on techniques to identify a predefined set of concepts in a specific domain, where a domain consists of a text corpus together with a well-defined information need. In other words, IE is about deriving structured information from unstructured text. For instance, we may be interested in extracting information on violent events from online news, which involves the identification of the main actors of the event, its location and the number of people affected [18]. Figure 1.1 shows an example of a text
snippet from a news article about a terrorist attack and the structured information
derived from that snippet. The process of extracting such structured information
involves the identification of certain small-scale structures such as noun phrases
denoting a person or a group of persons, geographical references and numerical
expressions, as well as finding semantic relations between them. However, in this
scenario some domain specific knowledge is required (e.g., understanding the fact that
terrorist attacks might result in people being killed or injured) in order to correctly
aggregate the partially extracted information into a structured form.
“Three bombs have exploded in north-eastern Nigeria, killing 25 people and wounding 12 in an attack carried out by an Islamic sect. Authorities said the bombs exploded on Sunday afternoon in the city of Maiduguri.”

Figure 1.1: Example of automatically extracted information from a news article on a
terrorist attack. Source [18]


Starting from 1987, a series of Message Understanding Conferences (MUC) has been held, focusing on the following domains:
• MUC-1 (1987), MUC-2 (1989): Naval operations messages.
• MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
• MUC-5 (1993): Joint ventures and microelectronics domain.
• MUC-6 (1995): News articles on management changes.
• MUC-7 (1998): Satellite launch reports.

The significance of IE is related to the growing amount of information available in
unstructured form. Tim Berners-Lee, who is the inventor of the World Wide Web
(WWW), refers to the existing Internet as the web of documents and advocates that
more content will be available as a web of data. Until this transpires, the web largely
consists of unstructured documents without semantic metadata. Knowledge contained in these documents can be made more accessible for machine processing by transforming the information into relational form, or by marking it up with XML tags. For instance, an
intelligent agent monitoring a news data feed requires IE to transform unstructured
data (i.e. text) into something that can be reasoned with. A typical application of IE is
to scan a set of documents written in a natural language and populate a database with
the extracted information.
IE on text aims at creating a structured view, i.e., a representation of the information that is machine understandable. According to [18], the classical IE tasks include:
Named Entity Recognition addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., “World Health Organisation”), persons (e.g., “Muammar Kaddafi”), place names (e.g., “the Baltic Sea”), temporal expressions (e.g., “1 September 2011”), numerical and currency expressions (e.g., “20 Million Euros”), etc.
Co-reference Resolution requires the identification of multiple (co-referring)
mentions of the same entity in the text. For example, "International Business
Machines" and "IBM" refer to the same real-world entity. If we take the two sentences
"M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect
that "he" is referring to the previously detected person "M. Smith".
Relation Extraction focuses on detecting and classifying predefined relationships between entities identified in text. For example:
• PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
• PERSON located in LOCATION (extracted from the sentence "Bill is in France.")


Event Extraction refers to the identification of events in free text and deriving detailed and structured information about them. Ideally, it should identify who did what to whom, when, where, through what methods (instruments), and why. Normally, event extraction involves the extraction of several entities and the relationships between them. For instance, extracting information on terrorist attacks from the text fragment “Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.” would reveal the perpetrators (masked gunmen), the victims (people), the number of killed/injured (at least 44), the weapons used (rifles and grenades), and the location (southeast Turkey).
1.2 Named entity recognition
In IE, named entity recognition (NER) is slightly more complex. The named entities (e.g. the location, actors and targets of a terrorist event) need to be recognized as such. This NER task (also known as “proper name classification”) involves the identification and classification of named entities: people, places, organizations, products, companies, and even dates, times, or monetary amounts. For example, Figure 1.2 shows an English NER system which identifies and classifies entities in text documents. It identifies four entity types: person, location, organization and miscellaneous.

Figure 1.2: A named entity recognition system.


Previous NER studies mainly focus on popular languages such as English, French, Spanish and Japanese. Methods developed in these studies are based on supervised learning. Tri et al. [22] performed NER using Support Vector Machines (SVM) and obtained an overall F-score of 87.75%. The VN-KIM IE system built an ontology and then applied Jape grammars to recognize target named entities on the web. Nguyen Ba Dat et al. [7] employed a rule-based approach using the Jape grammar plug-in of the Gate framework for NER. For example, the text “Chủ tịch nước Trương Tấn Sang sang thăm chính thức Nhật Bản vào ngày 20/7/2015” (President Truong Tan Sang paid an official visit to Japan on 20/07/2015) will be annotated as follows: Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015.
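To make the annotation scheme concrete, the short Python sketch below converts such inline <PER>/<LOC> tags into the word-level B_/I_/O labels used later in this thesis (B_PER, I_PER, B_LOC, I_LOC). The function and its regular expression are our own illustration, not a tool used in the cited systems.

```python
import re

# Matches inline annotations such as <PER>Trương Tấn Sang</PER>.
TAG_RE = re.compile(r"<(PER|LOC|ORG)>(.*?)</\1>")

def to_bio(annotated: str):
    """Convert inline-annotated text to (word, label) pairs in BIO style."""
    labeled, pos = [], 0
    for m in TAG_RE.finditer(annotated):
        # Words outside any entity receive the label O.
        for w in annotated[pos:m.start()].split():
            labeled.append((w, "O"))
        # The first word of an entity gets B_<type>, the rest I_<type>.
        for i, w in enumerate(m.group(2).split()):
            labeled.append((w, ("B_" if i == 0 else "I_") + m.group(1)))
        pos = m.end()
    for w in annotated[pos:].split():
        labeled.append((w, "O"))
    return labeled

sentence = ("Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm chính thức "
            "<LOC>Nhật Bản</LOC> vào ngày 20/7/2015")
print(to_bio(sentence))
# [('Chủ', 'O'), ('tịch', 'O'), ('nước', 'O'), ('Trương', 'B_PER'),
#  ('Tấn', 'I_PER'), ('Sang', 'I_PER'), ..., ('Nhật', 'B_LOC'), ...]
```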
1.3 Evaluation for NER
To evaluate a NER algorithm, many metrics developed in data mining and machine learning can be used. These metrics measure the frequency with which an algorithm makes correct or incorrect identifications and classifications of named entities. The most common measures are:

Precision: the ratio of relevant items selected to the number of items selected:

$P = \frac{N_{rs}}{N_s}$

where $N_{rs}$ is the number of selected relevant items and $N_s$ is the total number of selected items.

Recall: the ratio of relevant items selected to the total number of relevant items available:

$R = \frac{N_{rs}}{N_r}$

where $N_r$ is the total number of relevant items.

F1 measure: combines precision and recall into one single measure:

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$
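As a toy illustration of these three measures at the entity level, the Python snippet below compares a hypothetical set of predicted entity spans against gold annotations; the example data is made up.

```python
# Hypothetical gold and predicted (entity text, type) pairs for one sentence.
gold = {("Trương Tấn Sang", "PER"), ("Nhật Bản", "LOC")}
predicted = {("Trương Tấn Sang", "PER"), ("20/7/2015", "LOC")}

n_rs = len(gold & predicted)        # selected items that are relevant
precision = n_rs / len(predicted)   # P = N_rs / N_s
recall = n_rs / len(gold)           # R = N_rs / N_r
f1 = 2 * precision * recall / (precision + recall) if n_rs else 0.0

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")   # P=0.50 R=0.50 F1=0.50
```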


1.4 Our work
In this thesis, we study approaches to NER in several languages, including rule-based, machine learning and hybrid methods. Because machine learning methods are relatively independent of the domain, we choose two of them, SVM and CRF, for our experiments. In addition, we build a Vietnamese data set for our evaluation. This data set consists of 10,000 Vietnamese sentences from the VLSP project. First, the raw data is processed to obtain word-segmented data. Then the word-segmented data is manually annotated to obtain NER data. In another stage, the word-segmented data is automatically tagged by a part-of-speech (POS) tagging tool. Finally, the data from these two stages are combined into the final data set used in our experiments.
We study feature extraction methods to improve system performance and learning time. For the Vietnamese NER problem, we focus on the characteristics of the Vietnamese language and on methods to select features. We compare the performance of our method with that of well-known methods such as CRF and SVM.
The rest of this thesis is organized as follows. Chapter 2 introduces approaches to NER. Chapter 3 presents how to extract features from data for supervised machine learning, the characteristics of the Vietnamese language, and features for Vietnamese NER. Chapter 4 presents our experiments on NER in Vietnamese documents. Finally, in Chapter 5, we summarize the thesis and discuss future work.





Chapter 2 Approaches to Named Entity
Recognition
This chapter reviews popular methods for the detection and classification of named entities. The chapter is organized as follows. In Section 2.1, we present rule-based methods. In Section 2.2, machine learning methods and their variations are described in detail. Section 2.3 presents hybrid methods for recognizing named entities. The methods presented in this chapter form the basis of our approach for Vietnamese NER.
2.1 Rule-based methods
These methods rely on the intuition of human designers, who assemble a large number of rules capturing intuitive notions. For example, in many languages person names are usually preceded by some kind of title. The name in the text “Ông Ôn_Gia_Bảo đã đến thăm Việt_Nam vào năm 2004” (Mr. Wen Jiabao visited Vietnam in 2004) can easily be discovered by a rule like [21]: title capitalized_word => title PERSON.
In another example, the left or right context of an expression is used to classify a named entity. For instance, the location in the text “Tỉnh Quảng_Ninh đang phải hứng chịu trận mưa lịch sử” (Quang Ninh province is suffering historic rainfall) can be recognized by the rule: location_marker capitalized_word => location_marker LOCATION.
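As a rough sketch of how such rules can be implemented, the Python fragment below encodes the two example rules over word-segmented tokens. The title and marker lists are small illustrative assumptions; a real rule-based system uses far richer resources.

```python
# Illustrative (incomplete) marker lists for the two rules above.
TITLES = {"Ông", "Bà", "Anh", "Chị"}               # person-title rule
LOCATION_MARKERS = {"Tỉnh", "Thành_phố", "Huyện"}  # location-marker rule

def apply_rules(tokens):
    """Label the word following a title/marker when it is capitalized."""
    labels = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1]
        if not nxt[:1].isupper():          # both rules require a capitalized word
            continue
        if tokens[i] in TITLES:
            labels[i + 1] = "B_PER"        # title capitalized_word => PERSON
        elif tokens[i] in LOCATION_MARKERS:
            labels[i + 1] = "B_LOC"        # location_marker capitalized_word => LOCATION
    return labels

tokens = "Tỉnh Quảng_Ninh đang phải hứng chịu trận mưa lịch_sử".split()
print(list(zip(tokens, apply_rules(tokens))))
# [('Tỉnh', 'O'), ('Quảng_Ninh', 'B_LOC'), ('đang', 'O'), ...]
```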
In many studies, rule-based methods are combined with other techniques, for example to filter news by fine-grained knowledge-based categorization [2].
In [6], a rule-based system for Vietnamese NER is used in which rules are incrementally created while annotating a named entity corpus. It is the first publicly available open-source project for building an annotated named entity corpus. The NER system built in this project achieves high performance, with an F-measure of 83%. The VN-KIM IE system uses Jape grammars to recognize entities of various types (Organization, Location, Person, Date, Time, Money and Percent) with an F-measure of 81%.
Compared with machine learning approaches, the advantage of the rule-based approach is that it does not need a large annotated data set. This means the system can be put into operation and produce results immediately after the rules are constructed.


However, a major disadvantage of the rule-based method is that a set of rules may work well for a certain domain but may not be suitable for other domains. This means the entire knowledge base of a rule-based system has to be rewritten to fit the requirements of a new domain. Furthermore, constructing any rule base of sufficient size is very expensive in terms of time and money.
2.2 Machine learning methods
Machine learning methods are used in many NER systems for different languages. These methods include HMM, Maximum Entropy Markov Model (MEMM), SVM and CRF. This section describes these methods in detail.
2.2.1 Hidden Markov Model
HMMs were introduced in the 1960s. They are now applied in many research fields such as speech recognition, bioinformatics and natural language processing. HMMs are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities, and they allow estimating the probabilities of unobserved events. HMMs are very common in natural language processing.
Definition of HMM
Given a token sequence $O = o_1 o_2 \ldots o_T$, the goal is to find an optimal tag sequence $S = s_1 s_2 \ldots s_T$ of hidden states.

Figure 2.1: Directed graph represents HMM

In Figure 2.1, $s_i$ is the state at time $t = i$ in the state chain $S$. The underlying states follow a Markov chain where, given the recent state, the future state is independent of the past:

$P(s_{t+1} \mid s_1, s_2, \ldots, s_t) = P(s_{t+1} \mid s_t) \quad (2.1)$

and the transition probabilities are

$a_{kl} = P(s_{t+1} = l \mid s_t = k)$

Here $k, l = 1, 2, \ldots, M$, where $M$ is the total number of states. The initial probabilities of the states are $\pi_k = P(s_1 = k)$, with $\pi_k \geq 0$ for any $k$ and $\sum_k \pi_k = 1$. Following (2.1), given the state $s_t = k$, the observation $o_t$ is independent of the other observations and states. For a fixed state, the observation is generated according to a fixed probability law: given state $k$, the probability law of $o_t$ is specified by

$b_k(o) = P(o_t = o \mid s_t = k)$

In summary, an HMM is specified by the triple $\lambda = (A, B, \pi)$.
Model Parameters

The parameters of an HMM involve:
• Transition probabilities: $A = \{a_{ij}\}$. Each $a_{ij} = P(s_{t+1} = j \mid s_t = i)$ represents the probability of transitioning from state $i$ to state $j$.
• Initial probabilities: $\pi = \{\pi_i\}$; $\pi_i = P(s_1 = i)$ is the probability that $i$ is a start state.
• Emission probabilities: a set $B$ of functions of the form $b_i(o) = P(o_t = o \mid s_t = i)$, which is the probability of observation $o$ being emitted in state $i$.

Model Learning
Up to now we have assumed that we know the underlying model $\lambda = (A, B, \pi)$. We want to maximize the parameters with respect to the current data, i.e., we are looking for a model $\lambda'$ such that

$\lambda' = \operatorname{argmax}_{\lambda} P(O \mid \lambda)$

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda'$ such that $P(O \mid \lambda') = \max_{\lambda} P(O \mid \lambda)$. But it is possible to find a local maximum: given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \geq P(O \mid \lambda)$.

Parameter Re-estimation
We use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters, improving the probability with which the given observation sequence is generated by the new parameters. Three groups of parameters need to be re-estimated:
• Initial state distribution: $\pi_i$
• Transition probabilities: $a_{i,j}$
• Emission probabilities: $b_i(o_t)$

Re-estimation of transition probabilities:
What is the probability of being in state $s_i$ at time $t$ and going to state $s_j$, given the current model and parameters?

$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid O, \lambda)$

Figure 2.2: How to compute transition probabilities

The intuition behind the re-estimation equation for transition probabilities is

$\hat{a}_{i,j} = \frac{\text{expected number of transitions from state } s_i \text{ to } s_j}{\text{expected number of transitions out of state } s_i}$

Formally:

$\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, where $\gamma_t(i) = \sum_{j} \xi_t(i, j)$

Estimation of initial state probabilities:
The initial state distribution $\pi_i$ is the probability that $s_i$ is a start state. Re-estimation is easy: it is the expected probability of being in state $s_i$ at time 1. Formally:

$\hat{\pi}_i = \gamma_1(i)$

Estimation of emission probabilities:
Emission probabilities are re-estimated as the expected number of times symbol $v_k$ is observed in state $s_i$, divided by the expected number of times state $s_i$ is visited. Formally:

$\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise. Note that the $\delta$ here is the Kronecker delta function and is not related to the $\delta$ in the discussion of the Viterbi algorithm [19].
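For concreteness, the numpy sketch below implements one Baum-Welch re-estimation step for a discrete HMM following the formulas above. It is our own illustration, not code from any of the systems discussed in this thesis.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM update for lambda = (A, B, pi); obs is a list of symbol indices."""
    M, T = A.shape[0], len(obs)
    # Forward pass: alpha[t, i] = P(o_1..o_t, s_t = i | lambda)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Backward pass: beta[t, i] = P(o_{t+1}..o_T | s_t = i, lambda)
    beta = np.zeros((T, M))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                       # P(O | lambda)
    # xi[t, i, j] = P(s_t = i, s_{t+1} = j | O, lambda)
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= likelihood
    gamma = alpha * beta / likelihood                  # gamma[t, i] = P(s_t = i | O)
    # Re-estimation formulas from the text.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.asarray(obs) == k                    # Kronecker delta(o_t, v_k)
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, likelihood
```

Iterating this step until the returned likelihood stops improving yields the local maximum described above.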
Updating the HMM parameters:
We get from $\lambda = (A, B, \pi)$ to $\lambda' = (\hat{A}, \hat{B}, \hat{\pi})$ by the update rules $a_{i,j} \leftarrow \hat{a}_{i,j}$, $b_i(k) \leftarrow \hat{b}_i(k)$, and $\pi_i \leftarrow \hat{\pi}_i$.
In the Named Entity Recognition problem, the observed sequence consists of all the words in a sentence, and the hidden events are the labels of the observed words, for example B_PER, I_PER, B_LOC and I_LOC. So we have to find the chain of labels (the labels of named entities) which describes the observed words with the highest probability.
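This decoding step is usually done with the Viterbi algorithm [19]. The sketch below finds the most probable label chain for a toy three-word sentence; all parameter values are made-up assumptions for illustration only.

```python
import numpy as np

labels = ["O", "B_PER", "I_PER"]
vocab = ["ông", "Ôn_Gia_Bảo", "thăm"]        # toy vocabulary
pi = np.array([0.8, 0.2, 0.0])               # initial label probabilities
A = np.array([[0.7, 0.3, 0.0],               # transition probabilities
              [0.2, 0.0, 0.8],
              [0.5, 0.1, 0.4]])
B = np.array([[0.6, 0.0, 0.4],               # emission probabilities per label
              [0.1, 0.9, 0.0],
              [0.1, 0.9, 0.0]])

def viterbi(obs):
    T, M = len(obs), len(labels)
    delta = np.zeros((T, M))                 # best score of a path ending in state j
    back = np.zeros((T, M), dtype=int)       # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: come from i, move to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]         # follow backpointers from the best end
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [labels[s] for s in reversed(path)]

print(viterbi([0, 1, 2]))   # ['O', 'B_PER', 'O'] for "ông Ôn_Gia_Bảo thăm"
```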
Limitation of HMM
In sequence problems, we assume that the state at time $t$ depends only on the previous state. However, this assumption is not sufficient to represent the relationships between the elements of the sequence.
2.2.2 Support Vector Machine
An SVM is a system which is trained to classify input data into categories. The SVM is trained on a large training corpus containing marked samples of the categories. Each training sample is represented by a point plotted in a hyperspace. The SVM then attempts to draw a hyperplane between the categories, splitting the points as evenly as possible. For the two-dimensional case, this is illustrated in Figure 2.3.



Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines.
Suppose that there is training data for the SVM in the form of $n$ $k$-dimensional real vectors $x_i$ and integers $y_i$, where $y_i$ is either 1 or -1. Whether $y_i$ is positive or negative indicates the category of the vector $x_i$. The aim of the training phase is to plot the vectors in a $k$-dimensional hyperspace and draw a hyperplane which separates the points of the two categories as evenly as possible.
Suppose that this hyperplane has normal vector $w$. Then the hyperplane can be written as the set of points $x$ satisfying

$w \cdot x - b = 0$

where $\frac{b}{\|w\|}$ is the offset of the hyperplane from the origin along $w$. We choose this hyperplane so that it maximizes the margin between the points representing the two categories. Imagine two hyperplanes lying at the “border” of the two regions, in each of which there are only points of one category. These two hyperplanes are perpendicular to $w$ and cut through the outermost training data points in their respective regions. Two such planes are illustrated as dashed lines in Figure 2.3. Maximizing the margin between the points representing the two categories can be considered as keeping these two hyperplanes as far apart as possible. The training data points which end up on the dashed lines in Figure 2.3 are called support vectors, hence the name Support Vector Machine. The hyperplanes can be described by the equations

$w \cdot x - b = 1$

and

$w \cdot x - b = -1$

The distance between the two hyperplanes is $\frac{2}{\|w\|}$. Since the SVM wants to maximize the margin, we need to minimize $\|w\|$. We also do not want to extend the margin indefinitely, since the training data points should not lie inside the margin. Thus, the following constraints are added:

$w \cdot x_i - b \geq 1$ for $x_i$ in the first category, and

$w \cdot x_i - b \leq -1$ for $x_i$ in the second category.

This can be rewritten as the optimization problem of minimizing $\|w\|$ subject to

$y_i (w \cdot x_i - b) \geq 1, \quad i = 1, \ldots, n$

If we replace $\|w\|$ with $\frac{1}{2}\|w\|^2$, Lagrange multipliers can be used to rewrite this optimization problem into the following quadratic optimization problem:

$\min_{w, b} \max_{\alpha \geq 0} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}$

where the $\alpha_i$ are Lagrange multipliers [20]. Data sets which can be divided in two in this way are called linearly separable. Depending on how the data is arranged, this may not be possible. It is, however, possible to use an alternative model involving a soft margin [24]. The soft margin model allows for a minimal number of mislabeled examples. This is done by introducing a slack variable $\xi_i$ for each training data vector $x_i$. The function to be minimized, $\frac{1}{2}\|w\|^2$, is modified by adding a term representing the slack variables. This can be done in several ways, but a common way is to introduce a linear function, so that the problem is to minimize:

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$

for some constant $C$. To this minimization, the following modified constraints are added:

$y_i (w \cdot x_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0$

By using Lagrange multipliers as before, the problem can be rewritten as:

$\min_{w, \xi, b} \max_{\alpha, \beta} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$

for $\alpha_i, \beta_i \geq 0$ [3]. To get rid of the slack variables, one can also rewrite this problem into its dual form:

$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

subject to the constraints

$0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$

It is worth mentioning that there are also non-linear classifiers. While linear classifiers require that the data be linearly separable, or nearly so, in order to give good results, in a non-linear classifier the input data is transformed so as to become linearly separable: the input vectors $x_i$ are mapped into an infinitely dimensional Hilbert space, where it is always possible to linearly separate the two data categories.

Figure 2.4: The mapping $\varphi$ of input data from the input space into an infinitely dimensional Hilbert Space in a non-linear SVM classifier. Source [17].
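As a brief illustration of the role of the constant $C$ in the soft-margin model, the sketch below (assuming scikit-learn is available) trains linear SVMs on synthetic two-dimensional data. Note that scikit-learn's decision function is written $w \cdot x + b$, so its intercept corresponds to $-b$ in the notation used above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),   # category y = -1
               rng.normal(+2.0, 1.0, (50, 2))])  # category y = +1
y = np.array([-1] * 50 + [+1] * 50)

# A small C tolerates more margin violations (larger slack); a large C
# penalizes them heavily, approaching the hard-margin classifier.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"w={clf.coef_[0].round(2)}, b={-clf.intercept_[0]:.2f}")
```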

