
Towards a framework for building an annotated named entities corpus


Table of Contents
1 Introduction
  1.1 Overview of Named Entity Recognition (NER)
  1.2 NER Approaches
    1.2.1 Rule-based approach
    1.2.2 Machine learning approach
    1.2.3 Comparison
  1.3 Thesis contribution
  1.4 Thesis structure
2 Related Work
  2.1 Overview of our problem
  2.2 Research on building NER corpora
  2.3 Research on corpus building processes
  2.4 Overview of annotation tools
  2.5 Summary
3 Corpus building process
  3.1 Corpus building process
    3.1.1 Objective
    3.1.2 Building the annotation guideline
    3.1.3 Annotating documents
    3.1.4 Quality control
  3.2 Building a Vietnamese NER corpus with offline tools
    3.2.1 Building the annotation guideline
    3.2.2 Annotating documents
    3.2.3 Quality control
  3.3 Discussion of the Vietnamese NER corpus building process
  3.4 Conclusion
4 Online Annotation Framework
  4.1 Introduction
  4.2 Training section
  4.3 Annotating documents
    4.3.1 Online annotation interface
    4.3.2 Automated file distribution to annotators
    4.3.3 Automated saving and management of files
  4.4 Quality control
    4.4.1 Document level
    4.4.2 Corpus level
    4.4.3 Explaining unusual entities
  4.5 Conclusion
5 Evaluation
  5.1 Introduction
  5.2 Corpus evaluation
    5.2.1 Inter-annotator agreement
    5.2.2 Offline corpus evaluation
    5.2.3 Online corpus
  5.3 Time cost
    5.3.1 Overview
    5.3.2 Offline process
    5.3.3 Online framework
  5.4 Named entity recognition system
    5.4.1 Preprocessing
    5.4.2 Gazetteer
    5.4.3 Transducer
    5.4.4 Experiment
  5.5 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work
    6.2.1 Creating a bigger, higher-quality corpus
    6.2.2 Improving the online annotation framework
    6.2.3 Building a statistical NER system
A Named Entity guideline
  A.1 Basic concepts
    A.1.1 Entity and entity name
    A.1.2 Instance of entity
    A.1.3 List of entities
    A.1.4 Entity recognition rules
  A.2 Entity classification
    A.2.1 Person
    A.2.2 Organization
    A.2.3 Location
    A.2.4 Facility
    A.2.5 Religion
List of Figures
3.1 Process of building the annotation guideline
3.2 Callisto formatting
3.3 Callisto interface
3.4 Comparing two user corpora
4.1 Online annotation process
4.2 Online annotation tool interface
4.3 Annotation guideline form interface
4.4 Review tool interface
4.5 Compare-two-documents interface
5.1 Inter-annotator agreement results for two users
5.2 Accuracy rate evaluated for each entity kind
5.3 Online corpus accuracy rate evaluated for each entity kind
5.4 Named entity recognition system architecture
5.5 JAPE rule to recognize Person entities
5.6 Performance on the training data using strict criteria
5.7 Performance on the test data using strict criteria
5.8 Performance on the test data using lenient criteria
List of Tables
5.1 An example of part of the corpus annotated by two users (User A and User B)
5.2 Frequency of annotated documents
5.3 Inter-annotator agreement in online annotation
5.4 User corpus accuracy rate in the online method
5.5 Time spent on corpus quality control
5.6 Time spent during the annotation process
5.7 Quality control time in the online framework
Chapter 1
Introduction
1.1 Overview of Named Entity Recognition (NER)
The ability to determine the named entities in a text has been established as an
important task for several natural language processing areas, including information
retrieval, machine translation, information extraction and language understanding.
The term "Named Entity", now widely used in Natural Language Processing (NLP), was coined for the Sixth Message Understanding Conference (MUC-6). At the time, MUC was focusing on Information Extraction (IE) tasks, where structured information about computer activities and defense-related activities is extracted from unstructured text such as newspaper articles. In defining these tasks, people noticed that it is essential to recognize information units such as names (person, organization and location names) and numeric expressions (time, date, money and percent expressions). Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called "Named Entity Recognition and Classification".

The computational research aiming at automatically identifying named entities in texts forms a vast and heterogeneous pool of strategies, methods and representations. One of the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh IEEE Conference on Artificial Intelligence Applications. In general, every piece of NER research has to settle four questions: the language, the input format, the kinds of entity, and the learning method.
Languages:
NER has been applied to several languages. There is a large body of good research on English NER, which has addressed language independence and multilingualism problems. German is well studied in CONLL-2003 and in earlier works. Similarly, Spanish and Dutch are strongly represented, boosted by a major devoted conference: CONLL-2002 (Collins, 2002). Chinese is studied in several works (Wang et al., 1992), (Computer et al., 1996), (Yu et al., 1998), and so are French (Petasis et al., 2001), (And, 2003), Greek (Karkaletsis et al., 1999) and Italian (Black et al., 1998), (Cucchiarelli & Velardi, ). Many other languages have received some attention as well: Basque (Whitelaw & Patrick, 2003), Bulgarian (Silva et al., 2004), Catalan (Carreras et al., 2003), Hindi (Cucerzan & Yarowsky, 1999), Romanian (Cucerzan & Yarowsky, 1999), Swedish (Kokkinakis, 1998) and Turkish (Cucerzan & Yarowsky, 1999). Portuguese was examined by (Palmer et al., 1997).
For Vietnamese, only a little NER research has been done; one example is the VN-KIM IE system (Nguyen & Cao, 2007).
Input format
NER research has been applied to many document formats (general text, email, scientific text, journalistic text, etc.) and many domains (sport, business, literature, etc.). Each system is usually directed at a specific format and domain: (Maynard et al., 2001) designed a system for email, scientific texts and religious texts, and (Minkov & Wang, 2005) created a system specifically designed for email documents. Nowadays, studies aim to cover ever newer formats and domains. For example, the MUC-6 collection is composed of newswire texts, together with a proprietary corpus made of manual translations of phone conversations and technical email.
Kinds of entity
Although the list of entities depends on the kind of problem and its domain, NER systems usually cover a common set: Person, Location, Organization, Date, Time, Money and Percent. Ambiguity appears most often with Person, Location and Organization; the other kinds are less ambiguous. In each domain, NER systems also target some specific entities; for instance, in the medical domain an entity can be the name of a disease or the name of a medicine.
1.2 NER Approaches
As with other NLP problems, NER research has developed along two main approaches:
• A rule-based approach.
• A machine learning approach.
1.2.1 Rule-based approach
Building rule systems from expert knowledge is the traditional approach, and it has been applied to NLP in general and to NER in particular. A rule system is a set of rules built by people (ordinarily experts) for a particular target. Rules are created from several features: part of speech, context (the words and phrases before and after a word), surface properties (uppercase, lowercase) and special dictionaries. For example:

President Busto leave Iraq said Monday's talks will include discussion on security, a timetable for U.S forces

In this example, "Busto" appears after "President", so "Busto" is annotated as a Person entity. Similarly, "Iraq" appears after the verb "leave", so it is deemed a Location entity. In this approach we do not need an annotated corpus: the system can identify and classify entities immediately using the set of rules. The advantage of the approach is that a rule-based system is easy to build, which is why many early NER systems were rule-based. However, it is difficult to improve the accuracy rate, because organizing the set of rules is hard: if the rules are not organized appropriately, they overlap each other and the system cannot identify and classify entities correctly.
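A minimal sketch of such context rules in Python (the trigger lists, tag names and example sentence are illustrative assumptions, not the rules of any particular system):

    # Illustrative trigger lists; a real rule system would use large dictionaries.
    PERSON_TITLES = {"President", "Mr.", "Dr."}    # titles that precede a person name
    LOCATION_VERBS = {"leave", "visit", "enter"}   # verbs that often precede a location

    def tag_sentence(tokens):
        """Tag capitalized tokens as PER/LOC using two simple context rules."""
        tags = ["O"] * len(tokens)
        for i, tok in enumerate(tokens):
            if i == 0 or not tok[:1].isupper():
                continue
            if tokens[i - 1] in PERSON_TITLES:
                tags[i] = "PER"      # capitalized word after a person title
            elif tokens[i - 1] in LOCATION_VERBS:
                tags[i] = "LOC"      # capitalized word after a movement verb
        return list(zip(tokens, tags))

    print(tag_sentence("President Busto leave Iraq said Monday".split()))
    # "Busto" is tagged PER (after "President"); "Iraq" is tagged LOC (after "leave").

Even in this tiny sketch, a third rule would have to be ordered against the first two, which is exactly the rule-interdependency problem discussed in section 1.2.3.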
1.2.2 Machine learning approach
Nowadays, machine learning is the common approach to solving NLP problems; in NER it is used to improve accuracy. Several models have been applied: support vector machines, hidden Markov models, decision trees, etc. Three kinds of learning method have been used: unsupervised, supervised and semi-supervised. Unsupervised and semi-supervised systems are uncommon for NER; only a few studies apply them, for example Collins' system using an unannotated corpus (Collins & Singer, 1999) and Kim's system using proper names and an unannotated corpus. Supervised systems are used far more widely in NER, for example Bikel's hidden Markov model system (Black et al., 1998) and Borthwick's Maximum Entropy system (Borthwick et al., 1998). In machine learning systems we must build three sets, illustrated by the sketch after this list:
• A training set consists of an input vector and an answer vector, and is used together with a supervised learning method to train the system's knowledge. In NER, a training set is a corpus that has been annotated with standard labels.
• A test set is similar to the training set, but its purpose is to check and evaluate the system's accuracy. In the NER problem, the test set is a corpus similar to the training set.
• A practice set is the data to which the trained machine learning system is applied for automatic identification and classification. Processing the practice set is the goal of building the system.
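A toy illustration of the three sets, assuming scikit-learn is available; the feature set and the miniature one-sentence "corpus" are invented for the example, not taken from the thesis:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def features(tokens, i):
        # Simple per-token features: the word itself, its shape, its left neighbor.
        return {"word": tokens[i].lower(),
                "is_capitalized": tokens[i][:1].isupper(),
                "prev": tokens[i - 1].lower() if i > 0 else "<s>"}

    # Training set: input vectors plus answer vector (annotated tokens).
    tokens = "President Busto visited Hanoi".split()
    labels = ["O", "PER", "O", "LOC"]
    X = [features(tokens, i) for i in range(len(tokens))]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, labels)

    # A test set (held-out annotated tokens) would be scored the same way.
    # Practice set: raw, unannotated text that the trained model labels automatically.
    new_tokens = "President Obama visited Paris".split()
    print(model.predict([features(new_tokens, i) for i in range(len(new_tokens))]))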
1.2.3 Comparison
Annotation-based learning has some advantages over manual rule writing:
• Annotation-based learning can continue indefinitely, over weeks and months, with relatively self-contained annotation decisions at each point. In contrast, rule writing must remain cognizant of potential interdependencies with previous rules when adding and revising rules, ultimately bounding continued rule-system growth by the cognitive load factor.
• Annotation-based learning can combine the effort of multiple people more effectively. Tagged sentences from different data sets can simply be concatenated to form larger data sets with broader coverage.
• Users who write rules need considerable skill, including not only linguistic knowledge for annotation, but also competence with regular expressions and the ability to grasp the complex interactions within a rule list. In machine learning approaches, annotators only need to use the language fluently.
• The performance of systems built by rule writers tends to exhibit considerably more variance, while machine learning systems tend to give much more consistent results.
Although the machine learning approach has many advantages, it faces one main barrier: machine learning needs a high-quality corpus. So the problem is how to build a high-quality corpus.
For Vietnamese, no NER corpus has been published. Although some systems have been built with the machine learning approach, they do not share their corpora, which makes it difficult for other researchers to improve the accuracy of NER systems. For this reason, my thesis focuses on:
• Solutions for building a Vietnamese NER corpus.
• Quality control and evaluation of the corpus.
• Applying the corpus to the NER problem.
1.3 Thesis contribution
The thesis contributions are:
• We release a corpus building process based on existing corpus-building research.
• We apply the process to build a NER corpus with the offline tools method, a manual way of working with desktop programs such as the Callisto tool; we refer to this method simply as "offline tools".
• To overcome the disadvantages of offline tools, we build an online annotation framework with the following features:
  – Annotation is carried out over the Internet (annotate anytime, anywhere).
  – All steps of the process are automated: managing files, distributing them to annotators, etc.
  – A larger number of annotators is supported.
  – Corpus quality is controlled at many levels.
• We apply the corpus to evaluate our NER system.
1.4 Thesis structure
The thesis comprises the following chapters:
• Chapter one, Introduction: an overview of NER research and of approaches to building NER systems, and a statement of our problem.
• Chapter two, Related Work: an overview of research around the world on building NLP corpora in general and NER corpora in particular, which situates our own study.
• Chapter three, Corpus building process: describes a process for building a general corpus, which we then apply to build a Vietnamese corpus with offline tools.
• Chapter four, Online Annotation Framework: building on the corpus building process, we develop an online framework for annotation that overcomes the disadvantages of offline tools.
• Chapter five, Evaluation: presents our experiments and evaluation results, and describes our NER system that uses the corpus we built.
Chapter 2
Related Work
In this chapter, we discuss published corpus-building research, covering both NER corpora and NLP corpora in general. We consider several factors:
• The building process.
• Supporting tools.
• Quality control.
By learning from existing research, we build our own strategy for solving our problem.
2.1 Overview of our problem

As noted in the last chapter, building a high-quality NER corpus is very important. The corpus is used in several ways in a NER system:
• Testing the system: the corpus is used to evaluate the system's accuracy rate.
• Training the system: the corpus is used to build the system's knowledge (machine learning approach).
However, building a high-quality corpus is not easy: without a suitable method, you only obtain a low-accuracy corpus. So our problem is:
How to build a high-quality NER corpus, and how to control its quality.
We need to do three things to solve this problem: release a corpus building process, supply supporting tools, and control quality.
2.2 Building NER corpus research
When survey the problem we see that. Building NER corpus problem is not new in
the world. For example:
Kravalov´a, Jana and
ˇ
Zabokrtsk´y, Zdenˇek have built Czech Named Entity corpus
(Kravalov´a &
ˇ
Zabokrtsk´y, 2009). In this research, 6000 sentences are manually
annotated named entities. They receive about 33000 entities. They use the corpus
to train and evaluate a named entity recognizer based on Support Vector Machine
classification technique. The presented recognizer outperforms the results previously
reported for NE recognition in Czech.
Furthermore, Asif Ekbal and Sivaji Bandyopadhyay have built Bengali Named
Entity Tagged Corpus (Asif Ekbal, 2008). A Bengali news corpus has been de-
veloped from the web archive of a widely read Bengali newspaper. They used tool
”Sanchay Editor1” to manual annotate, Sanchay Editor1 is a text editor for Indian

language. Their corpus has been used to develop NER system in Bengali use pat-
tern directed shallow parsing approaches, includes: Hidden Markov Model (HMM),
Maximum Entropy (ME) Model, Conditional Random Field (CRF) and Support
Vector Machine (SVM).
No NER corpus has been published for the Vietnamese language, so some NER systems have been based on the rule-creation approach, for example VN-KIM (Nguyen & Cao, 2007), which uses the JAPE grammar. Releasing a Vietnamese NER corpus will therefore be useful for developing automatic NER research.
2.3 Research on corpus building processes
Much corpus-building research has been published, and many corpora have been created: POS corpora, TreeBank corpora, and even newer kinds such as parallel language corpora and opinion corpora. For example:
• The "Towards the national corpus of Polish" research (Adam Przepiorkowski & Lazinski, 2008) studies the building of the National Corpus of Polish, used to build a Polish dictionary. The corpus is very big, about a billion words. It is being built by four partners, who annotate various features: the entire corpus will be annotated linguistically, structurally and with metadata. During the building period, they plan to carefully consider the recommendations of the ISO/TC 37/SC 4 subcommittee, the TEI guidelines, and any future recommendations of the CLARIN project.
• The "Building a Greek corpus for Textual Entailment" research (Evi Marzelou & Piperidis, 2008) studies the building of a Greek corpus. The annotation process in that research includes several steps: creating guidelines and annotating (by expert and non-expert human annotators). The annotations are compared and a gold entailment annotation is released.
• The research "Opinion annotation in On-line Chinese Product Reviews" (Ruifeng Xu & Li, 2008) focuses on opinion annotation and explains an annotation schema comprising seven steps.
In summary, after reviewing this corpus-annotation research, we see that a corpus-building schema includes three main steps: building an annotation guideline, annotating, and quality-controlling the corpus. Our corpus will follow these steps.
2.4 Overview of annotation tools
There are many existing annotation tools we can reference:
• GATE: a framework written in the Java language. It includes many functions; annotation is one of GATE's functions.
• Callisto: the Callisto annotation tool was developed to support linguistic annotation of textual sources in any Unicode-supported language.
• EasyRef: a web service for handling (viewing, editing, importing, exporting, and reporting bugs in) syntactic annotations.
• SACODEYL Annotator: an open-source desktop application for annotating documents, which can also run as a web application.
• WordFreak: a Java-based linguistic annotation tool designed to support human and automatic annotation of linguistic data, as well as active learning for human correction of automatically annotated data.
We will draw on all of these tools when building our own tools for the annotation process.

2.5 Summary
In this chapter, we reviewed related work around the thesis: corpus building processes and annotation tools. This review directs our work towards a framework for building an annotated named entities corpus. In the next chapters, we explain our work of building the Vietnamese NER corpus.
Chapter 3
Corpus building process
In this chapter, we present the corpus building process. Like other annotation processes, it includes three steps: building the annotation guideline, annotating documents, and quality control. We then apply the process to build a Vietnamese NER corpus using some offline tools, and discuss the advantages and disadvantages.
3.1 Corpus building process
3.1.1 Objective
In this subsection, we explain the importance of a building process. If you want to build a small corpus (one containing only a few documents), you do not need a corpus building process: you simply annotate each document with an annotation tool and, if you want the corpus to be more accurate, the documents are annotated several times. However, when you want to build a large corpus, the work becomes complex and many people need to join the job. This is why a corpus building process is defined: based on it, everyone knows what work they have to do, and the manager can more easily oversee all the work and the corpus quality. The requirements of the corpus building process are:
• Everyone takes part in the corpus building.
• Each document has to be annotated several times.
• The administrator can control and evaluate the quality of the corpus as well as the quality of each annotator's work.
Following the research studied in chapter two, such as the National Corpus of Polish (Adam Przepiorkowski & Lazinski, 2008), building a Greek corpus for Textual Entailment (Evi Marzelou & Piperidis, 2008), and opinion annotation in On-line Chinese Product Reviews (Ruifeng Xu & Li, 2008), the corpus building process includes three steps:
• Building the annotation guideline.
• Annotating documents.
• Quality control of the corpus.
In the next subsections we present each step.
3.1.2 Building the annotation guideline
The annotation guideline is essentially a user manual for the annotator: annotators rely on the instructions it contains to find and classify entities. For building a NER corpus, the guideline covers several contents: the definition of a named entity, the classification of entities, and the signs of entities in documents. The annotation guideline is very important because:
• Annotators treat the guideline as their user manual for annotating correctly. Before the annotation process they have to read and study it carefully: they have to know which words or phrases can be regarded as entities and how to identify each entity kind. If they do not understand it clearly, they face many problems while annotating, and many errors will be made.
• When facing an ambiguous case (one that can be understood in several ways), the annotator decides which reading is the most correct based on the rules in the guideline. For example, consider annotating the sentence:

Trưởng công an huyện Kỳ Sơn dẫn tôi đi tới kho chứa hàng trăm khẩu súng tự chế được gom lại trong chiến dịch thu hồi vũ khí và vật liệu nổ trái phép

(The Ky Son district police chief took me to a store of hundreds of home-made weapons gathered in the campaign to recover illegal weapons and explosive materials.)

There are two ways to annotate this sentence: "Kỳ Sơn" alone is a Location entity, because it is a district name; alternatively, "Công an huyện Kỳ Sơn" is an Organization entity, because it is the name of an office. Which way do we choose? The annotation guideline states: "Entities are not annotated overlapping, and the most correct entity is the longest entity". So the second way is applied (a mechanical version of this rule is sketched at the end of this subsection).
• Because there is only one correct entity in a given context, when we compare two documents annotated independently by two different people and find differences, one annotation is correct and the other is wrong, or both are wrong. To repair them we rely on the guideline. For example, consider annotating the sentence:

Bốn mươi năm trước chợ Mường Xén chưa xuất hiện thép Thái Lan.

(Forty years ago there was no Thai steel in Muong Xen market.)

Some people annotate "Mường Xén" as Location, others as Facility. Because the preceding word is chợ (market), according to the annotation guideline "Mường Xén" has to be a Facility entity.
In summary, the annotation guideline is very important to the quality of the corpus, so we have to build it as correctly as possible and in correspondence with our language. In general, an annotation guideline is built in one of two ways: building a new one, or adapting an existing one.
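The "no overlap, longest entity wins" rule from the first example above can be checked mechanically. A minimal sketch, assuming candidate entities are given as (start, end, label) character offsets; the offsets and labels below are hypothetical:

    def resolve_overlaps(candidates):
        """Keep non-overlapping entities, preferring the longest span,
        per the guideline rule 'the most correct entity is the longest'."""
        # Consider longer spans first; ties are broken by start offset.
        ordered = sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0]))
        chosen = []
        for start, end, label in ordered:
            if all(end <= s or start >= e for s, e, _ in chosen):
                chosen.append((start, end, label))
        return sorted(chosen)

    text = "Công an huyện Kỳ Sơn dẫn tôi đi"
    candidates = [(14, 20, "Location"),      # "Kỳ Sơn" alone
                  (0, 20, "Organization")]   # "Công an huyện Kỳ Sơn"
    print(resolve_overlaps(candidates))      # the longer Organization span is kept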
3.1.3 Annotating documents
After we have a complete annotation guideline, we begin the document annotation step. This step includes the following work:
• The manager divides the set of documents into groups, and the groups are assigned to annotators. Each document can be placed in several groups (at least two), so that it is annotated by at least two people; a simple assignment scheme is sketched after this list.
• Each annotator receives their own group and uses a tool to annotate the documents. For accuracy and impartiality, annotators work independently. During the annotation period they consult the annotation guideline to decide which tag to apply: the more carefully they read the guideline, the more exactly they annotate, and the friendlier and more convenient the annotation tool, the higher the annotation performance.
• After the annotators finish their work, they give the annotated documents back to the manager, who must organize and save them.
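A minimal sketch of the assignment step, assuming a simple round-robin over annotators is acceptable; the annotator names and file names are illustrative:

    import itertools

    def distribute(documents, annotators, copies=2):
        """Assign each document to `copies` distinct annotators, so every
        document is annotated at least twice, as the process requires."""
        assignment = {a: [] for a in annotators}
        pool = itertools.cycle(annotators)
        for doc in documents:
            chosen = set()
            while len(chosen) < copies:
                chosen.add(next(pool))
            for annotator in chosen:
                assignment[annotator].append(doc)
        return assignment

    docs = ["doc%03d.txt" % i for i in range(5)]
    print(distribute(docs, ["A", "B", "C"]))
    # Each document appears in exactly two annotators' groups.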
This is an important step in the corpus building process. If people annotate carefully, the volume of work in the next step is reduced considerably. Otherwise, the next step becomes very laborious, and sometimes annotated documents even have to be re-annotated, which costs a lot of time.
3.1.4 Quality control
In this step, a supervisor finds and repairs all errors in the annotated documents. The goal is to find every error made during the annotation process. Several kinds of error occur in annotated documents:
• A missed entity. For example, in this sentence the bold phrase was not annotated:
Đi từ Hà Nội theo đường một cũ khoảng 30 km là tới xã Thống Nhất.
(Thong Nhat village is about 30 km from Ha Noi along the old Highway One.)
• An annotated word or phrase that is not an entity. For example, in this sentence a non-entity (in bold) was annotated:
chiếc <Facility>Inova</Facility> đang đi trên đường
(An Inova car is running on the road.)
• An incorrect tag. For example, in this sentence "Long Biên" should be a Facility instead of a Location:
nó đi đến chợ <Location>Long Biên</Location>
(He is going to Long Bien market.)
• An incorrect word or phrase boundary. For example, in this sentence "Long Biên" is the Facility entity, but "chợ Long Biên" is not an entity:
nó đi đến <Facility>chợ Long Biên</Facility>
(He is going to Long Bien market.)
We check for errors at two levels: the document level and the corpus level. After correcting all errors, we obtain a set of annotated documents without errors: the standard corpus we want.
Document level:
At this level, the supervisor finds errors by comparing doubly annotated documents (annotated independently from the same root document by two annotators). Wherever the two documents differ, the supervisor checks the difference and, based on the annotation guideline and the surrounding context, decides which document is correct and which is wrong; both may even be wrong (note that in the same context, only one reading is correct). In general, a tool is applied to automatically find all differences between doubly annotated documents; a minimal version is sketched below.
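A minimal sketch of such a comparison tool, assuming each annotated document is reduced to a set of (start, end, label) triples; the spans in the example are hypothetical:

    def entity_diff(version_a, version_b):
        """Return the disagreements between two independently annotated
        versions of the same document, for the supervisor to adjudicate."""
        only_a = version_a - version_b
        only_b = version_b - version_a
        return only_a, only_b

    a = {(11, 20, "Location"), (30, 38, "Person")}
    b = {(11, 20, "Facility"), (30, 38, "Person")}
    print(entity_diff(a, b))
    # ({(11, 20, 'Location')}, {(11, 20, 'Facility')}) -> review against the guideline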
Corpus level:
If we only use the document level, we cannot find errors in certain cases, for example when a document is annotated only once, or when both annotators are wrong. So we need to check the corpus at a higher level, the corpus level. We explain how to find errors in these cases.
In the first case, a word or phrase A is labeled with entity kind B many times in the corpus, but somewhere the corpus shows the word A with entity kind C. Is that wrong? We have to list all unusually annotated kinds and explain them. For example:
The phrase "Hồ Chí Minh" is annotated as a Person entity many times in the corpus; however, in one document "Hồ Chí Minh" is annotated as a Location entity.
To resolve the question, we show the entity together with its context. Based on the annotation guideline and that context, we decide whether it is correct or wrong. For example, this sentence is annotated correctly:
<Location>Hồ Chí Minh</Location> là thành phố lớn nhất cả nước.
(Ho Chi Minh is the biggest city in the country.)
To find unusual entities, we list all entities in the corpus together with the frequency of each label: entities with low-frequency labels are the unusual ones, and we explain them from their context (a sketch of this check follows below). Correcting all errors of this kind increases the precision of the corpus.
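A minimal sketch of the frequency check, assuming the corpus has been flattened to (surface form, label) pairs; the threshold is an illustrative choice, not a value from the thesis:

    from collections import Counter

    def unusual_labels(annotations, threshold=0.1):
        """Flag (form, label) pairs whose label is rare for that surface form."""
        by_form = {}
        for form, label in annotations:
            by_form.setdefault(form, Counter())[label] += 1
        flagged = []
        for form, counts in by_form.items():
            total = sum(counts.values())
            for label, n in counts.items():
                if n / total < threshold:
                    flagged.append((form, label, n, total))
        return flagged

    anns = [("Hồ Chí Minh", "Person")] * 30 + [("Hồ Chí Minh", "Location")] * 2
    print(unusual_labels(anns))
    # [('Hồ Chí Minh', 'Location', 2, 32)] -> check each occurrence in context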
In the second case, a word or phrase A is labeled with entity kind B many times in the corpus, but in some document it is not annotated at all. We need to find such occurrences and explain them; although they are difficult to discover, doing so increases the corpus recall. We list all words or phrases in the documents that match entity strings already present in the corpus, and explain them based on their context and the annotation guideline (see the sketch below). For example, in this sentence "TOYOTA" is not annotated as an entity, although it is an Organization elsewhere in the corpus:
chiếc xe TOYOTA của tôi lại bị hỏng.
(My TOYOTA has broken down again.)
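The recall check can likewise be sketched: scan the raw documents for strings already known as entities and report the occurrences that carry no annotation. The document contents and span bookkeeping below are hypothetical:

    def missed_mentions(raw_texts, known_entities, annotated_spans):
        """List occurrences of known entity strings that were left unannotated,
        so a supervisor can judge them against the guideline."""
        missed = []
        for doc_id, text in raw_texts.items():
            covered = annotated_spans.get(doc_id, set())
            for entity in known_entities:
                start = text.find(entity)
                while start != -1:
                    span = (start, start + len(entity))
                    if span not in covered:
                        missed.append((doc_id, entity, span))
                    start = text.find(entity, start + 1)
        return missed

    texts = {"d1": "chiếc xe TOYOTA của tôi lại bị hỏng."}
    print(missed_mentions(texts, {"TOYOTA"}, {"d1": set()}))
    # [('d1', 'TOYOTA', (9, 15))] -> decide from context whether it is an entity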
Through all the tasks of the quality control step, we hope to find and correct all errors in the corpus and thus build a high-quality corpus. In summary, in this section we presented a corpus building process with three steps: building the annotation guideline, annotating documents, and quality control. It can be applied to building NER corpora as well as other corpora. In the next section, we apply the process to build a Vietnamese NER corpus using offline tools: the Callisto tool and a quality tool.
3.2 Building a Vietnamese NER corpus with offline tools
In this section, we apply the corpus building process to build the Vietnamese NER corpus. We carry out all three steps manually: building the annotation guideline, annotating documents, and quality control.
3.2.1 Building the annotation guideline
The Vietnamese NER annotation guideline is built on the basis of an existing one for English NER: we reference the Simple Named Entity Guidelines V6.4 (Strassel, 2006) to build our raw annotation guideline. It answers questions such as:
• What is an entity? Which words or phrases do we identify as entities?
• How many kinds of entity are there?
• Which cases are not entities?
• How is each kind of entity identified?
Initially, there are seven entity kinds in the raw annotation guideline: Person (PER), Organization (ORG), Location (LOC), Geo-Political Entity (GPE), Facility (FAC), Religion (REL) and Nationality (NAT).
We then have to adapt the guideline to fit Vietnamese NER corpus building: we annotated a set of documents based on the guideline and looked for errors. We explain