Tải bản đầy đủ (.pdf) (4 trang)

Báo cáo khoa học: "Multilingual adaptations of ANNIE, a reusable information extraction tool" doc

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (710.78 KB, 4 trang )

Gate Unicade Editar
-
Unicade Sampler-tat
File Edit Options Help
H
IL
IC[411
'Anal
UrocodeirIS

9 ■
Arabic:
13.0
3
zp-
J
.11
JSI
J
S
JI9
U.
Armenian:
tipEnni Duval& lau.tht 11.6.241 laUlnalagbula
Chinese:
1
-
kkl'
,3
tT•
.


-ftals
,
c.
Farsi I
Persian:
j3
Georgian: 8063b
3368

66).5. 863033.
Hebrew:
-
.5 pnn s5
no
royipt 5m5 513-
.
'PC
Hindi:
4
isbi T
1tsd
I


31
t
4tsi
d4f
Japanese:
f1

,
14/37 tIt's:641
-
1
-
0
-t-tiafokt*-)(-mth „
Korean:
1-1
-
L -W-21ff

VoiR.
Marathi:
4t
ti
-1
/9113) 9I
-
Wa,
1:
MT
TA
-
d
-
ff41
-
.
Pashto:

.41
4
9

aseP
Sanskrit:


vITH
I
Thai:
iuAttn9t.
-
anlei
Urdu:
-
bcej31
Las"
Yiddish:
UEP1
-
rn Lou DY
px w5a
Inv VP
TS.
Multilingual adaptations of ANNIE, a reusable information
extraction tool
Diana Maynard, Hamish Cunningham
Dept of Computer Science
University of Sheffield

Sheffield, Si 4DP, UK

Abstract
In this demo we will present GATE, an archi-
tecture and framework for language engineer-
ing, and ANNIE, an information extraction sys-
tem developed within it. We will demonstrate
how ANNIE has been adapted to perform NE
recognition in different languages, including In-
dic and Slavonic languages as well as Western
European ones, and how the resources can be
reused for new applications and languages.
1 Introduction
GATE' is an architecture, development envi-
ronment, and framework for building systems
that process human language (Cunningham et
al., 2002; Maynard et al., 2002). It has been
in development at the University of Sheffield
since 1995, and has been used for many R&D
projects, including Information Extraction in
multiple languages and media, and for multiple
tasks and clients. GATE is available freely, as
an open source system, under the GNU library
licence, and has been downloaded by around
2500 sites worldwide. The core architecture and
some applications developed within GATE have
been previously demonstrated (Cunningham et
al., 2002); however, this demonstration will fo-
cus on the multilingual aspects of GATE, and
adaptations of its IE system for different lan-

guages.
Version 2 of GATE has a large number of
added features from the previous version, such
as:

comprehensive multilingual support via
Unicode
'This work has been supported by the Engineering
and Physical Sciences Research Council (EPSRC) un-
der grants GR/K25267 and GR/M31699, and by several
smaller grants.

tools for performance evaluation

support for manual annotation

reusable visualisation components

database support (Oracle, PostgresQL)

support for distributed resources from the
Web

comprehensive document format support
(SGML, XML, HTML, RDF, email, plain
text)
Figure 1: Unicode text in Gate2
2 Processing Resources
GATE provides a baseline set of reusable and
extendable language processing components for

common NLP tasks, known collectively as AN-
NIE (A Nearly New Information Extraction
System). These include a Unicode tokeniser,
sentence splitter, POS tagger, gazetteer, seman-
tic tagger, name coreferencer (orthomatcher)
219
IMMFis AnnotatiooidA
Din surnarui
Nr.336/1111110.2002

Gnidul calatoriilorfa4a vita in
ball
Bilandul activitadii poobor
buzodani in
anul
111.(partea
b•Polidistii au posibOtatea legislativa de aid indeplini mai blue
sarcinileii misnimle cc le revin• A incep-ut cnn nou an de lupta cu infractorii: Printre
primii la calatorli fara vita,
5eribedn Nadine
din

a font prinsa pa aerop-ort
incercand nO fuga de datorii neonorate de circa
Un het din biserici
font intuit tocmai de pazni-cul care l-a prins/S-a spart biblioleca Casei de culkira
dar hodii
RIJ s-au afire de nici o carte/Explozie la sonda 304 Grajdana •Jurnal rutier
-
Ultima zi din an sub semnul imprudendei Somne papilare" de Col

Nicotae
Rutaru

"kitrnet pas cu pas" de Adrian firlacdru4-Tanasescu•invataturi din
arhivele brancovenes.ti•
MO
-AMA PERFOPMANTEI "Este momentui sa ne redescoperm pasiunea

p.
_1We
5e4
s
_BLOCK
Date
Romanian
20
34
kind=
date, rule= DateOnlyFinal)
Location
Romanian
73
79 Me= LocFinal}
Date
ilkoutan
I
an
132
136
{kind=

date, rure-DateOnlyFinal)
Person
Romanian
334
349 {rule= PersonFinal}
Location
Romanian
354
359 {rule= LocFinal}
Money
Romanian•434
_
451 {kincl=number, rule=MoneyCurrency
Person
Romanian

707
726
{rule=PersonFinal}
Person
Romanian
754
771{ruie.Per$onFl}
'MOW
and pronominal coreferencer. For more details,
see (Cunningham et al., 2002). ANNIE cur-
rently produces precision and recall figures for
named entity recognition of around 90%, de-
pending on the text type.
An online demo of ANNIE is available at

/>. A set of
movies demonstrating document and corpus
loading, processing and storing, manual an-
notation of documents and corpora, creat-
ing, running, saving and restoring applica-
tions and viewing their results is available at
http: / /www.gat e. ac. uk/
demos / movies .ht ml.
3 Multilingual support - the GATE
Unicode Kit
GATE is one of the few architectures to sup-
port multilingual processing, using Unicode as
its default text encoding. A Unicode enabled
graphical user interface (GUI)
needs to address
two main issues: the capability to display text
and the ability to enter text in other languages
than the default one.
It also provides a means of entering text in
a variety of languages and scripts, using virtual
keyboards where the language is not supported
by the underlying operating platform (Java it-
self does not support input in many languages
covered by Unicode, although it supports Uni-
code representation). Figure 1 depicts text in
various scripts displayed in GATE. The facili-
ties have been developed as part of the
EMILLE
project (Baker et al., 2002), designed to con-
struct a 63 million word corpus of South Asian

languages. There are currently 28 languages
supported in GATE, and more are planned for
the future. Since GATE is an open architecture,
new virtual keyboards can easily be defined by
users and added to the system.
Apart from the input methods, GUK also
provides a simple Unicode-aware text editor
which is important because not all platforms
provide one by default or the users may not
know which one of the already installed edi-
tors is Unicode-aware. Besides providing text
visualisation and editing facilities, the GUK ed-
itor also performs encoding conversion opera-
tions. The editor has proved a useful tool dur-
ing the development and testing of GATE in a
cross-platform environment, while the ability to
handle Unicode enables applications developed
within GATE to be easily ported to new lan-
guages.
4 The future isn't English
Robust tools for multilingual information ex-
traction are becoming increasingly sought af-
ter now that we have capabilities for processing
texts in different languages and scripts. While
the default IE system is English-specific, some
of the modules can be reused directly (e.g.
the Unicode-based tokeniser can handle Indo-
European languages), and/or easily customised
for new languages (Pastra et al., 2002). So far,
ANNIE has been adapted to do IE in Bulgarian,

Romanian, Bengali, Greek, Spanish, Swedish,
German, Italian, and French, and we are cur-
rently porting it to Arabic, Chinese and Rus-
sian, as part of the MUSE project
2
.
Figure 3: Romanian news text annotated in
GATE
4.1 NE in Slavonic languages
The Bulgarian NE recogniser (Paskaleva et al.,
2(02) was built using three main processing re-
sources: a tokeniser, a gazetteer and a semantic
grammar built using JAPE. There was no POS
tagger available in Bulgarian, and consequently
we had no need of a sentence splitter either.
The main changes to the system were in terms
2 />220
al Gate
17-1111 Applications
-
ej
,
BG system
CI.

Language Resources
=
A
Text EiG
Processing Resources

BG Names
t"
.
■ Unicode Tokeniser
=

BG Gazetteer
g
Data stores
Messages I
di
BO
system

Text BO
I
Emol
-
apgslee aerompe
8 METO4HEI EBFIOna AHM,I9
e
H8 crepes
reproahuln m chpaagnn ca e Sanagaa Espana.
tR
-
I'M
K
IHNHI
.
X8M

30 Mavr 2000


D
-
r
M
Cunningham
Dr.
M
Cunningham
D-rAHrenoaa
Type •
Set Stag
End
Features
Lookup
,Default,
71
78
trninorType=country, majorType=location}
Lookup
,Default>
60 68
trninorType=country, majorType=location}
Lookup
,Default>
37 43
trninorType=country, majorType=location}
Lookup

,Default>
0
8
trninorType=country, majorType=location}
Person
,Default>
210 216
tkInd=personName, rule=KallnaPersonl
Person
,Default>
173 179
tkInd=personName, rule=HamishPerson}
Person
,Default>
148 154
tkInd=personName, rule=HamishPerson}
Person
,Default>
108 113
tkind=personName, rule.HamishPersonBet
Person
,Default>
266 270
tkind=personName, rule=GaliaPersont
Person
,Default>
248 253
tkind=personName, rule=GaliaPersont
Person
,Default>

230 236
tkind=personName, rule=KalinaPersonl
E <Default>
1—T7 Lookup
F
SpaceToken
DA
Gate 2.0alphal
File Edit Tools Help
Figure 2: Bulgarian named entities in GATE
of the gazetteer lists (e.g. lists of first names,
days of the week, locations etc. were tailored for
Bulgarian), and in terms of some of the pattern
matching rules in the grammar. For example,
Bulgarian makes far more use of morphology
than English does, e.g. 91% of Bulgarian sur-
names could be directly recognised using mor-
phological information. The lack of a POS tag-
ger meant that many rules had to be specified in
terms of orthographic features rather than parts
of speech. Figure 2 shows some Bulgarian text
annotated in GATE.
Since the structure of the Bulgarian and
Russian languages is quite similar, we antic-
ipate that converting the Bulgarian system
to Russian will be fairly straightforward, and
will involve mostly replacing and/or updating
gazetteer lists at least to obtain comparable
results.
4.2 NE in Romanian

The Romanian NE recogniser (Hamza et al.
;
2002) was developed from ANNIE in a simi-
lar way to the Bulgarian one, using a tokeniser,
gazetteer and a JAPE semantic grammar. Fig-
ure 3 shows some Romanian text annotated in
GATE.
Romanian is a more flexible language than
English in terms of word order; it is also agglu-
tinative e.g. definite articles attach to nouns,
making a definite and indefinite form of both
common and proper nouns.
As with Bulgarian, the tokeniser did not need
to be modified, while the gazetteer lists and
grammar rules needed some changes
;
most of
which were fairly minor. For both Bulgar-
ian and Romanian, the modifications necessary
were easily implemented by a native speaker
who did not require any other specialist skills
beyond a basic grasp of the JAPE language and
the GATE architecture. No Java skills or other
programming knowledge was necessary. The
Gate Unicode kit was invaluable in enabling the
preservation of the diacritics in Romanian, by
saving them with UTF-8 encoding.
4.3 NE in other languages
ANNIE has also been adapted to perform NE
recognition on English, French and German di-

alogues in the AMITIES project
3
, a screenshot
of which is shown in Figure 4. Since French and
German are more similar to English in many
ways than e.g. Slavonic languages, it was very
easy to adapt the gazetteers and grammars ac-
cordingly.
3
/>221
M
ArruChat

7EIEF
File
AMITIES: Amities, how can I help, je vous ecoute, was lcann ich ftir Sie tun?
USER: Guten Tag, Ich mochte gem wissen, movie! ich out metnem Konto hate
AMITIES: Darf ich bitte Ihre Kontonummer haben?
USER: Mon numero est 5522333344945555
AMITIES: Pouvez-vous me donner votre adresse et votre date de naissance, s'il
vous plait?
USER: I live at 547 Ratchet Lane, and my date of birth is the 18th of May, 1969
AMITIES: Thank you. Here is the information about your account.
Figure 4: AMITIES multilingual dialogue
4.4 Surprise languages
We are currently investigating methods of
adapting ANNIE to new languages with the
minimum of resources and time. Our previous
experiments with languages other than English
have demonstrated that we can get reasonable

results in around 2 person months using a na-
tive speaker and hand-coded semantic tagging
rules, without requiring resources such as dictio-
naries or POS taggers for that language. We are
also planning participation in the TIDES-based
"surprise language experiment", which requires
various NLP tasks such as IE, IR, summarisa-
tion and MT to be carried out in a month on
a surprise language, the nature of which will
not be known in advance. The open and flex-
ible architecture of GATE, and the separation
of linguistic data from processing makes it an
ideal environment within which to perform such
a task. Any available linguistic resources such
as dictionaries and POS taggers can be simply
plugged into the model, but if these are not
available we can simply modify other compo-
nents as necessary.
5 Conclusion
In this demo we have shown how an existing set
of IE tools has been modified to a diverse set
of languages with minimum overhead. The ad-
vantage of having such low-overhead portability
is that it enables quick deployment of IE tools
with acceptable performance, which, even if not
developed in end-used applications
;
can be used
to bootstrap the creation of IE-annotated cor-
pora and/or facilitate the training of learning

tools for adaptive IE. In addition, some adap-
tive IE tools are now using the ANNIE corn-
ponents to provide them with richer linguistic
information (Ciravegna et al., 2002).
References
P. Baker, A. Hardie, T. McEnery, H. Cunning-
ham, and R. Gaizauskas. 2002. EMILLE,
A 67-Million Word Corpus of Indic Lan-
guages: Data Collection, Mark-up and Har-
monisation. In
Proceedings of 3rd Lan-
guage Resources and Evaluation Conference
(LREC'2002),
pages 819-825.
F. Ciravegna, A. Dingli, D. Petrelli, and
Y. Wilks. 2002. User-System Cooperation in
Document Annotation Based on Information
Extraction. In
13th International Confer-
ence on Knowledge Engineering and Knowl-
edge Management (EKAW02),
pages 122-
137, Siguenza, Spain.
H. Cunningham, D. Maynard, K. Bontcheva,
and V. Tablan. 2002. GATE: A Framework
and Graphical Development Environment for
Robust NLP Tools and Applications. In
Pro-
ceedings of the 40th Anniversary Meeting of
the Association for Computational Linguis-

tics.
0. Hamza, D. Maynard V.Tablan, C. Ursu,
H. Cunningham, and Y. Wilks. 2002. Named
Entity Recognition in Romanian. Techni-
cal report, Department of Computer Science,
University of Sheffield.
D.
Maynard, V. Tablan, H. Cunningham,
C. Ursu, H. Saggion
;
K. Bontcheva, and
Y. Wilks. 2002. Architectural elements of
language engineering robustness.
Journal of
Natural Language Engineering - Special Is-
sue on Robust Methods in Analysis of Natural
Language Data,
8(2/3):257-274.
E.
Paskaleva, G. Angelova, M.Yankova,
K. Bontcheva, H. Cunningham, and
Y. Wilks. 2002. Slavonic named enti-
ties in gate. Technical Report CS-02-01,
University of Sheffield.
Katerina Pastra, Diana Maynard, Hamish Cun-
ningham, Oana Hamza, and Yorick Wilks.
2002. How feasible is the reuse of grammars
for named entity recognition? In
Proceed-
ings of 3rd Language Resources and Evalu-

ation Conference.
222

×