
DSpace at VNU: Towards a framework for building an annotated named entities corpus


Towards a framework for building an
annotated named entities corpus
Hoàng Hữu Sơn
Trường Đại học Công nghệ (University of Engineering and Technology)
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Assoc. Prof. Dr. Phạm Bảo Sơn
Year of defense: 2010
Keywords. Information networks; Information technology; Natural language; Artificial intelligence

Content
Table of Contents

1 Introduction
1.1 Overview of Named Entity Recognition (NER)
1.2 NER approaches
1.2.1 Rule-based approach
1.2.2 Machine learning approach
1.2.3 Comparison
1.3 Thesis contribution
1.4 Thesis structure

2 Related Work
2.1 Overview of our problem
2.2 Research on building NER corpora
2.3 Research on the corpus building process
2.4 Overview of annotation tools
2.5 Summary

3 Corpus building process
3.1 Corpus building process
3.1.1 Objective
3.1.2 Building the annotation guideline
3.1.3 Annotating documents
3.1.4 Quality control
3.2 Building a Vietnamese NER corpus with offline tools
3.2.1 Building the annotation guideline
3.2.2 Annotating documents
3.2.3 Quality control
3.3 Discussion of the Vietnamese NER corpus building process
3.4 Conclusion

4 Online Annotation Framework
4.1 Introduction
4.2 Training section
4.3 Annotating documents
4.3.1 Online annotation interface
4.3.2 Automatic file distribution to annotators
4.3.3 Automatic saving and management of files
4.4 Quality control
4.4.1 Document level
4.4.2 Corpus level
4.4.3 Explaining unusual entities
4.5 Conclusion

5 Evaluation
5.1 Introduction
5.2 Corpus evaluation
5.2.1 Inter-annotator agreement
5.2.2 Offline corpus evaluation
5.2.3 Online corpus
5.3 Time cost
5.3.1 Overview
5.3.2 Offline process
5.3.3 Online framework
5.4 Named entity recognition system
5.4.1 Preprocessing
5.4.2 Gazetteer
5.4.3 Transducer
5.4.4 Experiment
5.5 Summary

6 Conclusion and Future Work
6.1 Conclusion
6.2 Future work
6.2.1 Creating a bigger, higher-quality corpus
6.2.2 Improving the online annotation framework
6.2.3 Building a statistical NER system

A Named Entity guideline
A.1 Basic concepts
A.1.1 Entity and Entity Name
A.1.2 Instance of entity
A.1.3 List of Entities
A.1.4 Entity recognition rules
A.2 Entity classification
A.2.1 Person
A.2.2 Organization
A.2.3 Location
A.2.4 Facility
A.2.5 Religion




