DSpace at VNU: Extraction of Vietnamese collocation from text corpora


Extraction of Vietnamese collocation from
text corpora
Đỗ Thị Ngọc Quỳnh

University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis in Computer Science; Code: 60 48 01
Supervisor: Dr. Lê Anh Cường
Year of defense: 2011
Abstract. Collocations are widely used in linguistics, in dictionary compilation,
and in natural language processing. Extracting the collocations of a language is
therefore necessary, both to improve the accuracy and naturalness of natural
language processing applications and to make learning a new language easier. In
Vietnam, however, the study of collocations is still a fairly new field. This thesis
studies several collocation extraction methods in order to find an efficient model
for extracting Vietnamese collocations. The methods are based on commonly used
classic statistical measures such as frequency, the t-test, chi-square, and mutual
information. We also propose a general method that uses a linguistic measure to
increase the accuracy of extraction. The input data consists of POS-tagged data and
parsed data. By running the program with individual methods and with combinations
of several methods, and comparing their accuracy, we identify an efficient method
for extracting Vietnamese collocations from text corpora.
Keywords. Language processing; Data processing; Natural language; Artificial intelligence
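
The statistical measures named in the abstract (frequency, t-test, chi-square, mutual information) can be illustrated with a minimal sketch. It uses the standard textbook formulations for adjacent word pairs and is not the thesis's actual implementation. The function name `score_bigrams` and the toy token list are assumptions made for illustration only.

```python
import math
from collections import Counter


def score_bigrams(tokens):
    """Return {(w1, w2): {"freq", "t", "chi2", "pmi"}} for adjacent word pairs.

    Hypothetical helper for illustration; formulas are the usual textbook ones.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)                           # number of bigram positions
    pair_count = Counter(pairs)
    left = Counter(w1 for w1, _ in pairs)    # times a word opens a bigram
    right = Counter(w2 for _, w2 in pairs)   # times a word closes a bigram

    scores = {}
    for (w1, w2), o11 in pair_count.items():
        c1, c2 = left[w1], right[w2]
        expected = c1 * c2 / n               # expected count under independence

        # t-test with variance approximated by the sample mean:
        # (o11/n - expected/n) / sqrt((o11/n) / n) simplifies to the line below.
        t = (o11 - expected) / math.sqrt(o11)

        # Pearson chi-square on the 2x2 contingency table of (w1?, w2?).
        o12 = c1 - o11                       # w1 followed by something else
        o21 = c2 - o11                       # something else followed by w2
        o22 = n - c1 - c2 + o11              # neither w1 first nor w2 second
        denom = c1 * c2 * (n - c1) * (n - c2)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

        # Point-wise mutual information in bits.
        pmi = math.log2(o11 / expected)

        scores[(w1, w2)] = {"freq": o11, "t": t, "chi2": chi2, "pmi": pmi}
    return scores


if __name__ == "__main__":
    # Toy, made-up token sequence; real input would be a segmented corpus.
    toy = ["uống", "trà", "và", "uống", "trà", "rồi", "đi"]
    ranked = sorted(score_bigrams(toy).items(),
                    key=lambda kv: kv[1]["t"], reverse=True)
    for pair, s in ranked[:3]:
        print(pair, s)
```

On a real corpus these raw scores are usually combined with a minimum-frequency cutoff, because PMI and chi-square tend to overrate pairs that occur only once or twice.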

Table of Contents

1 Introduction
    1.1 Definitions
    1.2 Related works and motivation
    1.3 Contribution of the thesis

2 Collocation: concept, roles and applications
    2.1 Collocations' characteristics
        2.1.1 Recurrent
        2.1.2 Arbitrary
        2.1.3 Domain-dependent
        2.1.4 Non-substitutability (closely linked in terms of vocabulary)
    2.2 Classification of collocations
        2.2.1 Idiomatic Phrases
        2.2.2 Support Verb Construction
        2.2.3 Fixed Phrases
    2.3 Applications
    2.4 Vietnamese collocations

3 Basic methods in Collocation extraction
    3.1 Frequency
    3.2 Hypothesis testing
        3.2.1 T-Test
        3.2.2 Chi-Square
    3.3 Point-wise Mutual Information (PMI)

4 Our proposal for extracting Vietnamese collocation
    4.1 Patterns for Vietnamese collocation
    4.2 The Linguistic Measure
    4.3 Designed model

5 Experiments
    5.1 Data preparation
        5.1.1 Collecting corpora
        5.1.2 Extracting bi-grams
        5.1.3 Adding syntactic information to bi-grams
    5.2 The test models
    5.3 Experimental results with statistical methods
        5.3.1 Bi-grams with syntactic information
    5.4 The experiments of our proposal

6 Conclusion

Bibliography

Thesis-related publications:
- Le Anh Cuong, Do Thi Ngoc Quynh, and Cao Van Viet. Building and Evaluating
  Vietnamese Language Models. VNU Journal of Science (revising).
- Le Anh Cuong and Do Thi Ngoc Quynh. Vietnamese collocation extraction (to be
  submitted).
