Combining the Maximum Entropy Model and Transformation-Based Learning for Part-of-Speech Tagging



Nguyễn Ngọc Khương

University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis in Computer Science; Code: 60 48 01 01
Supervisor: Assoc. Prof. Dr. Lê Anh Cường
Year of defense: 2014

Abstract. - The thesis proposes an improved method for part-of-speech (POS) tagging. It is based on
an analysis of how several machine learning methods relate to one another and on an evaluation of
their effectiveness on the POS tagging task. Instead of relying on a single machine learning
method, the proposed approach combines learning algorithms that build on one another in order to
limit tagging errors on exceptional cases. We first use one of the best-performing machine
learning methods for POS tagging, the statistical maximum entropy method, to build the base
model, and then use a transformation-based learning model to correct the remaining tagging errors.
- Building on Stanford Tagger and vnTagger, we implemented an improved POS tagger (CBTagger) to
serve as the base tagging component. We then implemented an error-correction module based on
transformation-based learning, which yields the POS tagger built on the combined model (CTagger).
We tested this toolkit on two languages, one representative of inflectional languages and one
representative of non-inflectional languages, to show the effectiveness of the proposed model for
POS tagging. Experiments with CTagger on different corpora show considerably higher accuracy than
the base model and than other POS taggers.
Keywords. Special informatics methods; Information extraction; Machine translation;
Part-of-speech tagging; Natural language processing
Content.
Chapter 1: Overview of the part-of-speech tagging problem. Chapter 1 presents the definition of
POS tagging, its place in natural language processing, and its applications. It also surveys
related work and analyzes the basic issues of the POS tagging problem.
Chapter 2: Background. This chapter presents basic concepts of POS tagging and the
characteristics of the corpora. It also introduces two typical machine learning methods, maximum
entropy and transformation-based learning, which serve as the underlying models when building the
combined model for POS tagging.
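
As a reminder of what the maximum entropy method estimates, the standard conditional form of such a model is shown below. The notation is an assumption made here for illustration and is not taken from the thesis: h is the context (history) of a word, t is a candidate tag, f_j are binary feature functions, and \lambda_j are their learned weights.

```latex
p(t \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_{j} \lambda_j f_j(h, t) \Big),
\qquad
Z(h) = \sum_{t'} \exp\Big( \sum_{j} \lambda_j f_j(h, t') \Big)
```

The tagger then chooses, for each word, the tag (or tag sequence) with the highest probability under this distribution.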
Chapter 3: The combined model proposed by the authors for POS tagging. This chapter describes how
language characteristics are analyzed to choose a context representation, and how linguistic
features are analyzed and selected to build the set of rule templates used in the learning
process of the proposed model. It also lays the theoretical foundation for the implementation and
experiments of the thesis.
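
The correction stage can be illustrated with a minimal Python sketch of transformation-based learning applied on top of a base tagger's output. This is not the CTagger implementation: the single rule template (change tag A to tag B when the previous tag is C), the function names, and the toy tag sequences are all assumptions made for illustration, and the base tags are hard-coded instead of coming from a maximum entropy model.

```python
# Hypothetical rule template: (from_tag, to_tag, prev_tag) means
# "change from_tag to to_tag when the previous token is tagged prev_tag".

def candidate_rules(current, gold):
    """Instantiate the template once for every remaining tagging error."""
    rules = set()
    for cur_sent, gold_sent in zip(current, gold):
        for i in range(1, len(cur_sent)):
            if cur_sent[i] != gold_sent[i]:
                rules.add((cur_sent[i], gold_sent[i], cur_sent[i - 1]))
    return rules

def apply_rule(rule, tags):
    """Apply one rule left to right and return a new tag list.
    Changes take effect immediately, so a rewritten tag can
    trigger the rule at the next position."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(current, gold):
    """Number of tokens whose current tag differs from the gold tag."""
    return sum(c != g for cs, gs in zip(current, gold) for c, g in zip(cs, gs))

def learn_rules(base_tags, gold, max_rules=10):
    """Greedily pick the rule with the largest net error reduction,
    apply it, and repeat until no rule helps or max_rules is reached."""
    current = [list(s) for s in base_tags]  # work on a copy of the base output
    learned = []
    for _ in range(max_rules):
        base_err = count_errors(current, gold)
        best_rule, best_gain = None, 0
        # sorted() makes tie-breaking deterministic: the first rule wins
        for rule in sorted(candidate_rules(current, gold)):
            retagged = [apply_rule(rule, s) for s in current]
            gain = base_err - count_errors(retagged, gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break
        current = [apply_rule(best_rule, s) for s in current]
        learned.append(best_rule)
    return learned

def correct(tags, rules):
    """Apply the learned rules, in learned order, to one tagged sentence."""
    for rule in rules:
        tags = apply_rule(rule, tags)
    return tags

# Toy data (assumed): the base tagger mislabels a verb after TO as NN.
base = [["PRP", "VBP", "TO", "NN"], ["DT", "NN", "VBZ"], ["NNS", "VBP", "TO", "NN"]]
gold = [["PRP", "VBP", "TO", "VB"], ["DT", "NN", "VBZ"], ["NNS", "VBP", "TO", "VB"]]
rules = learn_rules(base, gold)
print(rules)
print(correct(["PRP", "VBP", "TO", "NN"], rules))
```

Scoring each rule by net error reduction means a rule that repairs some tags but damages others is penalized for the damage, which is why the order in which rules are learned must be preserved when they are applied.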
Chapter 4: Experiments with the combined model on Vietnamese and English POS tagging, and
evaluation of the results. This chapter describes the experiments carried out in the thesis,
including the selection of the feature set and the application of the combined model to POS
tagging. The experimental results are then compared against the base model and several existing
models, with remarks on the strengths and weaknesses of the combined model for POS tagging.
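
A comparison of this kind is usually reported as token-level accuracy. The Python sketch below shows one way to compute it for two taggers on the same gold data; the function name and the toy tag sequences are assumptions for illustration and are not results from the thesis.

```python
def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag.
    Both arguments are lists of sentences, each a list of tags."""
    total = sum(len(sent) for sent in gold)
    correct = sum(
        p == g
        for pred_sent, gold_sent in zip(predicted, gold)
        for p, g in zip(pred_sent, gold_sent)
    )
    return correct / total if total else 0.0

# Toy data (assumed): output of a base tagger and of a combined tagger.
gold = [["PRP", "VBP", "TO", "VB"], ["DT", "NN", "VBZ"]]
base_out = [["PRP", "VBP", "TO", "NN"], ["DT", "NN", "VBZ"]]
combined_out = [["PRP", "VBP", "TO", "VB"], ["DT", "NN", "VBZ"]]

print("base:", token_accuracy(base_out, gold))
print("combined:", token_accuracy(combined_out, gold))
```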
The conclusion summarizes the results achieved and the contributions of the thesis, and outlines
directions for future research.
