
Identifying coordinated compound words for Vietnamese word segmentation


VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

ENTRY FOR THE STUDENT SCIENTIFIC RESEARCH AWARD
2012
Title:
Identifying coordinated compound words for Vietnamese
word segmentation
Student: Nguyễn Minh Cường   Gender: Male
Class: K53CA   Faculty: Computer Science (KHMT)
Supervisors: Dr. Nguyễn Phương Thái
MSc. Trần Ngọc Anh
Ha Noi – 2012
Abstract
Word segmentation is considered the first step in most natural language processing
applications. Vietnamese word segmentation faces difficulties that most Western
languages do not. English and many other languages use blanks to separate words,
which makes word segmentation easy for a tokenizer. Vietnamese words, however, may
consist of one, two, or more syllables. In natural language processing, a dictionary
is an essential resource for analyzing language problems, from simple to complex.
Most Vietnamese dictionaries define only a small number of coordinated compound
words. Since most natural language processing systems depend heavily on a dictionary
in the word segmentation step, many problems appear when the tokenizer has to detect
coordinated compound words. We are building a coordinated compound word dictionary
with a large number of entries, which we hope will help improve the accuracy of
Vietnamese word segmentation.


Chapter 1
Introduction
Word segmentation is considered the first step in most natural language
processing applications. Vietnamese word segmentation faces difficulties that most
Western languages do not. English and many other languages use blanks to separate
words, which makes word segmentation easy for a tokenizer. Vietnamese words may
consist of one, two, or more syllables. In general, the meaning of a Vietnamese
compound word is created by combining the meanings of the syllables that make it up,
and blanks are not used to separate Vietnamese words. This creates problems for all
natural language processing tasks. The main problems are word ambiguity, unknown
word detection, and proper name recognition.
Chapter 2
Vietnamese word segmentation
2.1 Coordinated Compound Word
2.1.1 Definition
A coordinated compound word is made up of two or more syllables with similar
meanings, and the meaning of the word is a combination of the meanings of those
syllables. The syllables that make up a coordinated compound word stand in an equal
relation. In other words, the meaning of a coordinated compound word is more general
than that of each syllable and is based equally on all of them.
The order of the syllables in a coordinated compound word can often be changed. For
example: “quần áo”, “áo quần”, “chung riêng”, “riêng chung”, “đen đỏ”, “đỏ đen”,
“ốm đau”, “đau ốm”, etc.
2.1.2 Types of coordinated compound words
There are three types of coordinated compound words:
• All syllables are of Vietnamese origin: “đất nước”, “trời đất”, “đất cát”,
“ruộng đất”, “ruộng vườn”, “ruộng nương”, “ấm chén”, “bát đĩa”, “đỏ đen”,
“trắng đen”, “may rủi”, etc.
• All syllables are borrowed from Chinese: “ân nghĩa”, “nam nữ”, “đầu não”,
“đấu tranh”, “học tập”, “lợi lộc”, “thuận lợi”, etc.
• One syllable is of Vietnamese origin and one is borrowed from Chinese:
“binh lính”, “bụng dạ”, “lính tráng”, “nuôi dưỡng”, “gan dạ”, etc.
2.2 VCL Dictionary
In natural language processing, a dictionary is an essential resource for
analyzing language problems, from simple to complex. A good lexicon should provide
a language processing system with natural language information at many different
levels, such as morphology, grammar, and semantics, and should be usable in both
monolingual and multilingual processing systems.
VCL (Vietnamese Computational Lexicon) is a dictionary from Vietlex with
35,000 words, created for natural language processing purposes. Each word in the
dictionary is described by morphological, syntactic, and semantic information.
• Morphology:
Morphological information includes HeadWord and WordType.
Figure 1 Basic information and morphology of “bàn” (noun)
• Syntax:
Syntactic information includes category (noun, verb, adverb, adjective, etc.),
subcategory (proper noun, countable noun, abstract noun, etc.), frame set, forward
and backward.
Figure 2 Syntax of “bàn” (verb) – frameset
Figure 3 Syntax of “ăn” (verb) – forward, backward
• Semantics:
Semantic information includes logical constraints and semantic constraints.
- Logical constraints include categorial meaning, synonyms and antonyms.
Categorial meaning can be understood as a “semantic-wordtype”; for example,
“tướng sĩ” and “tướng tá” belong to “People”, and “trâu” and “bê” belong to
“Mammal”. Synonyms and antonyms help in analysing and using words correctly.
Figure 4: Semantics tree
- Semantic constraints: information about the “semantic role” of a word in a
sentence: agent, experiencer, possessor, force, patient, recipient, reference,
concomitant, etc.
Figure 5 Semantics information of “bắt” (verb)
Figure 6 VCL in xml format
Chapter 3
Building Coordinated Compound Word Dictionary
Vietnamese word segmentation relies heavily on the words defined in the
dictionary, so a good dictionary is very important. The VCL dictionary contains only
a small number of coordinated compound words. The purpose of building a coordinated
compound word dictionary is to increase the accuracy of Vietnamese word segmentation
when detecting coordinated compound words.
Building the coordinated compound word dictionary on top of the VCL dictionary
involves several steps.
3.1 Finding coordinated compound words already defined in the VCL dictionary
This step is supported by a small web-based system. After this step, the
dictionary contains more than 1600 coordinated compound words.
• The system uses the Rails 3.1 framework with a MongoDB database (via Mongoid).
• It reads the VCL dictionary and stores it in the database.
• It displays the dictionary.
• Approach: most of the coordinated compound words defined in the VCL dictionary
have one or more of the following characteristics:
  - The <def> field contains the string “[nói khái quát]” (“generally speaking”).
  - The <def> field contains the string “[nói gộp]” (“collectively speaking”).
  - The <def> field contains the syllables of the word and the word “và” (“and”).
  - The <synonym> field contains the reversed form of the headword.
• Query all possible cases and sort the results so that the entries meeting the
most conditions come first.
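The scoring and sorting described in the list above can be sketched roughly as follows. This is an illustrative Python sketch, not the Rails code used in the project. Only the <def> and <synonym> field names come from the text; the <entries>, <entry> and <head> tags, the sample definitions, and the comma-separated synonym format are assumptions made for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Made-up sample data. Only the <def> and <synonym> field names come from the
# text; <entries>, <entry>, <head> and the definitions are assumptions.
SAMPLE_XML = """
<entries>
  <entry><head>quần áo</head><def>quần và áo [nói khái quát]</def><synonym>áo quần</synonym></entry>
  <entry><head>bàn</head><def>đồ dùng có mặt phẳng và chân</def><synonym></synonym></entry>
  <entry><head>sách vở</head><def>sách và vở [nói khái quát]</def><synonym></synonym></entry>
</entries>
"""


def nfc(text):
    """Normalize to NFC so composed and decomposed Vietnamese letters compare equal."""
    return unicodedata.normalize("NFC", text or "")


GENERAL_MARK = nfc("[nói khái quát]")
COLLECTIVE_MARK = nfc("[nói gộp]")
AND_WORD = nfc("và")


def score_entry(head, definition, synonyms):
    """Count how many of the four characteristics an entry shows."""
    syllables = head.split()
    if len(syllables) < 2:  # a compound word needs at least two syllables
        return 0
    def_tokens = definition.split()
    reversed_head = " ".join(reversed(syllables))
    conditions = [
        GENERAL_MARK in definition,
        COLLECTIVE_MARK in definition,
        AND_WORD in def_tokens and all(s in def_tokens for s in syllables),
        reversed_head in synonyms,
    ]
    return sum(conditions)


def rank_candidates(xml_text):
    """Return (headword, score) pairs with score > 0, best candidates first."""
    root = ET.fromstring(xml_text)
    scored = []
    for entry in root.iter("entry"):
        head = nfc(entry.findtext("head"))
        definition = nfc(entry.findtext("def"))
        # Assumption: several synonyms are separated by commas.
        raw_synonyms = nfc(entry.findtext("synonym"))
        synonyms = [s.strip() for s in raw_synonyms.split(",") if s.strip()]
        score = score_entry(head, definition, synonyms)
        if score > 0:
            scored.append((head, score))
    # sorted() is stable, so entries with equal scores keep dictionary order.
    return sorted(scored, key=lambda pair: -pair[1])


if __name__ == "__main__":
    for head, score in rank_candidates(SAMPLE_XML):
        print(score, head)
```

Sorting by the number of matched conditions only orders the candidates for review; a high score does not by itself confirm that an entry is a coordinated compound word.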
A script lets the reviewer mark a correct coordinated compound word with a single
click (the chosen word is then displayed in italics and its flag is set to true).
Figure 1 Example of coordinated compound word
3.2 Try to classify these compound words and other simple words
The compound words found in step 3.1 and the other simple words in the dictionary
are classified by ‘categorial meaning’ (semantic-wordtype). Within each class, two
simple words that belong to the same ‘categorial meaning’ are matched to make new
candidate coordinated compound words. For example:
Simple words: giường, chiếu, chăn, màn, gối
New compound words: giường chiếu, chăn màn, chăn chiếu, chăn gối, chiếu gối,
màn chiếu
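The matching step above can be sketched as follows. This is a minimal Python illustration that assumes each simple word already carries a class label. The label "Bedding" and the function name `candidate_pairs` are hypothetical and do not come from the VCL dictionary.

```python
from collections import defaultdict
from itertools import combinations

# Simple words from the example; the class label "Bedding" is hypothetical.
SIMPLE_WORDS = [
    ("giường", "Bedding"),
    ("chiếu", "Bedding"),
    ("chăn", "Bedding"),
    ("màn", "Bedding"),
    ("gối", "Bedding"),
]


def candidate_pairs(words_with_class):
    """Group words by class and pair every two words inside the same class."""
    by_class = defaultdict(list)
    for word, categorial_meaning in words_with_class:
        by_class[categorial_meaning].append(word)
    candidates = {}
    for categorial_meaning, words in by_class.items():
        # combinations() yields each unordered pair once, in input order.
        candidates[categorial_meaning] = [
            f"{first} {second}" for first, second in combinations(words, 2)
        ]
    return candidates


if __name__ == "__main__":
    for name, pairs in candidate_pairs(SIMPLE_WORDS).items():
        print(name, len(pairs), pairs)
```

Five words in one class give ten unordered pairs, more than the six compounds listed in the example, so the output is only a candidate list. Each pair, and the preferred syllable order, still has to be checked by a reviewer.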
Figure 2 Classifying the simple words by ‘categorial meaning’

3.3 Finding new coordinated compound words by reversing existing words

quần áo => áo quần
chung thủy => thủy chung
đỏ đen => đen đỏ
rừng núi => núi rừng
bay lượn => lượn bay
All possible reversed words are created from the coordinated compound words that
have already been reviewed. Each newly created word has the same ‘categorial
meaning’, category, subcategory and definition as the original word.
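A minimal Python sketch of this reversal step is given below. The dictionary keys (`head`, `category`, `categorial_meaning`, `definition`, `reviewed`) and the sample values are hypothetical. The sketch only shows the idea of copying an entry under the reversed headword and leaving it unreviewed.

```python
# Hypothetical reviewed entries; key names and values are illustrative only.
REVIEWED = [
    {"head": "quần áo", "category": "noun",
     "categorial_meaning": "Clothing", "definition": "..."},
    {"head": "đỏ đen", "category": "adjective",
     "categorial_meaning": "Colour", "definition": "..."},
    {"head": "đen đỏ", "category": "adjective",
     "categorial_meaning": "Colour", "definition": "..."},
]


def reversed_entries(entries):
    """Create a copy of each entry under its reversed headword, if not yet known."""
    known = {entry["head"] for entry in entries}
    new_entries = []
    for entry in entries:
        syllables = entry["head"].split()
        reversed_head = " ".join(reversed(syllables))
        # Skip single syllables and reversals that are already in the dictionary.
        if len(syllables) < 2 or reversed_head in known:
            continue
        new_entry = dict(entry)          # same category, meaning and definition
        new_entry["head"] = reversed_head
        new_entry["reviewed"] = False    # still has to be checked by a person
        new_entries.append(new_entry)
        known.add(reversed_head)
    return new_entries


if __name__ == "__main__":
    for entry in reversed_entries(REVIEWED):
        print(entry["head"], entry["category"], entry["categorial_meaning"])
```

Here only “quần áo” produces a new entry, because “đỏ đen” and “đen đỏ” are both already present. Marking generated entries as unreviewed reflects the manual check that follows, since not every reversed form is a word in actual use.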
Figure 3 Listing all reversed forms of the coordinated compound words for checking
3.4 Review and estimate the accuracy of the dictionary
The new coordinated compound words (about 3000 words) are stored in the same
format as the VCL dictionary, so they can easily be used to improve the accuracy of
Vietnamese word segmentation.
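To illustrate how additional dictionary entries can change a segmenter's output, the sketch below uses greedy longest matching, a common dictionary-based baseline. The text does not say which segmenter the dictionary is meant for, so this technique is an assumption. The tiny lexicon, the sample sentence and the `max_len` parameter are also made up for the example.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy longest matching: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Made-up lexicon: it knows "rừng núi" but not the reversed form "núi rừng".
BASE_LEXICON = {"rừng núi"}
EXTENDED_LEXICON = BASE_LEXICON | {"núi rừng"}
SENTENCE = "núi rừng rất đẹp".split()

if __name__ == "__main__":
    print(segment(SENTENCE, BASE_LEXICON))      # "núi" and "rừng" stay separate
    print(segment(SENTENCE, EXTENDED_LEXICON))  # "núi rừng" becomes one word
```

With the base lexicon the two syllables are returned as separate words, and with the extended lexicon they are joined into one. This shows the mechanism only and is not a measurement of accuracy.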
3.5 Future work
Because of limited time and limited knowledge of Vietnamese vocabulary, the
dictionary is still small. Work is continuing to find more words and enlarge the
dictionary.