Building a Treebank for Vietnamese
Dependency Parsing
Luong Nguyen Thi
Dalat University
Information technology
Lamdong, Vietnam

Linh Ha My, Hung Nguyen Viet,
Huyen Nguyen Thi Minh, Phuong Le Hong
VNU University of Science
Hanoi, Vietnam

Abstract—The problem of Vietnamese syntactic parsing, especially constituency parsing, has recently been tackled by several
research groups. A common effort of the Vietnamese language
processing community has produced VietTreebank, a reference
parsed corpus of about 10,000 sentences for the constituency
parsing task. In this paper, we present our work on building a
reference treebank, based on VietTreebank, for the dependency
parsing task, which has not been well studied for Vietnamese.
First, we define a dependency label set by adapting the
dependency schema developed by the NLP group at Stanford
University and by taking into account the particularities of
Vietnamese grammar. Then we propose an algorithm to convert
a constituency treebank into a dependency one. The algorithm is
tested on a set of 100 sentences of the VietTreebank corpus and
reaches an unlabeled attachment score of 99.6%. Finally, we
carry out an experiment on Vietnamese dependency parsing using
the MaltParser tool and the dependency treebank converted from
VietTreebank.

I. INTRODUCTION
Dependency parsing has been an active approach to syntactic
parsing in recent years. The basic idea of dependency parsing is
to find a syntactic structure which consists of lexical items
linked by binary asymmetric relations called dependencies. There
have been many studies on dependency parsing, and many tools
have been developed for it. In particular, methods based on
machine learning give high parsing accuracy for languages such
as English, Chinese and Swedish.
For Vietnamese, most studies have centered on constituency
parsing, such as [1], [2]. The Vietnamese treebank reported in [2]
consists of about 10,000 sentences in Penn Treebank format.
For dependency parsing, there exist only two works: that of
Nguyễn Lê Minh et al. [3], which uses MSTParser on a corpus
of 450 sentences, and that of Lê Hồng Phương et al. [4], which
uses a lexicalized tree-adjoining grammar parser trained on a
subset of the Vietnamese treebank.
In this paper, we report our work on building a large
corpus for Vietnamese dependency parsing. We first develop
algorithms for converting constituency structures to dependency
structures. We then use the resulting dependency treebank to
train and evaluate MaltParser, a language-independent
dependency parser [5], and report the parsing results.
This paper is organized as follows. The next section introduces
dependency parsing, giving basic concepts and some existing
works. The following section presents the construction of a
Vietnamese dependency treebank. Finally, the last section
reports experimental results on Vietnamese dependency parsing
with MaltParser.
II. DEPENDENCY PARSING
A. Definition
Syntax is studied by two research communities: linguists and
computer scientists. For linguists, natural language is the
object of study, and formal syntax is one of the language levels
to be described. Computer scientists develop models and
algorithms that let computers analyze formal syntax in order to
build natural language processing applications.
A dependency syntax describes syntactic structures made of
lexical items, or tokens, connected by binary asymmetric
relations called dependencies. A dependency relation between two
tokens can be named to clarify the relationship between them.
A dependency structure is determined by the relationship
between the central token (head) and its dependent token
(dependent), denoted by an arrow. By convention, the arrow
starts at the head and points to the dependent.
In comparison to constituency structure, dependency structure
is more appropriate for representing the syntax of free word
order languages, such as Czech or Turkish.
In dependency parsing, each syntactic parse of a sentence
can be represented by a dependency graph, a graph in which each
node is a token of the sentence. The arcs (edges) of the graph
represent dependency relations between two nodes, and the label
of an arc is the dependency label of that relation.
For example, consider the English sentence "Bills on ports
and immigration were submitted by Senator Brownback,
Republican of Kansas". Its dependency graph (Fig. 1) contains
13 nodes corresponding to the 13 words and 12 relations
connecting them. The relations in the sentence include
prep(Bills, on), pobj(on, ports), ... [6].
Also by convention, there is a special node which does not
correspond to any token in the sentence and always represents
the root of the dependency graph.
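As an illustration only, the graph of Fig. 1 can be stored as a list of labeled arcs over token indices, with index 0 reserved for the special root node. The sketch below is not part of the original system: the names `Arc`, `TOKENS`, `ARCS` and `is_dependency_tree`, the lowercase `root` label, and the arc list itself (a reading of Fig. 1) are assumptions made for exposition.

```python
from collections import namedtuple

# One labeled arc, written label(head, dependent); tokens are given by index.
Arc = namedtuple("Arc", ["label", "head", "dependent"])

# Index 0 is the artificial root node; indices 1..13 are the 13 words.
TOKENS = ["ROOT", "Bills", "on", "ports", "and", "immigration", "were",
          "submitted", "by", "Senator", "Brownback", "Republican", "of",
          "Kansas"]

# Arcs as read off Fig. 1 (an assumption wherever the figure is unclear).
ARCS = [
    Arc("root", 0, 7),
    Arc("nsubjpass", 7, 1),
    Arc("auxpass", 7, 6),
    Arc("prep", 7, 8),
    Arc("prep", 1, 2),
    Arc("pobj", 2, 3),
    Arc("cc", 3, 4),
    Arc("conj", 3, 5),
    Arc("pobj", 8, 10),
    Arc("nn", 10, 9),
    Arc("appos", 10, 11),
    Arc("prep", 11, 12),
    Arc("pobj", 12, 13),
]


def is_dependency_tree(n_words, arcs):
    """Check that every word has exactly one head and reaches the root."""
    head_of = {}
    for arc in arcs:
        if arc.dependent in head_of:       # a second head for the same word
            return False
        head_of[arc.dependent] = arc.head
    words = set(range(1, n_words + 1))
    if set(head_of) != words:              # some word has no head at all
        return False
    if any(head != 0 and head not in words for head in head_of.values()):
        return False                       # a head index outside the sentence
    for word in words:
        seen = set()
        node = word
        while node != 0:                   # follow heads up to the root
            if node in seen:               # cycle: the root is never reached
                return False
            seen.add(node)
            node = head_of[node]
    return True


def show(arc):
    """Render an arc in the label(head, dependent) notation of the text."""
    return f"{arc.label}({TOKENS[arc.head]}, {TOKENS[arc.dependent]})"
```

With this representation, `show(ARCS[4])` gives the relation prep(Bills, on) quoted above, and the 12 arcs between words plus the arc from the root node form a tree.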
Dependency parsing is the problem of constructing the
most probable dependency graph for a given input sentence.
The input of a dependency parser is a tokenized and
part-of-speech tagged sentence. Most studies on dependency
parsing employ machine learning techniques. To build a
supervised dependency parser for a language, we need a large
dependency treebank of that language.

[Figure: dependency tree rooted at "submitted", with arcs labeled nsubjpass, auxpass, prep, pobj, cc, conj, nn and appos]
Fig. 1. Dependency graph of an English sentence.

B. Related Works
Recently, dependency parsing has received the attention
of many research groups. There have been many studies and
software tools for dependency parsing, such as MaltParser,
Stanford Parser and MSTParser. Most dependency parsing tools
achieve high accuracy and are suitable for many languages, such
as English, Chinese, German and Czech. The accuracy of a parser
is evaluated using two indices: the unlabeled attachment score
(ASU), which is the proportion of tokens with the correct head,
and the labeled attachment score (ASL), which is the proportion
of tokens with both the correct head and the correct dependency
type.
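The two scores can be computed directly from a gold parse and a predicted parse of the same tokens. The following is a minimal sketch under the definitions above; the function name `attachment_scores` and the toy data are hypothetical and only serve to show the arithmetic.

```python
def attachment_scores(gold, predicted):
    """Return (ASU, ASL) in percent for two parses of the same tokens.

    Each parse is a list of (head_index, label) pairs, one per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("parses must cover the same tokens")
    if not gold:
        raise ValueError("empty parse")
    correct_heads = 0
    correct_heads_and_labels = 0
    for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
        if g_head == p_head:
            correct_heads += 1
            if g_label == p_label:         # counted only when the head is right
                correct_heads_and_labels += 1
    total = len(gold)
    asu = 100.0 * correct_heads / total
    asl = 100.0 * correct_heads_and_labels / total
    return asu, asl


# Toy data, made up for illustration: four tokens, where the prediction
# has one wrong label (token 3) and one wrong head (token 4).
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "vmod"), (2, "amod")]
asu, asl = attachment_scores(gold, pred)
```

Here three of the four heads are correct, and two of the four tokens have both the correct head and the correct label, so ASU is 75% and ASL is 50%.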
1) MSTParser: MSTParser is developed by Ryan McDonald et al. [7].
MSTParser has two phases: training and parsing. In training,
MSTParser uses online learning algorithms [8]. In parsing,
MSTParser uses a graph-based algorithm. The accuracy of
MSTParser on a variety of languages is quite high:
ASU = 92.8%, ASL = 90.7% for Japanese; ASU = 91.1%,
ASL = 85.9% for Chinese; ASU = 90.4%, ASL = 87.3%
for German.
2) Stanford Parser: Stanford Parser is developed by the NLP
group at Stanford University. Stanford Parser defines 53
dependency types for English based on the Penn Treebank [6].
The accuracy of the parser is quite high; in particular, for
English, ASU = 87.2% and ASL = 84.2%. This parser has
been extended to parse languages other than English, such as
Chinese, German, French and Arabic.

3) MaltParser: MaltParser is developed by Johan Hall et al.
MaltParser is one of the most effective dependency parsing
tools, with high accuracy for more than 20 languages. MaltParser
has two phases: training and parsing. In training, MaltParser
uses support vector machines. In parsing, MaltParser uses a
transition-based algorithm. The accuracy of the tool is high,
for example ASU = 88.1%, ASL = 86.3% for English and
ASU = 88.1%, ASL = 83.4% for German.
All of the above tools are trained using supervised machine
learning algorithms and require a large corpus for the language
concerned. No such dependency corpus exists for Vietnamese.
The most important step in developing a dependency parser for
Vietnamese is therefore to build a dependency corpus. In
the next section, we present our work on constructing a
Vietnamese dependency corpus.
III. BUILDING VIETNAMESE DEPENDENCY TREEBANK
The original constituency treebank is a corpus containing
about 10,000 sentences in Penn Treebank format. An example
sentence is (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H
của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn)
(NP-DOB (N-H người) (A nghèo))) (. .)), where
• S, NP, PP are the labels of phrases and clauses;
• Nc, N, R are the labels of tokens;
• SUB, H, DOB,... are the functional syntactic labels of
phrases, clauses or tokens.
We design an algorithm to convert this constituency treebank
to a dependency treebank. The algorithm has two steps: (1)
determining all the dependencies in the sentence and (2)
labeling the dependency relations. The first step is solved
by determining the central element (head) of every
grammatical phrase and clause using head rules. The second
step uses a dependency label set and rules for
labeling dependencies.
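The bracketed example above can be read into a tree before any conversion takes place. The sketch below is one possible way to do so and is not taken from the original implementation: the `Node` class and the functions `split_label`, `parse_brackets` and `leaves` are hypothetical names, and the label handling assumes that functional labels are attached with a hyphen, as in NP-SUB or N-H.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    category: str                       # e.g. "NP", "V", "."
    functions: List[str]                # e.g. ["SUB"], ["H"], or []
    word: Optional[str] = None          # set only on leaves
    children: List["Node"] = field(default_factory=list)


def split_label(label):
    """Split "NP-SUB" into ("NP", ["SUB"]); a bare "-" label stays whole."""
    category, *functions = label.split("-")
    if not category:                    # label such as "-" has no category part
        return label, []
    return category, functions


def parse_brackets(text):
    """Parse one bracketed sentence into a Node tree."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def parse_node():
        nonlocal position
        assert tokens[position] == "(", "expected '('"
        position += 1
        category, functions = split_label(tokens[position])
        position += 1
        node = Node(category, functions)
        if tokens[position] == "(":     # phrase: one or more child nodes
            while tokens[position] == "(":
                node.children.append(parse_node())
        else:                           # leaf: a single word
            node.word = tokens[position]
            position += 1
        assert tokens[position] == ")", "expected ')'"
        position += 1
        return node

    root = parse_node()
    assert position == len(tokens), "trailing input after the sentence"
    return root


def leaves(node):
    """Return the leaf nodes in sentence order."""
    if node.word is not None:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


SENTENCE = ("(S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) "
            "(NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) "
            "(NP-DOB (N-H người) (A nghèo))) (. .))")
tree = parse_brackets(SENTENCE)
```

For the example sentence, `leaves(tree)` yields the ten tokens in order, each carrying its part-of-speech label and, where present, the H marker used by the head rules.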
A. Dependency Schema
Different dependency labels represent different types of
relations between pairs of tokens in a sentence. Typically,
the set of dependency labels depends on the particular language.
Nevertheless, many languages share an important subset
of dependency labels.
The dependency schema developed by the NLP group at
Stanford University defines 53 types of English dependency.
All of them are binary relations, each holding between a head
and its dependent. We adapt and extend this schema to build a
dependency schema for Vietnamese which takes into account the
particularities of Vietnamese grammar [9]. This schema consists
of 48 labels, all of which are explicitly defined and consistent
with Vietnamese syntax. The most common dependency labels are
given below:
• vmod: verb modifier, for example vmod(đi, qua) in (VP
(V-H đi) (V qua));
• rmod: adverb modifier, for example rmod(Xa_xa, nữa) in
(AP (A-H Xa_xa) (R nữa));
• dobj: direct object of a verbal phrase, for example
dobj(còn, người) in (VP (R không) (V-H còn) (NP-DOB
(N-H người) (A nghèo)));
• pobj: object of a preposition, for example
pobj(bằng, cùi_tay) in (PP-MNR (E-H bằng) (NP (M hai)
(N-H cùi_tay) (A cụt_lủn))).

Fig. 2. An example of dependency parsing in Vietnamese

Figure 2 shows a dependency parse which includes the
following dependency relations:
ncdep(Mảnh - 1, đất - 2)
prepc(Mảnh - 1, của - 3)
nsubj(còn - 7, Mảnh - 1)
pobj(của - 3, đạn - 4)
nn(đạn - 4, bom - 5)
neg(còn - 7, không - 6)
Root(ROOT - 0, còn - 7)
dobj(còn - 7, người - 8)
amod(người - 8, nghèo - 9)
punct(còn - 7, . - 10)

B. Head Rules
In order to determine the head element of each phrase, we
build a head rule table. This table constitutes an important part
of our work. Our head rules follow those presented in [10].
S      →  -H;S;VP;AP;NP;.*
SBAR   →  -H;SBAR;S;VP;AP;NP;.*
SQ     →  -H;SQ;VP;AP;NP;.*
NP     →  -H;NP;Nc;Nu;Np;N;P;.*
VP     →  -H;VP;V;A;AP;N;NP;S;.*
AP     →  -H;AP;A;N;S;.*
RP     ←  -H;RP;R;T;NP;.*
PP     →  -H;PP;E;VP;SBAR;AP;QP;.*
QP     →  -H;QP;M;.*
XP     →  -H;XP;X;.*
YP     →  -H;YP;Y;.*
MDP    →  -H;MDP;T;I;A;P;R;X;.*
WHNP   →  -H;WHNP;NP;Nc;Nu;Np;N;P;.*
WHAP   →  -H;WHAP;A;N;V;P;X;.*
WHRP   →  -H;WHRP;P;E;T;X;.*
WHPP   →  -H;WHPP;E;P;X;.*
WHXP   →  -H;XP;X;.*
UCP    →  -H;.*
WHADV  →  -H;R;.*
WHVP   →  -H;V;.*
For example, the rule
S      →  -H;S;VP;AP;NP;.*
can be understood as follows: to find the head of a sentence S,
we scan its children from left to right for the first element
marked -H; if there is such an element, it is the head of the
sentence; if not, we look for an S element to be the head; if no
S is found, we look for a VP, and so on. If none of the listed
elements is found, the first element from the left is taken as
the head (".*").
C. Conversion Algorithm
The conversion algorithm has two stages. In the first stage,
a constituency parse is constructed from the bracket format
of each sentence of the treebank. For example, the parsed
sentence (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của)
(NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB
(N-H người) (A nghèo))) (. .)) has the constituency parse
shown in Figure 3. In the second stage, the constituency
parse is converted to a dependency one. This stage has three
steps. First, find the head of each phrase in the sentence using
the head rule table (see Algorithm 1). Second, find a label for
each dependency (head, dependent) (see Algorithm 2). Finally,
build all the labeled dependencies using a recursive routine
that calls the two previous steps (see Algorithm 3).

[Figure: constituency tree of the example sentence]
Fig. 3. A constituency parse of a sentence in the Vietnamese treebank.

D. Results
To evaluate the accuracy of the conversion algorithm, we
first select a subset of 100 sentences from the Vietnamese
treebank and manually annotate them with dependency relations.
We then run the conversion algorithm presented above on
these sentences to get dependency parses and compare them to
the manual annotation. The result is very good: the unlabeled
attachment score is 99.6%, and the labeled attachment score
is perfect on matched attachments.
As an example, from the constituency parse (S-TTL (NP-SUB
(Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom))))
(VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) (. .)),
the automatic conversion algorithm produces the dependency
parse listed below (after Algorithms 1-3).

Algorithm 1 FindHeadP(P, lstHeadRules, lstElements)
Require: P: a phrase; lstElements: list of elements in P;
  lstHeadRules: list of head rules
Ensure: head of P
  for headRule ∈ lstHeadRules do
    if headRule.Phrase = P then
      hr ← headRule
      break
    end if
  end for
  lstRightHR ← hr.Right
  for rightEle ∈ lstRightHR do
    for element ∈ lstElements do
      if element.Phrase = rightEle or element.Pos = rightEle then
        head ← element
        return head
      end if
    end for
  end for
Algorithm 2 GetDependencyLabel(h, d, lstLabels)
Require: (h, d), where h is a head and d is its dependent;
  lstLabels: list of labels
Ensure: a dependency label l for the arc h → d
  for labelEle ∈ lstLabels do
    left ← GetInformation(h, labelEle.Left)
    right ← GetInformation(d, labelEle.Right)
    center ← GetCenterInformation(h, d, labelEle.Center)
    if IsLabel(left, right, center) then
      l ← labelEle.Label
      break
    end if
  end for
  return l

Algorithm 3 ConvertToDP(Root, lstHeadRules, lstLabels, dpTree)
Require: Root: root node of the constituency tree; lstHeadRules:
  list of head rules; lstLabels: list of dependency labels;
  dpTree: saved dependency tree
Ensure: head of the sentence
  if Root = null then
    return
  end if
  if IsLeaf(Root) then
    lstElements ← Word(Root)
    return FindHeadP(Phrase(Root), lstHeadRules, lstElements)
  end if
  if AllChildIsLeaf(Root) then
    for child ∈ Root do
      append Word(child) to lstElements
    end for
    h ← FindHeadP(Phrase(Root), lstHeadRules, lstElements)
    for child ∈ Root, child ≠ h do
      label ← GetDependencyLabel(h, child, lstLabels)
      add (h, child, label) to dpTree
    end for
    return h
  end if
  lstHeadChilds ← null
  for child ∈ Root do
    append ConvertToDP(child, lstHeadRules, lstLabels, dpTree)
      to lstHeadChilds
  end for
  h ← FindHeadP(Phrase(Root), lstHeadRules, lstHeadChilds)
  for headChild ∈ lstHeadChilds, headChild ≠ h do
    label ← GetDependencyLabel(h, headChild, lstLabels)
    add (h, headChild, label) to dpTree
  end for
  return h
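The interplay of Algorithms 1 and 3 can be illustrated with a short runnable sketch. It is a simplified reading of the pseudocode, not the original implementation: it covers only the four phrase types of the running example, ignores the direction arrows of the head rule table (all four rules scan left to right), writes the tree as nested tuples, and leaves out the labeling step of Algorithm 2 because the labeling rules themselves are not listed in the paper. The names `HEAD_RULES`, `DEFAULT_RULE`, `matches`, `find_head` and `convert` are hypothetical.

```python
# Head rules for the phrase types of the example only, copied from the head
# rule table; other phrase types fall back to DEFAULT_RULE (an assumption).
HEAD_RULES = {
    "S": ["-H", "S", "VP", "AP", "NP", ".*"],
    "NP": ["-H", "NP", "Nc", "Nu", "Np", "N", "P", ".*"],
    "VP": ["-H", "VP", "V", "A", "AP", "N", "NP", "S", ".*"],
    "PP": ["-H", "PP", "E", "VP", "SBAR", "AP", "QP", ".*"],
}
DEFAULT_RULE = ["-H", ".*"]


def matches(label, item):
    """Does a child label such as "N-H" or "NP-DOB" satisfy one rule item?"""
    category, *functions = label.split("-")
    if item == ".*":                    # wildcard: any element
        return True
    if item == "-H":                    # element explicitly marked as head
        return "H" in functions
    return category == item


def find_head(phrase_label, child_labels):
    """Index of the head child: try the rule items in priority order and,
    for each item, scan the children from left to right."""
    rule = HEAD_RULES.get(phrase_label.split("-")[0], DEFAULT_RULE)
    for item in rule:
        for index, label in enumerate(child_labels):
            if matches(label, item):
                return index
    raise ValueError("phrase without children")


def convert(tree):
    """Return the head index of every token (1-based, 0 = artificial root)."""
    heads = []                          # heads[i - 1] is the head of token i

    def visit(node):
        label, content = node
        if isinstance(content, str):    # leaf (label, word): a new token
            heads.append(None)
            return len(heads)
        child_heads = [visit(child) for child in content]
        head = child_heads[find_head(label, [c[0] for c in content])]
        for token in child_heads:
            if token != head:           # attach the other children's heads
                heads[token - 1] = head
        return head

    root = visit(tree)
    heads[root - 1] = 0                 # the sentence head hangs off the root
    return heads


TREE = ("S-TTL", [
    ("NP-SUB", [("Nc-H", "Mảnh"), ("N", "đất"),
                ("PP", [("E-H", "của"),
                        ("NP", [("N-H", "đạn"), ("N-H", "bom")])])]),
    ("VP", [("R", "không"), ("V-H", "còn"),
            ("NP-DOB", [("N-H", "người"), ("A", "nghèo")])]),
    (".", "."),
])
heads = convert(TREE)
```

For the example sentence the sketch selects VP as the head child of S (no child is marked -H and there is no S child), so còn becomes the root, and the resulting head indices agree with the head column reported for this sentence in Section III-D.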
No.  Token  POS  Head  Label
1    Mảnh   Nc   7     nsubj
2    đất    N    1     ncdep
3    của    E    1     prepc
4    đạn    N    3     pobj
5    bom    N    4     nn
6    không  R    7     neg
7    còn    V    0     Root
8    người  N    7     dobj
9    nghèo  A    8     amod
10   .      .    7     punct

Table I shows the percentage of the most common labels assigned
to dependencies over the whole Vietnamese treebank of
about 10,000 sentences.

TABLE I
PERCENTAGE OF COMMON DEPENDENCY LABELS ON THE VIETNAMESE TREEBANK
No.  Label  %
1    vmod   9.95
2    rmod   6.36
3    nsubj  5.81
4    dobj   5.7
5    pobj   5.6
6    nn     5.55
7    conj   4.67

IV. EXPERIMENTS WITH MALTPARSER
In this section, we present parsing experiments on the
Vietnamese dependency treebank constructed in the previous
section. We use MaltParser to train and test dependency
parsing models on the treebank using cross-validation. Ten
data sets for training and testing are created. In each round,
500 sentences are randomly selected as the test set and the rest
is used to train MaltParser. The configuration of the parser that
we use is as follows:
• Transition system: Arc-Eager
• Parser configuration: Nivre with allowroot=true and allow_reduce=false
• Feature model: NivreEager.xml
• Learner: liblinear
• Oracle: Arc-Eager
The experimental results are described in Table II.
TABLE II
DEPENDENCY PARSING ACCURACY WITH MALTPARSER
No.      Test (500 sentences)  ASU    ASL
1        1-500                 76.43  70.45
2        1001-1500             75.58  68.40
3        2001-2500             72.37  65.12
4        3001-3500             74.16  66.58
5        4001-4500             69.69  63.47
6        5001-5500             74.10  67.42
7        6001-6500             73.49  67.27
8        7001-7500             72.76  65.91
9        8001-8500             69.04  63.16
10       9001-9500             72.82  65.74
Average                        73.03  66.35

The average ASU is 73.03% and the average ASL is 66.35%.
In these experiments, MaltParser was not optimized for
Vietnamese, so the accuracy is not high. The accuracy
can be improved by fixing some errors in the dependency
treebank, such as wrong roots in sentences with many clauses
and wrong dependencies for special tokens.
The set of guidelines for dependency annotation also needs to
be defined more clearly to improve the quality of dependency
identification.
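The train and test splits behind Table II can be sketched as follows. The sketch follows the contiguous test ranges listed in the table (1-500, 1001-1500, and so on) rather than a random draw, and uses 10,000 as a round stand-in for the "about 10,000" sentences of the treebank; the function name `make_folds` and its parameters are hypothetical.

```python
def make_folds(n_sentences, test_size=500, stride=1000, n_folds=10):
    """Return (train_ids, test_ids) pairs with 1-based sentence numbers."""
    folds = []
    for k in range(n_folds):
        start = k * stride + 1
        end = start + test_size - 1
        if end > n_sentences:
            raise ValueError("test block runs past the end of the corpus")
        test_ids = list(range(start, end + 1))
        held_out = set(test_ids)
        # Everything outside the test block is used for training.
        train_ids = [i for i in range(1, n_sentences + 1)
                     if i not in held_out]
        folds.append((train_ids, test_ids))
    return folds


# 10,000 is an assumed corpus size standing in for "about 10,000 sentences".
folds = make_folds(10_000)
```

Each round then trains a model on `train_ids`, parses `test_ids`, and scores the output; the ten pairs of scores are averaged to give the last row of Table II.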
V. CONCLUSION
There have been several works on constituency parsing
but few on dependency parsing for the Vietnamese
language, as little data exists for training dependency parsers.
However, dependency parsing provides more useful information
for natural language processing than constituency parsing.
Our work aims to build a Vietnamese dependency treebank
automatically from constituency treebanks, which are more
widely available. The dependency label set is defined based on
Vietnamese grammar in a way that allows us to compare our
labels directly with English dependency labels. To do this, the
English dependency label set developed by the NLP group at
Stanford University is used as a reference.
Once the Vietnamese dependency treebank of about 10,000
sentences had been converted from VietTreebank, we carried out
experiments on Vietnamese dependency parsing using MaltParser.
The evaluation gives 73.03% for the average ASU and
66.35% for the average ASL. As a first step, these experimental
results help to reveal some errors in the reference data. In the
next step, we will revise the corpus and carry out experiments
with different parsers to find the best methods for Vietnamese
dependency parsing.

REFERENCES
[1] L. T. Hương, P. H. Quang, and N. T. Thủy, “Một cách tiếp cận trong
việc tự động phân tích cú pháp văn bản tiếng việt,” Tạp chí tin học và
Điều khiển học, vol. 15, no. 4, 2000.
[2] P. T. Nguyen, L. V. Xuan, T. M. H. Nguyen, V. H. Nguyen, and
P. Le-Hong, “Building a large syntactically-annotated corpus of Vietnamese,” in Proceedings of the 3rd Linguistic Annotation Workshop,
ACL-IJCNLP, Singapore, 2009.
[3] N. L. Minh, H. T. Điệp, and T. M. Kế, “Nghiên cứu luật hiệu chỉnh kết
quả dùng phương pháp MST phân tích cú pháp phụ thuộc tiếng việt,”
in ICT-rda 8, Hanoi, Vietnam, 2008, pp. 258–267.
[4] P. Le-Hong, T. M. H. Nguyen, and R. Azim, “Vietnamese parsing
with an automatically extracted tree-adjoining grammar,” in Proceedings
of the IEEE International Conference in Computer Science: Research,
Innovation and Vision of the Future, RIVF, HCMC, Vietnam, 2012.
[5] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryiğit, S. Kübler, S. Marinov,
and E. Marsi, “MaltParser: A language-independent system for data-driven
dependency parsing,” Natural Language Engineering, vol. 13,
no. 2, pp. 95–135, 2007.
[6] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, “Generating
typed dependency parses from phrase structure parses,” in Proceedings
of LREC 2006, Genoa, Italy, 2006.
[7] R. McDonald, K. Lerman, and F. Pereira, “Multilingual dependency
parsing with a two-stage discriminative parser,” in Proceedings of the
Tenth Conference on Computational Natural Language Learning, 2006.
[8] R. McDonald, K. Crammer, and F. Pereira, “Online large-margin training
of dependency parsers,” in Proceedings of the 43rd Annual Meeting of
the Association for Computational Linguistics, 2005.
[9] Q. B. Diệp and V. T. Hoàng, Ngữ pháp Tiếng Việt (Vietnamese
Grammar). NXB Giáo dục, Hà Nội, Việt Nam, 1999.
[10] P. Le-Hong, T. M. H. Nguyen, P. T. Nguyen, and A. Roussanaly,
“Automated extraction of tree adjoining grammars from a treebank for
Vietnamese,” in Proceedings of The Tenth International Workshop on
Tree Adjoining Grammars and Related Formalisms (TAG+10), Yale
University, New Haven, CT, USA, 2010.
