DSpace at VNU: Building a treebank for Vietnamese dependency parsing


2013 IEEE RIVF International Conference on Computing & Communication Technologies Research, Innovation, and Vision for the Future (RIVF)

Building a Treebank for Vietnamese
Dependency Parsing
Luong Nguyen Thi
Dalat University
Lamdong, Vietnam

Linh Ha My, Hung Nguyen Viet,
Huyen Nguyen Thi Minh, Phuong Le Hong
VNU University of Science
Hanoi, Vietnam

Abstract—The problem of Vietnamese syntactic parsing, especially
constituency parsing, has recently been tackled by several
research groups. A common effort of the Vietnamese language
processing community has allowed the creation of VietTreebank,
a reference parsed corpus containing about 10,000 sentences for
the constituency parsing task. In this paper, we present our
work to build a reference treebank, based on VietTreebank, for
the dependency parsing task, which has not yet been well
studied for Vietnamese. First, we define a dependency label set by
adapting the dependency schema developed by the NLP group at
Stanford University and taking into account the particularities of
Vietnamese grammar. Then we propose an algorithm to convert
a constituency treebank to a dependency one. The algorithm is
tested on a set of 100 sentences of the VietTreebank corpus and
reaches an unlabeled attachment score of 99.6%. Finally, we carry
out an experiment on Vietnamese dependency parsing using the
MaltParser tool and the dependency treebank converted from
VietTreebank.

I. INTRODUCTION
Dependency parsing has been an active approach to
syntactic parsing in recent years. The basic idea of dependency
parsing is to find a syntactic structure that consists of
lexical items linked by binary asymmetric relations called
dependencies. There have been many studies on dependency
parsing, and many tools have been developed to solve this problem.
In particular, methods based on machine learning give
high-accuracy parsing results on English, Chinese, and Swedish.
For Vietnamese, most studies have centered on constituency
parsing, such as [1], [2]. The Vietnamese treebank reported
in [2] consists of about 10,000 sentences in Penn Treebank
format. For dependency parsing, there exist only two works:
one by Nguyễn Lê Minh et al. [3], which uses the MST parser on
a corpus of 450 sentences, and one by Lê Hồng
Phương et al. [4], which uses a lexicalized tree-adjoining
grammar parser trained on a subset of the Vietnamese treebank.
In this paper, we report on our work on building a large
corpus for Vietnamese dependency parsing. We first develop
algorithms for converting constituency structures to
dependency structures. We then use the resulting dependency
treebank to train MaltParser, a language-independent dependency
parser [5], and report the parsing results.
This paper is organized as follows. The next section introduces dependency parsing where basic concepts and some
existing works are given. The following section presents the
construction of a Vietnamese dependency treebank. Finally,
the last section reports experimental results on Vietnamese
dependency parsing with MaltParser.

978-1-4799-1350-3/13/$31.00 ©2013 IEEE

II. DEPENDENCY PARSING
A. Definition
The dependency parsing of a sentence consists of determining
the binary asymmetric relations, called dependencies,
between its lexical elements. A dependency relation between
two tokens can be named (labeled) to clarify the relationship
between them.
A dependency structure is determined by the relationship
between a central token (the head) and its dependent token
(the dependent), denoted by an arrow. By convention, the arrow
starts at the head and points to the dependent.
In comparison to constituency structure, dependency structure
is more appropriate for representing the syntax of free
word order languages, such as Czech or Turkish.
In dependency parsing, each syntactic parse of a sentence
can be represented by a dependency graph. A dependency
graph is a graph where each node is a token of the sentence.
Arcs (edges) of the graph represent dependency
relationships between two nodes, and the name of an arc is
the dependency label between those nodes.
For example, consider an English sentence: "Bills on ports
and immigration were submitted by Senator Brownback, Republican of Kansas". Figure 1 shows its dependency graph
containing 13 nodes corresponding to 13 words and 12 relationships connecting these words. The relationships presented
in the sentence are prep(Bills, on), pobj(on, ports). . . [6].
By convention, a special node that does not correspond to
any token in the sentence is introduced to represent the root
of the dependency graph.
Dependency parsing is the problem of constructing the
most probable dependency graph for a given input sentence.
The input of a dependency parser is a tokenized and
part-of-speech tagged sentence. Most studies on dependency parsing
employ machine learning techniques. To build a supervised
dependency parser for a language, we need a large dependency
treebank of that language.
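As a concrete illustration, a labeled dependency graph can be stored as a list of (label, head, dependent) triples over token positions, with position 0 reserved for the artificial root. The minimal Python sketch below uses the arcs of the Vietnamese example sentence analyzed in Section III. The function name `is_well_formed` and the constraints it checks (one head per token, a single root, no cycles) are assumptions made for illustration; this paper does not state them explicitly.

```python
# Arcs of "Mảnh đất của đạn bom không còn người nghèo ." (Section III),
# stored as (label, head position, dependent position); 0 is the artificial root.
ARCS = [
    ("ncdep", 1, 2), ("prepc", 1, 3), ("nsubj", 7, 1), ("pobj", 3, 4),
    ("nn", 4, 5), ("neg", 7, 6), ("Root", 0, 7), ("dobj", 7, 8),
    ("amod", 8, 9), ("punct", 7, 10),
]


def is_well_formed(arcs, n_tokens):
    """Check the (assumed) tree constraints on a dependency graph."""
    heads = {}
    for _label, head, dep in arcs:
        if dep in heads:                      # a token may have only one head
            return False
        heads[dep] = head
    if set(heads) != set(range(1, n_tokens + 1)):   # every token is attached
        return False
    if any(h not in range(0, n_tokens + 1) for h in heads.values()):
        return False                          # heads must be real positions
    if list(heads.values()).count(0) != 1:    # exactly one token under the root
        return False
    for token in heads:                       # following heads must reach 0
        seen = set()
        while token != 0:
            if token in seen:                 # cycle detected
                return False
            seen.add(token)
            token = heads[token]
    return True


print(is_well_formed(ARCS, 10))
```

With n tokens, such a graph has exactly n arcs when the arc from the artificial root is counted, which is consistent with the 13 words and 12 word-to-word relations of the English example above.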
Fig. 1. Dependency graph of an English sentence.

B. Related Works
Recently, dependency parsing has received the attention
of many research groups. There have been many studies
and tools for dependency parsing, such as MaltParser, Stanford
Parser, and MSTParser. Most dependency parsing tools achieve high
accuracy and are suitable for many languages, such as English,
Chinese, German, and Czech. The accuracy of a parser
is evaluated using two indices: the unlabeled attachment score
(ASU), which is the proportion of tokens assigned the correct
head, and the labeled attachment score (ASL), which is the
proportion of tokens assigned both the correct head and the
correct dependency type.
1) MSTParser: MSTParser is developed by Ryan McDonald
et al. [7]. MSTParser has two processes: training and
parsing. In training, MSTParser uses on-line algorithms [8].
In parsing, MSTParser uses a graph-based algorithm. The
accuracy of MSTParser on a variety of languages is quite high:
ASU = 92.8%, ASL = 90.7% for Japanese; ASU = 91.1%,
ASL = 85.9% for Chinese; ASU = 90.4%, ASL = 87.3% for
German.
2) Stanford Parser: Stanford Parser is developed by the NLP
group at Stanford University. Stanford Parser defines 53
dependency types for English based on the Penn Treebank [6].
The accuracy of the parser is quite high, in particular for
English: ASU = 87.2% and ASL = 84.2%. This parser has
been extended to parse languages other than English, such as
Chinese, German, French and Arabic.
3) MaltParser: MaltParser is developed by Johan Hall et al.
MaltParser is one of the most effective dependency parsing tools,
with high accuracy for more than 20 languages. MaltParser has two
processes: training and parsing. In training, MaltParser uses
a support vector machine algorithm. In parsing, MaltParser
uses a transition-based algorithm. The accuracy of the tool is
high, for example ASU = 88.1%, ASL = 86.3% for English
and ASU = 88.1%, ASL = 83.4% for German.

For Vietnamese, few works on dependency parsing exist
because of the lack of a dependency treebank for training. In [3],
MST was used to parse dependency structures in Vietnamese
text. Experiments conducted on 450 Vietnamese sentences
(POS tagged) give an accuracy of ASU = 67.7% and
ASL = 63.11%. Each dependency is assigned a label
by the automatic scoring algorithm in MST. No concrete label
definition is given. In [4], dependencies were determined
from derivation trees produced by TAG parsing. Each word in the
sentence is represented by an elementary tree. Derivation trees
were constructed from these elementary trees and converted to
dependencies by transforming each derivation operation into a
labeled dependency relation. There were 13 labels divided
into 3 types: arg (relationship between a head word and its
argument), mod (modification relation between a word and its
head word), and coord (relationship between two lexical heads of
two coordinating phrases within a conjunction).

As we can see, the most important step in developing a
dependency parser for Vietnamese is to build a reference
dependency treebank. The definition of a dependency label
set is essential for this task. In the next section, we present
our work on constructing a Vietnamese dependency treebank.

III. BUILDING VIETNAMESE DEPENDENCY TREEBANK
To build a dependency treebank for Vietnamese, we first
define a dependency scheme specific to this language. Then
we design an algorithm to convert the available Vietnamese
constituency treebank [2] to a dependency treebank.
The original constituency treebank is a corpus containing
about 10,000 sentences in Penn Treebank format. An example
sentence is (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H
của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn)
(NP-DOB (N-H người) (A nghèo))) (. .)) (see footnote 4), where
• S, NP, PP are the labels of phrases and clauses;
• Nc, N, R are the labels of tokens;
• SUB, H, DOB, ... are the functional syntactic labels of
phrases, clauses or tokens.


The conversion algorithm has two steps: (1) determining all
the dependencies in the sentence and (2) labeling the
dependency relations. The first step is solved by determining the
central element (head element) of every grammatical phrase and
clause using head rules. The second step is done by using a
dependency label set and rules for labeling dependencies.
Footnote 4: What used to be the land of bombs was no longer the land of the poor.

A. Dependency Schema
Different dependency labels represent different types of
relationships between pairs of tokens of a sentence. Typically,
the set of dependency labels depends on a particular language.
Nevertheless, many languages may share an important subset
of dependency labels.
The dependency schema developed by the NLP group at
Stanford University defines 53 types of English dependency.
All of them are binary relations where each dependency
defines a relation between the head and its dependent. We
adapt and extend this schema to build a dependency schema
for Vietnamese which takes into account the particularities of
Vietnamese grammar [9]. This schema consists of 48 labels, all
of which are explicitly defined and consistent with Vietnamese
syntax. The most common dependency labels are given below:
• vmod: verb modifier, for example vmod(đi, qua) in
(VP (V-H đi) (V qua));
• rmod: adverb modifier, for example rmod(Xa xa, nữa)
in (AP (A-H Xa xa) (R nữa));
• dobj: direct object of a verbal phrase, for example
dobj(còn, người) in (VP (R không) (V-H còn)
(NP-DOB (N-H người) (A nghèo)));
• pobj: object of a preposition, for example
pobj(bằng, cùi_tay) in (PP-MNR (E-H bằng) (NP
(M hai) (N-H cùi_tay) (A cụt_lủn))).


Fig. 2. An example of dependency parsing in Vietnamese.

Figure 2 shows a dependency parse of the sentence "Mảnh
đất của đạn bom không còn người nghèo". In this figure, an
edge from "Mảnh" to "đất" indicates that "đất" is the modifier
of "Mảnh". The label of this edge is the name of the
relationship between them.

All dependency relations of this sentence are:
ncdep(Mảnh-1, đất-2)
prepc(Mảnh-1, của-3)
nsubj(còn-7, Mảnh-1)
pobj(của-3, đạn-4)
nn(đạn-4, bom-5)
neg(còn-7, không-6)
Root(ROOT-0, còn-7)
dobj(còn-7, người-8)
amod(người-8, nghèo-9)
punct(còn-7, .-10)

B. Head Rules
In order to determine the head element of each phrase, we
build a head rule table. This table constitutes an important part
of our work. Our head rules follow those presented in [10].

Phrase   Dir.   Head candidates, in priority order
S        →      -H;S;VP;AP;NP;.*
SBAR     →      -H;SBAR;S;VP;AP;NP;.*
SQ       →      -H;SQ;VP;AP;NP;.*
NP       →      -H;NP;Nc;Nu;Np;N;P;.*
VP       →      -H;VP;V;A;AP;N;NP;S;.*
AP       →      -H;AP;A;N;S;.*
RP       ←      -H;RP;R;T;NP;.*
PP       →      -H;PP;E;VP;SBAR;AP;QP;.*
QP       →      -H;QP;M;.*
XP       →      -H;XP;X;.*
YP       →      -H;YP;Y;.*
MDP      →      -H;MDP;T;I;A;P;R;X;.*
WHNP     →      -H;WHNP;NP;Nc;Nu;Np;N;P;.*
WHAP     →      -H;WHAP;A;N;V;P;X;.*
WHRP     →      -H;WHRP;P;E;T;X;.*
WHPP     →      -H;WHPP;E;P;X;.*
WHXP     →      -H;XP;X;.*
UCP      →      -H;.*
WHADV    →      -H;R;.*
WHVP     →      -H;V;.*

For example, the rule:
VP → -H;VP;V;A;AP;N;NP;S;.*
can be understood as follows: to find the head of a VP phrase,
we browse its elements from left to right to find the first element
marked as -H; if there is such an element, it is the head of the VP
phrase; if not, we look for a VP element to be the head; if no VP
is found, we look for V, and so on. If there is no such
element, we take the first element from the left as the head (".*").
The following example describes how to find the head
of the phrase (VP (R không) (V-H còn) (NP-DOB (N-H người)
(A nghèo))). First, we find the head rule for the VP phrase
in the list of head rules. The head rule of the VP phrase is:
VP → -H;VP;V;A;AP;N;NP;S;.*
Second, following this rule, we browse the elements of the VP
phrase from left to right to find the first element marked as -H,
which is (V-H còn). That means the token "còn" is the head of
this VP phrase.

C. Conversion Algorithm
The conversion algorithm has two stages. In the first stage,
a constituency parse is constructed from the bracket format
of each sentence of the treebank. For example, the parsed
sentence (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của)
(NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn)
(NP-DOB (N-H người) (A nghèo))) (. .)) has the constituency parse
shown in Figure 3. In the second stage, the constituency
parse is converted to a dependency parse. This stage has three
steps. First, find the head of each phrase in the sentence using
the head rule table (see Algorithm 1). Second, find a label for
each dependency (head, dependent) (see Algorithm 2). Finally,
build all the labeled dependencies using a recursive routine
calling the two previous steps (see Algorithm 3).
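The bracket reading, head-rule lookup and recursive conversion described in this section can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the names `Node`, `parse`, `head_child`, `convert` and `to_heads` are hypothetical, only a subset of the head-rule table is included (all with left-to-right scanning, so the right-to-left RP rule is not handled), and dependency labels are omitted because the labeling rules are not listed in this paper.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Subset of the head-rule table (left-to-right rules only). The leading "-H"
# test and the trailing ".*" fallback are implemented in head_child below.
HEAD_RULES = {
    "S": ["S", "VP", "AP", "NP"],
    "NP": ["NP", "Nc", "Nu", "Np", "N", "P"],
    "VP": ["VP", "V", "A", "AP", "N", "NP", "S"],
    "AP": ["AP", "A", "N", "S"],
    "PP": ["PP", "E", "VP", "SBAR", "AP", "QP"],
}


@dataclass
class Node:
    cat: str                                   # category, e.g. "NP" or "V"
    funcs: Set[str]                            # function tags, e.g. {"SUB"}, {"H"}
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                 # set for leaves only
    index: int = 0                             # 1-based token position (leaves)


def parse(text):
    """Stage 1: read one bracketed sentence into a tree of Node objects."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos, n_words = 0, 0

    def read():
        nonlocal pos, n_words
        cat, *funcs = tokens[pos + 1].split("-")   # tokens[pos] is "("
        node = Node(cat, set(funcs))
        pos += 2
        if tokens[pos] == "(":                     # phrase: read its children
            while tokens[pos] == "(":
                node.children.append(read())
        else:                                      # leaf: read the word
            n_words += 1
            node.word, node.index = tokens[pos], n_words
            pos += 1
        pos += 1                                   # skip the closing ")"
        return node

    return read()


def head_child(node):
    """Pick the head child: first -H child, then categories by priority."""
    for kid in node.children:
        if "H" in kid.funcs:
            return kid
    for cat in HEAD_RULES.get(node.cat, []):
        for kid in node.children:
            if kid.cat == cat:
                return kid
    return node.children[0]                        # ".*": first child


def convert(node, arcs):
    """Stage 2: return the lexical head of node, recording (head, dep) arcs."""
    if node.word is not None:
        return node
    heads = [convert(kid, arcs) for kid in node.children]
    chosen = head_child(node)
    h = next(hd for kid, hd in zip(node.children, heads) if kid is chosen)
    for d in heads:
        if d is not h:                             # the head has no arc to itself
            arcs.append((h.index, d.index))
    return h


def to_heads(text):
    """Return the head position of every token (0 for the sentence root)."""
    arcs = []
    root = convert(parse(text), arcs)
    arcs.append((0, root.index))
    head_of = {dep: head for head, dep in arcs}
    return [head_of[i] for i in sorted(head_of)]


SENTENCE = ("(S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) "
            "(N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) "
            "(A nghèo))) (. .))")
print(to_heads(SENTENCE))  # [7, 1, 1, 3, 4, 7, 0, 7, 8, 7]
```

On the example sentence, the printed head positions agree with the unlabeled dependencies listed for this sentence in the paper. Assigning labels would additionally require the label rule table used by Algorithm 2.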

Fig. 3. A constituency parse of a sentence in the Vietnamese treebank.

Algorithm 1 FindHeadP(P, lstHeadRules, lstElements)
Require: P: a phrase; lstElements: list of elements in P;
lstHeadRules: list of head rules
Ensure: head of P
  for headRule ∈ lstHeadRules do
    if headRule.Phrase = P then
      hr ← headRule
      break
    end if
  end for
  lstRightHR ← hr.Right
  for rightEle ∈ lstRightHR do
    for element ∈ lstElements do
      if element matches rightEle (by -H mark, phrase label or POS tag) then
        return element
      end if
    end for
  end for

Algorithm 2 GetDependencyLabel(h, d, lstLabels)
Require: (h, d), where h is a head and d is its dependent;
lstLabels: list of labels
Ensure: a dependency label l for the arc from h to d
  for labelEle ∈ lstLabels do
    left ← GetInformation(h, labelEle.Left)
    right ← GetInformation(d, labelEle.Right)
    center ← GetCenterInformation(h, d, labelEle.Center)
    if IsLabel(left, right, center) then
      l ← labelEle.Label
      break
    end if
  end for
  return l

Algorithm 3 ConvertToDP(Root, lstHeadRules, lstLabels, dpTree)
Require: Root: root node of the constituency tree; lstHeadRules:
list of head rules; lstLabels: list of dependency labels;
dpTree: saved dependency tree
Ensure: head of the sentence
  if Root = null then
    return
  end if
  if IsLeaf(Root) then
    lstElements ← Word(Root)
    return FindHeadP(Phrase(Root), lstHeadRules, lstElements)
  end if
  if AllChildIsLeaf(Root) then
    for child ∈ Root do
      add Word(child) to lstElements
    end for
    h ← FindHeadP(Phrase(Root), lstHeadRules, lstElements)
    for child ∈ Root do
      if child ≠ h then
        label ← GetDependencyLabel(h, child, lstLabels)
        add (h, child, label) to dpTree
      end if
    end for
    return h
  end if
  lstHeadChilds ← empty list
  for child ∈ Root do
    add ConvertToDP(child, lstHeadRules, lstLabels, dpTree) to lstHeadChilds
  end for
  h ← FindHeadP(Phrase(Root), lstHeadRules, lstHeadChilds)
  for headChild ∈ lstHeadChilds do
    if headChild ≠ h then
      label ← GetDependencyLabel(h, headChild, lstLabels)
      add (h, headChild, label) to dpTree
    end if
  end for
  return h

D. Results
To evaluate the accuracy of the conversion algorithm, we
first select a subset of 100 sentences from the Vietnamese
treebank and manually annotate them with dependency relations.
We then run the conversion algorithm presented above on
these sentences to get dependency parses and compare them to
the manual annotation. The result is very good: the unlabeled
attachment score is 99.6%, and every correctly attached
dependency also receives the correct label.
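The unlabeled and labeled attachment scores (ASU and ASL) used throughout this paper can be computed by comparing, token by token, a predicted parse with a reference parse. The following Python sketch shows one way to do this. The function name `attachment_scores` and the choice to count every token, punctuation included, are assumptions for illustration; the reference parse is the example sentence of this section, and the "predicted" parse is made up to contain one wrong head and one wrong label.

```python
def attachment_scores(gold, predicted):
    """Return (ASU, ASL) in percent for two parses of the same tokens.

    Each parse is a list with one (head position, label) pair per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("parses must be non-empty and of equal length")
    n = len(gold)
    # ASU: the head alone must match; ASL: head and label must both match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct_heads / n, 100.0 * correct_both / n


# Reference parse of "Mảnh đất của đạn bom không còn người nghèo ."
GOLD = [(7, "nsubj"), (1, "ncdep"), (1, "prepc"), (3, "pobj"), (4, "nn"),
        (7, "neg"), (0, "Root"), (7, "dobj"), (8, "amod"), (7, "punct")]

# Hypothetical parser output: token 3 has a wrong head, token 9 a wrong label.
PRED = list(GOLD)
PRED[2] = (2, "prepc")
PRED[8] = (8, "nn")

print(attachment_scores(GOLD, PRED))  # (90.0, 80.0)
```

For a whole test set, the counts would be summed over all tokens of all sentences before dividing.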
As an example, from the constituency parse (S-TTL (NP-SUB
(Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H
bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A
nghèo))) (. .)), the automatic conversion algorithm produces
the following dependency parse:

No.   Token   POS   Head   Label
1     Mảnh    Nc    7      nsubj
2     đất     N     1      ncdep
3     của     E     1      prepc
4     đạn     N     3      pobj
5     bom     N     4      nn
6     không   R     7      neg
7     còn     V     0      Root
8     người   N     7      dobj
9     nghèo   A     8      amod
10    .       .     7      punct
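A parse in this five-column form can be written out in the ten-column, tab-separated CoNLL-X format, which MaltParser reads by default. The Python sketch below is illustrative only: the paper does not state which file format was used, the function name `to_conllx` is hypothetical, and the unused columns (LEMMA, FEATS, PHEAD, PDEPREL) are filled with underscores while the POS tag is copied into both CPOSTAG and POSTAG.

```python
ROWS = [
    (1, "Mảnh", "Nc", 7, "nsubj"), (2, "đất", "N", 1, "ncdep"),
    (3, "của", "E", 1, "prepc"), (4, "đạn", "N", 3, "pobj"),
    (5, "bom", "N", 4, "nn"), (6, "không", "R", 7, "neg"),
    (7, "còn", "V", 0, "Root"), (8, "người", "N", 7, "dobj"),
    (9, "nghèo", "A", 8, "amod"), (10, ".", ".", 7, "punct"),
]


def to_conllx(rows):
    """Format one sentence as CoNLL-X lines followed by a blank line.

    Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL.
    """
    lines = []
    for idx, form, pos, head, deprel in rows:
        fields = [str(idx), form, "_", pos, pos, "_",
                  str(head), deprel, "_", "_"]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n\n"   # a blank line ends the sentence


print(to_conllx(ROWS), end="")
```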

Table I shows the percentage of the most common labels assigned
to dependencies over the whole Vietnamese treebank of
about 10,000 sentences.

TABLE I. PERCENTAGE OF COMMON DEPENDENCY LABELS ON THE VIETNAMESE TREEBANK
No.   Label   %
1     vmod    9.95
2     rmod    6.36
3     nsubj   5.81
4     dobj    5.7
5     pobj    5.6
6     nn      5.55
7     conj    4.67

IV. EXPERIMENTS WITH MALTPARSER
In this section, we present parsing experiments on the
Vietnamese dependency treebank constructed in the previous
section. We use MaltParser to train and test dependency
parsing models on the treebank using cross-validation. 10 data
sets are created for training and testing. In each round, 500
sentences are randomly selected as the test set and the rest is used
to train MaltParser. The configuration of the parser that we
use is as follows:
• Transition system: Arc-Eager
• Parser configuration: Nivre with allow_root=true and
allow_reduce=false
• Feature model: NivreEager.xml
• Learner: liblinear
• Oracle: Arc-Eager

The experimental results are described in Table II.

TABLE II. DEPENDENCY PARSING ACCURACY WITH MALTPARSER
No.       Test (500 sentences)   ASU (%)   ASL (%)
1         1-500                  76.43     70.45
2         1001-1500              75.58     68.40
3         2001-2500              72.37     65.12
4         3001-3500              74.16     66.58
5         4001-4500              69.69     63.47
6         5001-5500              74.10     67.42
7         6001-6500              73.49     67.27
8         7001-7500              72.76     65.91
9         8001-8500              69.04     63.16
10        9001-9500              72.82     65.74
Average                          73.03     66.35

The average ASU is 73.03% and the average ASL is 66.35%.
In these experiments, MaltParser was not optimized for
Vietnamese, so the accuracy is not high. The accuracy
can be improved by fixing some errors in the dependency
treebank, such as wrongly determined roots in sentences
with many clauses, or wrong dependencies of special tokens.
The set of guidelines for dependency annotation needs to be
defined more clearly to improve the quality of dependency
identification.
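The experimental protocol and the averages of Table II can be illustrated with a short Python sketch. It assumes that the test set of each round is the contiguous sentence range listed in Table II (1-500, 1001-1500, and so on) and that the corpus holds exactly 10,000 sentences. Both are simplifications: the text describes the selection as random and the corpus size as approximate. The function names `split_round` and `mean` are hypothetical.

```python
# Per-round scores copied from Table II.
ASU = [76.43, 75.58, 72.37, 74.16, 69.69, 74.10, 73.49, 72.76, 69.04, 72.82]
ASL = [70.45, 68.40, 65.12, 66.58, 63.47, 67.42, 67.27, 65.91, 63.16, 65.74]


def split_round(sentences, k, test_size=500, stride=1000):
    """Return (train, test) for round k, counting rounds from 0.

    The test set is a block of test_size sentences starting at k * stride;
    all remaining sentences are used for training.
    """
    start = k * stride
    test = sentences[start:start + test_size]
    train = sentences[:start] + sentences[start + test_size:]
    return train, test


def mean(values):
    return sum(values) / len(values)


corpus = list(range(1, 10001))          # stand-in: sentence numbers 1..10000
train, test = split_round(corpus, 1)    # second round of Table II
print(test[0], test[-1], len(train))    # 1001 1500 9500
print(round(mean(ASU), 2), round(mean(ASL), 2))
```

Recomputing the means from the rounded per-round values gives about 73.04 for ASU and 66.35 for ASL. The small difference from the reported ASU average of 73.03 is presumably due to rounding of the per-round figures.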
V. CONCLUSION
There have been several works on constituency parsing
for Vietnamese but few on dependency parsing,
as little data exists for training dependency parsers.
However, dependency structures often provide information that is
more directly useful for natural language processing than
constituency structures. Our work aims to automatically build a
Vietnamese dependency treebank from constituency treebanks,
which are more widely available. The dependency label set is
defined based on Vietnamese grammar in a way that allows us to
compare our labels directly with English dependency labels. To do
this, the English dependency label set developed by the NLP group
at Stanford University is used as reference.
Once the Vietnamese dependency treebank of about 10,000
sentences had been converted from VietTreebank, we carried out
experiments on Vietnamese dependency parsing using MaltParser.
The evaluation results give 73.03% for the average ASU and
66.35% for the average ASL. As a first step, these experimental
results help to reveal some errors in the reference data. In the
next step, we will revise the corpus and carry out experiments
with different parsers to find the best methods for Vietnamese
dependency parsing.
ACKNOWLEDGMENT
This work is supported by the VNU research grant
QG.12.22.
REFERENCES
[1] L. T. Hương, P. H. Quang, and N. T. Thủy, “Một cách tiếp cận trong
việc tự động phân tích cú pháp văn bản tiếng việt,” Tạp chí tin học và
Điều khiển học, vol. 15, no. 4, 2000.
[2] P. T. Nguyen, L. V. Xuan, T. M. H. Nguyen, V. H. Nguyen, and
P. Le-Hong, “Building a large syntactically-annotated corpus of Vietnamese,”
in Proceedings of the 3rd Linguistic Annotation Workshop,
ACL-IJCNLP, Singapore, 2009.
[3] N. L. Minh, H. T. Điệp, and T. M. Kế, “Nghiên cứu luật hiệu chỉnh kết
quả dùng phương pháp MST phân tích cú pháp phụ thuộc tiếng việt,”
in ICT-rda 8, Hanoi, Vietnam, 2008, pp. 258–267.

[4] P. Le-Hong, T. M. H. Nguyen, and R. Azim, “Vietnamese parsing with
an automatically extracted tree-adjoining grammar,” in Proceedings of
the IEEE International Conference in Computer Science: Research,
Innovation and Vision of the Future, RIVF, HCMC, Vietnam, 2012.
[5] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kubler,
S. Marinov, and E. Marsi, “Maltparser: A language-independent system
for data-driven dependency parsing,” Natural Language Engineering,
vol. 13, no. 2, pp. 95–135, 2007.
[6] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, “Generating
typed dependency parses from phrase structure parses,” in Proceedings
of LREC 2006, Genoa, Italy, 2006.
[7] R. McDonald, K. Lerman, and F. Pereira, “Multilingual dependency
parsing with a two-stage discriminative parser,” in Proceedings of the
Tenth Conference on Computational Natural Language Learning, 2006.
[8] R. McDonald, K. Crammer, and F. Pereira, “Online large-margin
training of dependency parsers,” in Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics, 2005.
[9] Q. B. Diệp and V. T. Hoàng, Ngữ pháp Tiếng Việt (Vietnamese
Grammar). NXB Giáo dục, Hà Nội, Việt Nam, 1999.
[10] P. Le-Hong, T. M. H. Nguyen, P. T. Nguyen, and A. Roussanaly,
“Automated extraction of tree adjoining grammars from a treebank
for Vietnamese,” in Proceedings of The Tenth International Workshop
on Tree Adjoining Grammars and Related Formalisms (TAG+10), Yale
University, New Haven, CT, USA, 2010.
