VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND
TECHNOLOGY
HOAI-THU VUONG
INTEGRATED LINGUISTIC TO
STATISTICAL MACHINE
TRANSLATION
MASTER THESIS
HANOI - 2012
Contents

1 Introduction
  1.1 Overview
    1.1.1 A Short Comparison Between English and Vietnamese
  1.2 Machine Translation Approaches
    1.2.1 Interlingua
    1.2.2 Transfer-based Machine Translation
    1.2.3 Direct Translation
  1.3 The Reordering Problem and Motivations
  1.4 Main Contributions of this Thesis
  1.5 Thesis Organization

2 Related works
  2.1 Phrase-based Translation Models
  2.2 Types of phrase orientation
    2.2.1 The Distance-Based Reordering Model
  2.3 The Lexical Reordering Model
  2.4 The Preprocessing Approaches
  2.5 Translation Evaluation
    2.5.1 Automatic Metrics
    2.5.2 NIST Scores
    2.5.3 Other scores
    2.5.4 Human Evaluation Metrics
  2.6 Moses Decoder

3 Shallow Processing for SMT
  3.1 Our Proposed Model
  3.2 The Shallow Syntax
    3.2.1 Definition of the shallow syntax
    3.2.2 How to build the shallow syntax
  3.3 The Transformation Rule
  3.4 Applying the transformation rule to the shallow syntax tree

4 Experiments
  4.1 The bilingual corpus
  4.2 Implementation and Experiments Setup
  4.3 BLEU Score and Discussion

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future work

Appendix A Handwritten transformation rules
Appendix B Script to train the baseline model
Bibliography
List of Tables

1 Corpus statistics
2 Details of our experiments (AR: automatically extracted rules, MR: handwritten rules)
3 Size of the phrase tables
4 Translation performance for the English-Vietnamese task
List of Figures

1 The machine translation pyramid
2 The concept architecture of the Moses Decoder
3 An overview of preprocessing before training and decoding
4 A pair of source and target language sentences
5 The training process
6 The decoding process
7 A shallow syntax tree
8 The building of the shallow syntax
9 Applying the transformation rules to the shallow syntax tree
Chapter 1
Introduction
In this chapter, we give a brief introduction to Statistical Machine Translation (SMT), state the problem we address, the motivations of our work, and the main contributions of this thesis. Firstly, we introduce Machine Translation (MT), which is one of the major applications of Natural Language Processing (NLP), and the statistical approach to it. Then we introduce the main problem of this thesis and our research motivations. The next section describes the main contributions of this thesis. Finally, the content of the thesis is outlined.
1.1 Overview

In the field of NLP, MT is a major application that helps a user automatically translate a sentence from one language to another. MT is very useful in real life: it helps us read websites in foreign languages we do not understand, or understand the content of an advertising board on the street. However, high-quality MT is still a challenge for researchers. The first reason is the ambiguity of natural language at various levels. At the lexical level, we have problems with the morphology of words, such as word tense, and with word segmentation in languages such as Vietnamese, Japanese, Chinese, or Thai, in which there is no symbol separating two words. For example, the Vietnamese sentence "học sinh học sinh học." can be segmented in several ways: "học" is a verb meaning "study" in English, "học sinh" is a noun meaning a pupil or student, and "sinh học" is a noun meaning the subject biology. At the syntactic level, we have attachment ambiguity. For example, in the sentence the man saw the girl with the telescope, we can understand either that the man used the telescope to see the girl, or that the girl who has the telescope was seen by the man. The ambiguity becomes even harder at the semantic level.

Secondly, Jurafsky and Martin (2009) show that there are differences between any pair of languages, such as differences in structure and in the lexicon, which make MT challenging.

In particular, one of the differences between two languages that we aim at in this thesis is the order of words in each language. For example, English is a Subject-Verb-Object (SVO) language, which means the subject comes first, the verb follows the subject, and the object ends the sentence. In the sentence "I go to school", "I" is the subject, the verb is go to, and the object is school. Different from English, Japanese is an SOV language, and Classical Arabic is a VSO language.

In the past, rule-based methods were dominant. MT systems were built with rules created manually by humans, so in a closed domain or restricted area the quality of a rule-based system is very high. However, with the growth of the internet and social networks, we need broad-coverage MT systems, for which the rule-based method is not suitable. Therefore, a new approach was needed, and statistics was applied to the field of MT, at the same time as statistical methods were being applied in many other areas such as automatic speech recognition. Nowadays, some MT systems based on statistical methods, such as Google, can be compared with human translation.
1.1.1 A Short Comparison Between English and Vietnamese

English and Vietnamese have some similarities: both are written with Latin-based characters and both have SVO structure. For example:

en: I go to school.
vn: Tôi đi học.

But the order of words in an English noun phrase is different from that in a Vietnamese one. For example:

en: a black hat
vn: một mũ màu_đen

In the English example above, hat is the head of the noun phrase and it stands at the end of the phrase, while in Vietnamese mũ is also the head noun but it is in the middle of the phrase. The reordering of words can also be seen in wh-questions:

en: what is your job?
vn: công_việc của anh là gì ?

In this example, the English word what means gì in Vietnamese, and the difference in the position of these two words can easily be seen, because English wh-questions follow the S-Structure while Vietnamese ones follow the D-Structure.
1.2 Machine Translation Approaches

In this section, we give a short overview of approaches in the field of machine translation. We begin with the most complex method (interlingua) and end with the simplest one (the direct method). From a source sentence, we use some analysis method to obtain a more abstract structure, and then generate the corresponding structure or sentence in the target language. The most abstract structure is the interlingua (Figure 1).
Figure 1: The machine translation pyramid
1.2.1 Interlingua

Interlingua systems (Farwell and Wilks, 1991; Mitamura, 1999) are based on the idea of finding a language, called the interlingua, that can represent the source sentence and is simple enough to generate the sentence in the other language. Figure 1 shows the process of this approach. The analysis step is an understanding process: from the source sentence, we use NLP techniques to map the source sentence to a data structure in the interlingua, and the target sentence is then produced by a generation process. The problem is how complex the interlingua should be. If the interlingua is too simple, we get many translation options; conversely, the more complex the interlingua is, the more effort the analysis and the generation require.
1.2.2 Transfer-based Machine Translation

Another approach is to analyze the source sentence into a structure (simpler than an interlingua representation), then use some transfer rules to obtain the corresponding structure in the target language, and finally generate the target sentence. In this model, MT involves three phases: analysis, transfer, and generation. Normally all three phases are used, but sometimes only two of them are, for example transferring from the source sentence to a structure in the target language and then generating the target sentence. As an example, a simple transfer rule for translating a source noun phrase into the target language is (taken from Jurafsky and Martin (2009)):

[Nominal → Adj Noun]_{source} ⇒ [Nominal → Noun Adj]_{target}
1.2.3 Direct Translation

1.2.3.1 Example-based Machine Translation

Example-based machine translation was first introduced by Nagao (1984). It uses a bilingual corpus with parallel texts as its main knowledge base at run time. The idea behind it is to find matching fragments in the bilingual corpus and combine them with the parallel text to generate the new target sentence. This method is similar to the process in the human brain. The main problems of example-based machine translation come from the matching criteria, the length of the fragments, and so on.
1.2.3.2 Statistical Machine Translation

Extending the idea of using statistics for speech recognition, Brown et al. (1990, 1993) introduced a statistical method, a version of the noisy channel model, for MT. When the noisy channel is applied to machine translation, the target sentence is viewed as being transformed into the source sentence by a noisy channel. We can represent the MT problem as three tasks of the noisy channel:

forward task: compute the fluency of the target sentence

learning task: from the parallel corpus, estimate the conditional probability between the target sentence and the source sentence

decoding task: find the best target sentence for a given source sentence

The decoding task can be represented by this formula:

\hat{e} = \arg\max_{e} \Pr(e \mid f)

Applying the Bayes rule, we have:

\hat{e} = \arg\max_{e} \frac{\Pr(f \mid e) \cdot \Pr(e)}{\Pr(f)}

Because the denominator is the same for all e, we have:

\hat{e} = \arg\max_{e} \Pr(f \mid e) \cdot \Pr(e)

Jurafsky and Martin (2000, 2009) define Pr(e) as the fluency of the target sentence, known as the language model; it is usually modeled by an n-gram (n-th order Markov) model. Pr(f | e) is defined as the faithfulness between the source and target sentences. We use an alignment model to compute this value, based on the translation unit of the SMT system. Depending on the definition of the translation unit, we have several approaches:

• word-based: using the word as the translation unit (Brown et al., 1993)
• phrase-based: using the phrase as the translation unit (Koehn et al., 2003)
• syntax-based: using a syntactic structure as the translation unit (Yamada and Knight, 2001)
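To make the noisy-channel decision rule above concrete, the following is a minimal sketch (not taken from the thesis) of scoring candidate translations in log space; the candidate list and the two scoring functions are toy placeholders assumed only for illustration.

import math

def lm_score(e_words):
    # Toy language model log Pr(e): reward known English bigrams.
    # A real system would use a trained n-gram model.
    good = {("i", "go"), ("go", "to"), ("to", "school"), ("school", ".")}
    return sum(1.0 for bigram in zip(e_words, e_words[1:]) if bigram in good)

def tm_score(f_words, e_words):
    # Toy translation model log Pr(f | e): prefer outputs of similar length.
    # A real system would sum phrase translation log-probabilities.
    return -abs(len(f_words) - len(e_words))

def decode(f_words, candidates):
    # Noisy-channel decision rule: e_hat = argmax_e Pr(f | e) * Pr(e), in log space.
    best, best_score = None, -math.inf
    for e_words in candidates:
        score = tm_score(f_words, e_words) + lm_score(e_words)
        if score > best_score:
            best, best_score = e_words, score
    return best

print(decode("tôi đi học .".split(),
             ["i go to school .".split(), "i school go to .".split()]))

In a real system the candidate space is enormous, so the argmax is approximated by a search procedure such as the stack decoding described in Chapter 2.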
1.3 The Reordering Problem and Motivations
In the field of MT, the reordering problem is the task of reordering the words in the target language to get the best target sentence. The reordering model is sometimes called the distortion model.

Phrase-based Statistical Machine Translation (PBSMT), introduced by Koehn et al. (2003) and Och and Ney (2004), is currently the state-of-the-art model for word choice and local word reordering. Its translation unit is a sequence of words without linguistic information. Therefore, in this thesis, we would like to integrate some linguistic information, such as chunks, a shallow syntax tree, and transformation rules, with the particular aim of addressing the global reordering problem.
There are some studies on integrating syntactic resources within SMT. Chiang (2005) shows a significant improvement by keeping the strengths of phrases while incorporating syntax into SMT. Chiang (2005) built a kind of syntax tree based on a synchronous Context Free Grammar (CFG), known as hierarchical phrases, used a log-linear model to determine the weights of the extracted rules, and developed a variant of the CYK algorithm for decoding. In this model, phrase reordering is defined by the synchronous CFG.

Some approaches have been applied at the word level (Collins et al., 2005). They are particularly useful for languages with rich morphology, for reducing data sparseness. Other kinds of syntactic reordering methods require parse trees, such as the work in Quirk et al. (2005); Collins et al. (2005); Huang and Mi (2010). A parse tree is more powerful in capturing the sentence structure. However, it is expensive to create tree structures, and building a good-quality parser is also a hard task. All of the above approaches also require much decoding time, which is expensive.
The approach we are interested in here balances the quality of translation against decoding time. Reordering as a preprocessing step (Xia and McCord, 2004; Xu et al., 2009; Talbot et al., 2011; Katz-Brown et al., 2011) is very effective: it yields significant improvements over state-of-the-art phrase-based and hierarchical machine translation systems, and the quality of the reordering model can be evaluated separately.
1.4 Main Contributions of this Thesis

Inspired by this preprocessing approach, we propose a combined approach which preserves the strengths of phrase-based SMT in local reordering and decoding time, as well as the strength of syntax in reordering. As a result, we use an intermediate representation between Part-of-Speech (POS) tags and a full parse tree: shallow parsing. Firstly, we use shallow parsing to preprocess both the training and test data. Secondly, we apply a series of transformation rules to the shallow tree. We obtain two sets of transformation rules: the first set is written by hand, and the other is extracted automatically from the bilingual corpus. The experimental results on an English-Vietnamese pair show that our approach achieves significant improvements over MOSES, the state-of-the-art phrase-based system.
1.5 Thesis Organization

Chapter 2 describes other studies related to our method. Our method is introduced in Chapter 3. In the next chapter (Chapter 4), we give the details of our experiments and the results that demonstrate the effect of our method. A summary of our study is given in Chapter 5, which also discusses future work.
Chapter 2
Related works
In this chapter, we give some background knowledge and a short review of MT. According to the short description in Chapter 1, the target translation is the result of maximizing the product of the faithfulness Pr(f | e) and the fluency of the target sentence Pr(e). The fluency of the target sentence, known as the language model, can be modeled by an n-gram model. The faithfulness can be modeled over a translation unit with an alignment model between the two languages; the alignment model can be extracted automatically from the bilingual corpus (Brown et al., 1990, 1993; Och and Ney, 2003). Depending on the kind of translation unit, we have several methods:

• Word-based translation models: using the word as the translation unit
• Phrase-based translation models: using the phrase as the translation unit
• Syntax-based translation models: using a syntactic structure as the translation unit

Passing over the word-based translation models, we describe PBSMT in Section 2.1. Then Section 2.2 gives the basic kinds of phrase movements in the reordering problem, and one well-known lexicalized reordering model, integrated into the decoding process, is presented in Section 2.3. Section 2.4 gives a brief review of methods that treat reordering as a preprocessing task for training and decoding, using transformation rules and syntax trees. Finally, we introduce the Moses Decoder (Koehn et al., 2007), which is used to train and decode our models.
2.1 Phrase-based Translation Models

PBSMT (Koehn et al., 2003; Och and Ney, 2004) extends the word-based SMT model. In this model, the faithfulness Pr(f | e) is extracted from the bilingual corpus by using an alignment model (the IBM models (Brown et al., 1993)) over contiguous sequences of words. Koehn et al. (2003) and Jurafsky and Martin (2009) describe the general process of phrase-based translation in three steps. First, the words in the source sentence are grouped into phrases \bar{e}_1, \bar{e}_2, \ldots, \bar{e}_I. Next, each phrase \bar{e}_i is translated into a target phrase \bar{f}_j. Finally, each target phrase \bar{f}_j is (optionally) reordered. Koehn et al. (2003) use the IBM models to build the translation model. Two phrases are aligned if each word in the source phrase is aligned only to words in the target phrase; in particular, no word outside the phrase may be aligned to a word inside the phrase, and vice versa. The phrase translation probability can then be computed by the formula

\Pr(\bar{f} \mid \bar{e}) = \frac{count(\bar{f}, \bar{e})}{count(\bar{e})}
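As an illustration of the relative-frequency estimate above, here is a small sketch (not part of the thesis) that computes Pr(f̄ | ē) from a list of extracted phrase pairs; the toy phrase pairs below are invented for the example.

from collections import Counter

def phrase_translation_probs(phrase_pairs):
    # Estimate Pr(f_bar | e_bar) = count(f_bar, e_bar) / count(e_bar) by relative frequency.
    pair_counts = Counter(phrase_pairs)             # counts of (e_bar, f_bar) pairs
    e_counts = Counter(e for e, _ in phrase_pairs)  # counts of e_bar alone
    return {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}

# Toy phrase pairs "extracted" from a word-aligned corpus (invented for illustration).
pairs = [("blue books", "cuốn_sách màu_xanh"),
         ("blue books", "cuốn_sách màu_xanh"),
         ("blue books", "sách xanh"),
         ("go to school", "đi học")]
for (e, f), p in phrase_translation_probs(pairs).items():
    print(f"Pr({f!r} | {e!r}) = {p:.2f}")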
Some alternative methods to extract phrases and learn phrase translation tables have been proposed (Marcu and Wong, 2002; Venugopal et al., 2003) and are compared in Koehn's publication (Koehn et al., 2003).
Another phrase-based model, introduced by Och and Ney (2004), uses a discriminative model. In this variant, the phrase translation probability and other features are integrated into a log-linear model following the formula:

\Pr(e \mid f) = \frac{\exp\left(\sum_{i} \lambda_i h_i(e, f)\right)}{\sum_{e'} \exp\left(\sum_{i} \lambda_i h_i(e', f)\right)}

where h_i(e, f) is a feature function based on the source and target sentences, and \lambda_i is the weight of feature h_i(e, f). Och (2003) introduced a method to estimate these weights using Minimum Error Rate Training (MERT). In their studies, they also use some basic features such as:
• bidirectional phrase models (models that score phrase translations)
• bidirectional lexical models (models that consider the appearance of entries from a conventional translation lexicon in the phrase translation)
• a language model of the target language (usually an n-gram or n-th order Markov model, trained from a monolingual corpus; this corpus is usually the target side of the corpus used to learn the phrase table, or a large independent set of target-language sentences)
• a distortion model
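To illustrate the log-linear combination above, the following small sketch scores one candidate with made-up feature values and weights; the feature names and numbers are purely illustrative, not the feature set or weights used in the thesis.

def log_linear_score(feature_values, weights):
    # Unnormalized log-linear score: sum_i lambda_i * h_i(e, f).
    # The normalization constant is the same for all candidates e and cancels in the argmax.
    return sum(weights[name] * h for name, h in feature_values.items())

# Hypothetical feature values h_i(e, f) for one candidate translation (log-probabilities).
features = {"phrase_fwd": -2.3, "phrase_bwd": -1.9, "lex_fwd": -3.1,
            "lex_bwd": -2.7, "lm": -4.2, "distortion": -0.6}
# Hypothetical weights lambda_i, e.g. as tuned by MERT.
weights = {"phrase_fwd": 0.2, "phrase_bwd": 0.2, "lex_fwd": 0.1,
           "lex_bwd": 0.1, "lm": 0.5, "distortion": 0.3}
print(log_linear_score(features, weights))

Because the normalization term is identical for every candidate translation of a given source sentence, it can be ignored when searching for the highest-scoring translation.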
After we have the translation model, we need a method to decode an input sentence into an output sentence. Normally, a kind of beam search or A* search is used. Koehn (2004) introduced an effective method called stack decoding. The basic idea of this method is to use a set of limited-size stacks, one for each number of translated source words: during translation, when for example three source words have been covered, the resulting hypothesis is stored in the stack for length three. The pseudocode of stack decoding is described in Algorithm 1.

Algorithm 1 The stack decoding algorithm, used in the Pharaoh system, to get the target sentence from the source sentence
Require: a sentence that we would like to translate
Require: a phrase-based model
Require: a language model
  initialize hypothesisStack[0..nf]
  for all i in range(0, nf − 1) do
    for all hyp in hypothesisStack[i] do
      for all new_hyp that can be derived from hyp do
        nf[new_hyp] ← number of foreign words covered by new_hyp
        add new_hyp to hypothesisStack[nf[new_hyp]]
        prune hypothesisStack[nf[new_hyp]]
      end for
    end for
  end for
  find the best hypothesis best_hyp in hypothesisStack[nf]
  return the best path leading to best_hyp via backtrace

2.2 Types of phrase orientation

Tillmann (2004) and Galley and Manning (2008) give the basic types of phrase reordering in a sentence. There are three basic types:

• monotone: a contiguous phrase in the source language is also a contiguous phrase in the target language
• swap: the phrase following a given phrase in the target language is aligned to the phrase preceding the corresponding phrase in the source language
• discontinuous: two consecutive phrases in one language are aligned with two discontinuous phrases in the other language
For example, consider the following pair of sentences:

en: tom 's two blue books are good .
vn: hai cuốn_sách màu_xanh của tôm là tốt .

In this example, are good and là tốt are in monotone order. As an instance of swap, we have the two phrases blue books and cuốn_sách màu_xanh. Finally, the English phrase two blue and the Vietnamese words hai and màu_xanh are an example of discontinuous phrases.
2.2.1 The Distance-Based Reordering Model

Together with the phrase translation model, Koehn et al. (2003) introduced a simple distortion model based on an exponential penalty α:

d(\bar{e}_i, \bar{e}_{i-1}) = \alpha^{\,f_i - f_{i-1} - 1}

so the distortion cost of two consecutive phrases is based on the positions of the corresponding phrases in the other language. For example, the distance for the two phrases hai and màu_xanh is d(4, 3) = \alpha^{3-1-1} = \alpha^{1}.
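A minimal sketch of such a distance-based penalty follows; it assumes the common formulation in terms of the start position of the current source phrase and the end position of the previous one, with an absolute-value jump, and the positions and α below are invented for illustration.

def distance_penalty(start_curr, end_prev, alpha=0.6):
    # Distance-based distortion: alpha ** |start_i - end_{i-1} - 1|.
    # Monotone continuation (start_curr == end_prev + 1) gives penalty alpha ** 0 = 1.
    return alpha ** abs(start_curr - end_prev - 1)

# Toy example: the previous source phrase ended at position 3 and the current
# source phrase starts at position 5, i.e. one source word was skipped.
print(distance_penalty(start_curr=5, end_prev=3))  # 0.6 ** 1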
2.3 The Lexical Reordering Model
Galley and Manning (2008), building on the log-linear model, introduced a new reordering model as a feature of the translation model. The model is parameterized as follows: given a source sentence f, a sequence of target phrases e = (\bar{e}_1, \bar{e}_2, \ldots, \bar{e}_n), and a phrase alignment a = (a_1, a_2, \ldots, a_n) that defines a source phrase \bar{f}_{a_i} for each translated phrase \bar{e}_i, the model estimates the probability of a sequence of orientations o = (o_1, o_2, \ldots, o_n) as follows:

\Pr(o \mid e, f) = \prod_{i=1}^{n} p(o_i \mid \bar{e}_i, \bar{f}_{a_i})

in which each o_i takes a value from the set of possible orientations \Delta = \{M, S, D\}. The three types can be defined as:

• o_i = M if a_i − a_{i−1} = 1
• o_i = S if a_i − a_{i−1} = −1
• o_i = D if a_i − a_{i−1} ≠ ±1

At decoding time, they define three feature functions:

• f_m = \sum_{i=1}^{n} \log p(o_i = M \mid \cdot)
• f_s = \sum_{i=1}^{n} \log p(o_i = S \mid \cdot)
• f_d = \sum_{i=1}^{n} \log p(o_i = D \mid \cdot)
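The orientation labels M, S, and D can be read directly off the phrase alignment a; the following sketch (with an invented alignment) shows the classification rule.

def orientations(alignment):
    # Given a phrase alignment a = (a_1, ..., a_n), where a_i is the index of the
    # source phrase translated by target phrase i, label each step as
    # M (monotone), S (swap), or D (discontinuous).
    labels = []
    for i in range(1, len(alignment)):
        jump = alignment[i] - alignment[i - 1]
        if jump == 1:
            labels.append("M")
        elif jump == -1:
            labels.append("S")
        else:
            labels.append("D")
    return labels

# Toy alignment: target phrases 1..5 translate source phrases 1, 2, 4, 3, 5.
print(orientations([1, 2, 4, 3, 5]))  # ['M', 'D', 'S', 'D']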

2.4 The Preprocessing Approaches
As mentioned in Chapter 1, several approaches using syntactic information have been applied to solve the reordering problem. One of them is syntactic parsing of the source language together with reordering rules as a preprocessing step. The main idea is to transform the source sentences so that their word order is as close to the target sentences as possible, so that EM training becomes much easier and word alignment quality becomes better. There are several studies that improve the reordering problem in this way, such as Xia and McCord (2004); Collins et al. (2005); Nguyen and Shimazu (2006); Wang et al. (2007); Habash (2007); Xu et al. (2009). They all perform reordering in a preprocessing step based on a source-side parse tree, combined with either automatically extracted syntactic rules (Xia and McCord, 2004; Nguyen and Shimazu, 2006; Habash, 2007) or manually written rules (Collins et al., 2005; Wang et al., 2007; Xu et al., 2009).

Xu et al. (2009) described a method using a dependency parse tree and flexible rules to perform the reordering of subject, object, and so on. These rules were written by hand, but Xu et al. (2009) showed that an automatic rule learner can also be used. Collins et al. (2005) developed clause detection and used some handwritten rules to reorder words within the clause. In contrast, Xia and McCord (2004) and Habash (2007) built automatically extracted syntactic rules.

Compared with these approaches, our work has a few differences. Firstly, we aim to develop a phrase-based translation model to translate from English to Vietnamese. Secondly, we build a shallow tree by chunking recursively (chunks of chunks). Thirdly, we use not only automatic rules but also some handwritten rules to transform the source sentence. As Xia and McCord (2004) and Habash (2007) did, we also apply preprocessing at both training and decoding time.

Other approaches use syntactic parsing to provide multiple source-sentence reordering options through word (or phrase) lattices (Zhang et al., 2007; Nguyen et al., 2007). Nguyen et al. (2007) applied some transformation rules, learnt automatically from a bilingual corpus, to reorder words within a chunk. A crucial difference between their methods and ours is that they do not perform reordering during training, while our method handles this by using a more elaborate structure, a shallow syntax tree (chunks of chunks), which is more efficient.
2.5 Translation Evaluation
In the previous sections, we introduced methods that help a computer translate sentences automatically from one language to another. Now, the question is how good the results are. To answer this question, this section introduces some methods to evaluate MT. In general, there are two ways to evaluate the MT task: evaluation by humans, or automatic evaluation by building a metric that simulates human judgment.
2.5.1 Automatic Metrics
2.5.1.1 BLEU Scores
Today, the BLEU score is one of the most widely used methods to evaluate MT. This measure was introduced by IBM (Papineni et al., 2002) and is a kind of n-gram precision metric. A simple unigram precision metric is based on the frequency of words that co-occur in both the output and the reference sentence. Similarly, for the BLEU score we compute the n-gram co-occurrence between the output and the reference translation, and then take a weighted geometric mean. The BLEU score can be computed as follows:

\mathrm{BLEU}_n = \exp\!\left(\sum_{i=1}^{n} \frac{bleu_i}{n} + \text{length\_penalty}\right)

where bleu_i and the length penalty are computed by counting over each sentence pair in the whole test and reference sets. These quantities are computed by the following functions:

bleu_n = \log\!\left(\frac{Nmatched_n}{Ntest_n}\right)

\text{length\_penalty} = \min\!\left(0,\; 1 - \frac{\text{shortest\_ref\_length}}{Ntest_1}\right)
From each pair of output and reference translations, we compute Nmatched_i, Ntest_i, and shortest_ref_length as follows:

Nmatched_i = \sum_{n=1}^{N} \sum_{ngr \in S} \min\!\big(N(test_n, ngr),\; \max_{r} N(ref_{n,r}, ngr)\big)

where S is the set of i-grams in the sentence test_n (the n-th sentence in the test set), N(sent, ngr) is the number of occurrences of the n-gram ngr in the sentence sent, N is the number of sentences in the test set, and ref_{n,r} is the r-th reference of the n-th test sentence. For example, if the n-th sentence in the test set has R reference sentences, then r ∈ {1, 2, …, R}.

Ntest_i = \sum_{n=1}^{N} \big(length(test_n) - i + 1\big)

\text{shortest\_ref\_length} = \sum_{n=1}^{N} \min_{r} \{length(ref_{n,r})\}

With the equations described above, the BLEU score is a quality metric whose value lies in the range from zero to one: zero means the worst translation (there is no match between the reference and the output translation) and one means the best translation. The BLEU score is a kind of precision metric, so a short output containing only words that occur in the reference can achieve a high score even though the translations are not really similar; the duty of the length penalty is therefore to act as a recall, or coverage, weighting.
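For illustration, here is a compact sketch of a sentence-level BLEU-style score with clipped n-gram precisions and the length penalty defined above; it is a simplified stand-in written for this document, not the exact implementation used in the thesis's experiments.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    # Geometric mean of clipped n-gram precisions, times a length (brevity) penalty.
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        # Clip each candidate n-gram count by its maximum count over the references.
        max_ref = Counter()
        for ref in references:
            for ngr, c in ngrams(ref, n).items():
                max_ref[ngr] = max(max_ref[ngr], c)
        matched = sum(min(c, max_ref[ngr]) for ngr, c in cand.items())
        if matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / sum(cand.values())))
    shortest_ref_len = min(len(r) for r in references)
    length_penalty = min(0.0, 1.0 - shortest_ref_len / len(candidate))
    return math.exp(sum(log_precisions) / max_n + length_penalty)

cand = "hai cuốn_sách màu_xanh của tom là tốt .".split()
refs = ["hai cuốn_sách màu_xanh của tôm là tốt .".split()]
print(round(bleu(cand, refs), 3))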
2.5.2 NIST Scores

The NIST evaluation metric, introduced by Doddington (2002), is based on the BLEU metric, but with some differences. The NIST score calculates how informative a particular n-gram is: the rarer a correct n-gram is, the more weight it is given. This leads to a different equation:

\mathrm{NIST}_n = \left(\sum_{i=1}^{n} nist_i\right) \cdot \text{nist\_penalty}\!\left(\frac{Ntest_1}{\overline{Nref}_1}\right)

where nist_i and nist_penalty are computed over each pair of output and references, and \overline{Nref}_1 denotes the average reference length. The NIST score is a quality score ranging from zero (worst translation) upward without a fixed bound; in practice it usually lies between five and twelve.
2.5.3 Other scores

There are other evaluation metrics, such as the Word Error Rate (mWER), the Position-independent Error Rate (mPER), and so on.
2.5.4 Human Evaluation Metrics

Human evaluation is said to be the most accurate method. To evaluate the translation of a bilingual sentence, some people who understand both languages rate the translation along several dimensions. The weakness of this method is its cost and time: it needs many people to take part in evaluating the translations. Nevertheless, this method is used in some shared tasks.

In general, the fluency of the translation is used to rate its quality. To measure the fluency of the translation, we rely on criteria such as how intelligible, how clear, how readable, or how natural the MT output is (Jurafsky and Martin, 2009). One method is to give the raters, who evaluate the MT output, a scale and ask them to rate the output. For example, the scale can range from one (totally unintelligible) to five (totally intelligible), and the same can be done for other aspects of fluency such as clarity, naturalness, or style.

Another aspect is the fidelity of the sentence. Two common aspects of fidelity are measured: adequacy and informativeness. The adequacy metric determines whether the output contains the information that exists in the source sentence. It also uses a scale from one to five (how much of the information in the source was preserved in the target) for raters to rate. This method is useful if we only have monolingual raters, who are native in one language. The other aspect is the informativeness of the translation: whether the MT output contains sufficient information to perform some task. This metric can be measured by the percentage of correct answers to some questions answered by the raters.
2.6 Moses Decoder
Koehn et al. (2007) introduced the Moses Decoder, an SMT system that allows us to train translation models automatically for any language pair. Figure 2 gives the details of the Moses Decoder (the figure is taken from an old version of the Moses manual).

The Moses Decoder supports many kinds of input, such as plain text, XML, confusion networks, and lattices. There are several ways of storing the translation model: the first is in memory, whose limitation is that it requires a machine with a large amount of memory; the second is to store it on disk. For academic purposes, the Moses Decoder implements several decoding algorithms, such as stack (beam) decoding (Koehn et al., 2003), cube pruning (Chiang, 2007), and other methods. In addition, it integrates several types of language models:

• SRILM, introduced by Stolcke (2002); the limitation of this toolkit is the size of the monolingual corpus that can be loaded.
• IRSTLM, introduced by Federico et al. (2008), which uses a quantization method to scale the language model to big data.
• RandLM, introduced by Talbot and Osborne (2007), which uses a Bloom filter as the data structure for counting words.

Figure 2: The concept architecture of the Moses Decoder
Finally, the Moses Decoder supports several output formats: plain text (each line is the target sentence for one source sentence), an n-best list of the top n highest-probability translations of the source sentence, and a search graph, which provides more details about the decoding process.

There are three steps to build and evaluate a translation system:

• training: using the bilingual corpus and a language model, the alignment model is learnt and then the phrase table (or rule table) and, optionally, a lexicalized reordering model are extracted
• tuning: using MERT to learn the feature weights (optimize the model)
• evaluation: using a metric to evaluate the translation system (phrase table and weights); in general, BLEU is used.
Chapter 3
Shallow Processing for SMT
3.1 Our Proposed Model

In this section, we introduce our method for solving the reordering problem by preprocessing the sentences in the source language. Figure 3 gives an overview of our method. Firstly, a shallow syntax tree of the source-language sentence is built. After that, we apply some transformation rules to this shallow syntax tree, so its structure is changed. Finally, we get a new sentence in the source language whose word order is similar to the target language.
Figure 3: An overview of the preprocessing applied before training and decoding (source language sentence → building the shallow syntax → applying the transformation rules → new source language sentence).
For example, in Figure 4 the first line is a source sentence in English and the last line is the target sentence in Vietnamese. The second line is the English sentence after reordering, with its words linked to the corresponding words in the Vietnamese sentence; we can see that the order of words in the source sentence has been changed and is now similar to the word order in Vietnamese.

tom 's two blue books
two books blue 's tom
hai cuốn_sách màu_xanh của tom

Figure 4: A pair of source and target language sentences
Figure 5 shows the details of the training process. Firstly, we build the shallow syntax of the sentences in the source language; then the new source sentences are obtained by applying some transformation rules to the shallow syntax. In this way we obtain a new bilingual corpus, which we use to train a PBSMT model with the Moses Decoder (Koehn et al., 2007).

Figure 5: The training process (source and target language sentences → building the shallow syntax → applying the transformation rules → training the translation model → phrase translation model).
Besides changing the source sentences in the training process, we also apply the same preprocessing when decoding a new source sentence. The sentence is then decoded using beam search with the log-linear model (Och and Ney, 2004). The definition of the transformation rules and how they are applied to reorder the nodes of the shallow syntax tree are described in the next sections.
3.2 The Shallow Syntax
In this section, we describe the shallow syntax tree and how to build it from the source
sentence.
3.2.1 Definition of the shallow syntax
There are previous studies in which syntactic information is used, such as tree-to-string models (Quirk et al., 2005; Liu et al., 2006; Huang et al., 2006; Zhang et al., 2007), string-to-tree models (Galley et al., 2006; Marcu et al., 2006), tree-to-tree models, and hierarchical PBSMT (Chiang, 2007). In these studies, a full syntax tree or a shallow tree is built. Inspired by this idea, we also build a syntax tree of the sentence; the shallow syntax tree in our study is a height-limited version of the full syntax tree, with a height of two.
Figure 6: The decoding process. The source language sentence goes through building the shallow syntax and applying the transformation rules, and is then decoded by beam search with the log-linear model e^{*} = \arg\max_{e} \sum_{m=0}^{M} \lambda_m h_m(e, f), whose features include the language model h_1(e) and the translation model h_2(e, f), to produce the target language sentence.
Figure 7 shows an example of a shallow syntax tree for the English sentence tom 's two blue books are good, with POS and phrase tags such as NP and CD (we use the Penn Treebank tag set (Marcus et al., 1993)). The example shows that this tree is not a full parse tree: the root of the tree is S, and its last child carries the tag JJ, the POS of the word good.
Figure 7: A shallow syntax tree (the root S has three children: an NP spanning "tom 's two blue books", the verb VB "are", and the adjective JJ "good"; inside the outer NP, the words "two blue books" form an inner NP).

3.2.2 How to build the shallow syntax

Figure 8 represents the process of building the shallow syntax tree. Parsing by chunking (Sang, 2000; Tsuruoka and Tsujii, 2005; Tsuruoka et al., 2009) builds the syntax tree of a sentence by applying a chunking process recursively. Firstly, the input sentence is chunked once by a base chunker into a shallow tree (Figure 8a); in fact, this kind of shallow tree is already used in NLP problems such as named entity recognition and base NP detection. After that, the head word of each chunk is extracted (for example, the word books in Figure 8b), and another chunking model is applied to build the next shallow layer. If we looped this process until we reached a tree with only a root node, we would retrieve the full syntax tree. However, we stop at the first level of the loop and obtain a shallow tree whose maximum height is two (Figure 8b). Finally, we get the shallow syntax tree shown in Figure 7.

Figure 8: The building of the shallow syntax: (a) the first step, chunking the sentence into base chunks; (b) the second step, in which each chunk is represented by its head word and chunked again.
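The following sketch illustrates the "parsing by chunking" idea; the chunker here is a hard-coded stand-in for a trained chunking model, and the node representation is invented for this example.

def build_shallow_tree(tagged, chunker, max_levels=2):
    # Parsing by chunking: chunk the sequence, then (optionally) chunk again over the results;
    # we stop after max_levels instead of looping until a full parse tree is reached.
    layer = list(tagged)                      # leaves are (POS, word) pairs
    for _ in range(max_levels - 1):
        layer = chunker(layer)
    return ("S", layer)

def toy_chunker(items):
    # Stand-in for a trained chunking model (e.g. Sang, 2000; Tsuruoka and Tsujii, 2005):
    # here it simply groups the first five items into one NP chunk and marks its head.
    if len(items) > 5:
        return [("NP", items[:5], items[4])] + items[5:]   # (label, children, head)
    return items

sent = [("NNP", "tom"), ("POS", "'s"), ("CD", "two"),
        ("JJ", "blue"), ("NNS", "books"), ("VB", "are"), ("JJ", "good")]
print(build_shallow_tree(sent, toy_chunker))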
3.3 The Transformation Rule

After building the shallow tree, we can use this syntax tree to reorder the words in the source sentence. Changing the order of some words in the source sentence corresponds to changing the order of nodes in the syntax tree, whose nodes are augmented to include a word and a POS label. To do that, we apply transformation rules, each represented as (LHS → RHS, RS). In this form, LHS → RHS is an unlexicalized CFG rule and RS is a reordering sequence. LHS is the left-hand-side symbol, usually a POS label or a phrase tag in the grammar of the source language, and RHS is the right-hand side of the rule, a sequence of symbols in the grammar of the source language. The rule is called unlexicalized because the RHS never contains a word of the source or target language. Each element in the reordering sequence is the index of a symbol in the RHS. For example, the rule (NP → JJ NN, 1 0) transforms the rule (NP → JJ NN) in the source language into the rule (NP → NN JJ) in the target language. Note that the reordering sequence is a permutation of n elements, where each element is the index of a symbol in the RHS and n is the length of the RHS. Therefore, for the same CFG rule, we may have a number of different transformation rules.
In this thesis, the transformation rules are written manually or extracted automatically from the bilingual corpus. A set of handwritten rules is provided in Appendix A. To extract the transformation rules from the bilingual corpus, we use the method of Nguyen and Shimazu (2006).
3.4 Applying the transformation rule to the shallow syntax tree

Algorithm 2 gives the details of how we apply the transformation rules to the shallow syntax tree. We traverse each node of the syntax tree, find a rule that matches the structure of the tree, and perform the transformation. If no rule is found, we keep the order of the words in the sentence the same as in the input. For example, we have a pair of phrases:

en: tom 's two blue books
vn: hai cuốn_sách màu_xanh của tom

Figure 9a shows the shallow syntax tree of the English phrase as the input of our preprocessing, Figure 9b is the result of reordering at the base-chunk level, and Figure 9c shows the result of the reordering process over the whole shallow syntax tree. The new English phrase is two books blue 's tom, which has the same word order as the Vietnamese phrase.
Algorithm 2 Apply the transformation rules to the shallow syntax tree
Require: the root of a shallow syntax tree
Require: a set of transformation rules
  if root is not a terminal node then
    x ← the CFG rule of the root node
    for all transformation rules do
      if x matches the transformation rule then
        reorder the children of this root
        break
      end if
    end for
    for all children do
      recurse on each child
    end for
  end if
  return a source sentence whose word order is similar to the target sentence
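A compact sketch of Algorithm 2 follows; the tree encoding, the two rules, and the helper functions are invented for illustration (the rules are chosen so that the example of Figure 9 is reproduced), not taken from the thesis's implementation.

def reorder_children(label, children, rules):
    # Apply the first matching transformation rule to the children of one node.
    child_labels = tuple(c[0] for c in children)
    for lhs, rhs, rs in rules:
        if lhs == label and rhs == child_labels:
            return [children[i] for i in rs]
    return children

def apply_rules(node, rules):
    # Recursive traversal of Algorithm 2: reorder this node's children, then recurse.
    label, children = node
    if not isinstance(children, list):          # terminal node: (POS, word)
        return node
    children = reorder_children(label, children, rules)
    return (label, [apply_rules(c, rules) for c in children])

def leaves(node):
    label, children = node
    if not isinstance(children, list):
        return [children]
    return [w for c in children for w in leaves(c)]

# Shallow tree for "tom 's two blue books" and two illustrative rules
# (one at the base-chunk level, one at the top level), both invented for the example.
tree = ("NP", [("NNP", "tom"), ("POS", "'s"),
               ("NP", [("CD", "two"), ("JJ", "blue"), ("NNS", "books")])])
rules = [("NP", ("CD", "JJ", "NNS"), (0, 2, 1)),         # two blue books -> two books blue
         ("NP", ("NNP", "POS", "NP"), (2, 1, 0))]        # tom 's NP -> NP 's tom
print(" ".join(leaves(apply_rules(tree, rules))))        # two books blue 's tom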
Figure 9: Applying the transformation rules to the shallow syntax tree: (a) an input shallow syntax tree; (b) the shallow syntax tree after reordering the nodes at the base-chunk level; (c) the shallow syntax tree after the overall reordering.
