
Refining Lexical Translation Training Scheme for
Improving The Quality of Statistical Phrase-Based
Translation
Cuong Hoang1, Cuong Anh Le1, Son Bao Pham1,2
1 Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
2 Information Technology Institute, Vietnam National University, Hanoi
{cuongh.mi10, cuongla, sonpb}@vnu.edu.vn
ABSTRACT
Under word-based alignment, frequent words with consistent translations can be aligned at a high rate of precision. However, words that are less frequent or exhibit diverse translations in the training corpora generally do not have statistically significant evidence for confident alignments [7]. In this work, we propose a bootstrapping algorithm to capture those less frequent or diverse translations. Interestingly, we avoid making any explicit assumption about the pair of languages used. As a result, we carry out experimental evaluations on two phrase-based translation systems: English-Vietnamese and English-French. Experiments show a significant "boosting" of the overall quality for both tasks.

1. INTRODUCTION

Statistical Machine Translation (SMT) is a machine translation approach in which sentence translations are generated based on statistical models whose parameters are derived from the analysis of parallel sentence pairs in a bilingual corpus. In SMT, the best performing systems are based in some way on phrases (groups of words). The basic idea of phrase-based translation is to break a given source sentence into phrases, then translate each phrase and finally compose the target sentence from these phrase translations [9, 12].
For a statistical phrase-based translation system, the accuracy of the statistical word-based alignment models is critically important. In fact, under lexical alignment models (IBM Models 1-2), frequent words with a consistent translation can usually be aligned at a high rate of precision. However, for words that are less frequent or exhibit diverse translations, in general we do not have statistically significant evidence for a confident alignment.


This problem tends to deeply reduce the translation quality in several important ways. First, the diverse translations of a source word can never be recognized by the original statistical alignment models. This is an important point: it indicates that a pure statistical IBM alignment model is not sufficient to reach the state of the art in alignment quality. Second, a bad estimation of the lexical translation probabilities negatively influences the quality of the higher, fertility-based alignment models [6]. Finally, phrase extraction is then unable to generate more diverse translations for each word or phrase.
In our observation, the essence of language is flexibility and diversity. Therefore, we need to capture those diverse translations in order to obtain superior quality. To overcome this problem, some papers report improvements when linguistic knowledge is used [7]. In general, the linguistic knowledge mainly serves to filter out incorrect alignments. As a result, such approaches cannot be easily applied to all language pairs without adaptation.
Different from the previous methods, in this work we propose a bootstrapping word alignment algorithm for improving the modelling of lexical translation. Basically, we found that our alignment model is better at "capturing" the diverse translations of words, and thereby reduces the bad alignments of rare words. Following the work from [6], we also show a very interesting point: although we mainly focus on IBM Models 1-2, we found that it is possible to significantly improve the quality of the fertility-based alignment models as well.
Consequently, with the improved word-based alignment models, we show that our phrase-based SMT system gains a statistically significant improvement in translation quality. The evaluation of our work is performed on different tasks with different languages. Since it does not rely on linguistic knowledge, we believe our approach will be applicable to other language pairs.
The rest of this paper is organized as follows: Section 2 presents IBM Models 1-2. Section 3 describes the problem of bad alignments for rare words or for words with diverse translations. Section 4 presents our bootstrapping word alignment algorithm. Section 5 presents our experimental evaluations. Finally, conclusions are drawn in Section 6.

2. IBM MODELS 1-2



Model 1 is a probabilistic generative model within a framework that assumes a source sentence f_1^J of length J translates as a target sentence e_1^I of length I. It is defined as a particularly simple instance of this framework, by assuming that all possible lengths for f_1^J (less than some arbitrary upper bound) have a uniform probability ε. Let t(f_j|e_i) be the translation probability of f_j given e_i. The alignment is determined by specifying the values of a_j for j from 1 to J. [1] yields the following:

    Pr(f|e) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j|e_i)    (1)

The parameters of Model 1 for a given pair of languages are normally estimated using EM [3]. We call the expected number of times that word e_i connects to f_j in the translation pair (f_1^J|e_1^I) the count of f_j given e_i, and denote it by c(f_j|e_i; f_1^J, e_1^I). Following the derivation in [1], c(f_j|e_i; f_1^J, e_1^I) can be calculated as follows:

    c(f_j|e_i; f_1^J, e_1^I) = \frac{t(f_j|e_i)}{\sum_{i=0}^{I} t(f_j|e_i)} \cdot \sum_{j=1}^{J} \sigma(f_1^J, f_j) \cdot \sum_{i=0}^{I} \sigma(e_1^I, e_i)    (2)

In addition, we set λ_e as a normalization factor and then iteratively re-estimate the translation probability of a word f_j in f_1^J given a word e_i in e_1^I as:

    t(f_j|e_i) = \lambda_e^{-1} \, c(f_j|e_i; f_1^J, e_1^I)    (3)

IBM Model 2 is another simple model, better than Model 1 in that it addresses the issue of alignment with an explicit model based on the positions of the input and output words. We make the same assumptions as in Model 1, except that we assume Pr(a_j | a_1^{j-1}, f_1^{j-1}, J, e) depends on j, a_j, and J, as well as on I. The equation for estimating the probability of a target sentence given a source sentence becomes:

    Pr(f|e) = \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j|e_i) \, a(i|j, J, I)    (4)
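For concreteness, the following is a minimal Python sketch of the Model 1 EM re-estimation described by equations (2), (3) and (5); the corpus format and the function name train_model1 are our own assumptions for illustration, not part of the paper or of LGIZA.

from collections import defaultdict

def train_model1(corpus, iterations=5):
    """Minimal IBM Model 1 EM training: a sketch of equations (2), (3) and (5).

    corpus: list of (f_tokens, e_tokens) sentence pairs, e.g.
            [(["la", "maison"], ["the", "house"]), ...]
    Returns t, where t[f][e] is the lexical translation probability t(f|e).
    """
    t = defaultdict(lambda: defaultdict(lambda: 1.0))     # uniform start
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))   # expected counts c(f|e), eq. (2)
        lam = defaultdict(float)                          # normalization factors lambda_e, eq. (5)
        for f_sent, e_sent in corpus:
            e_sent = ["NULL"] + e_sent                    # position 0 is the empty word
            for f in f_sent:
                denom = sum(t[f][e] for e in e_sent)      # sum_i t(f_j|e_i)
                for e in e_sent:
                    c = t[f][e] / denom                   # fractional count for this (f, e) pair
                    count[e][f] += c
                    lam[e] += c
        t = defaultdict(lambda: defaultdict(lambda: 1e-12))
        for e in count:                                   # re-estimation, eq. (3)
            for f in count[e]:
                t[f][e] = count[e][f] / lam[e]
    return t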
3. A CAUSE FOR BAD WORD ALIGNMENT TRANSLATION

Without loss of generality, we assume that there exists a set F_{e_i} containing n possible word translations of a word e_i:

    F_{e_i} = {f_1, f_2, . . . , f_n}.

It means that, in the parallel corpus, the correct alignments of the elements of the set F_{e_i} are the lexical pairs (f_j : e_i) (j ∈ {1, 2, . . . , n}). For each training iteration of the word-based alignment model, the λ_{e_i} normalization factor is defined as the sum of all the expected lexical translation counts between the f_j and e_i:

    \lambda_{e_i} = \sum_{j=1}^{n} c(f_j|e_i)    (5)

We now consider a foreign word, for example f_j, and examine the case where f_j frequently co-occurs with a word e_k in the bilingual sentence pairs of the parallel corpus, under the assumption that this pair is not a pair of corresponding lexical translations in linguistics. We expect the lexical translation probability t(f_j|e_k) to always be smaller than the lexical translation probability t(f_j|e_i). Similar to the λ_{e_i} normalization factor, we have the λ_{e_k} normalization factor of the word e_k:

    \lambda_{e_k} = \sum_{j=1}^{m} c(f_j|e_k)    (6)

Unfortunately, if the word e_k appears less often than the word e_i (e_k ≪ e_i) in the training corpus, the λ_{e_i} normalization factor is usually much greater than the λ_{e_k} normalization factor of the word e_k (λ_{e_k} ≪ λ_{e_i}). Therefore, following equation (3), the lexical translation probability t(f_j|e_k) ends up much greater than the lexical translation probability t(f_j|e_i), and we cannot "capture" the option of choosing the correct alignment pair (f_j, e_i) as expected.

Similarly, assume that the word f_j is one of the diverse translations of the word e_i. Because e_i has a larger number of possible translation options, λ_{e_i} also takes a greater value. Hence, following equation (3), the lexical translation probability t(f_j|e_i) takes a very small value. In this case, even when e_k is not a rare word, it is very common that t(f_j|e_i) ≪ t(f_j|e_k). Therefore, we cannot control the diverse translation (f_j; e_i) as expected.

In our observations, we found that these situations happen very often; hence, they strongly impact the quality of word-by-word alignment modelling. In the following, we give an example that shows this problem clearly. From a parallel corpus that contains 60,000 parallel sentences (English-Vietnamese), Table 1 and Table 2 show results derived from the "Viterbi" alignments pegged by training IBM Model 2.

In this section, we also propose an index, entitled the Average Number of Best Alignments (ANBA); importantly, it is the key of this work. Previously, [1] introduced the idea of an alignment between a pair of strings as an object indicating, for each word in the French string, the word in the English string from which it arose. For each specific translation model, given a pair of parallel sentences, we find the "best" corresponding word of every word of the source sentence. Our focus is to find the correlation between the occurrence frequency of a word and its probability of being chosen as the best alignment pair (that word with its corresponding "marked" alignment word) by our statistical alignment model.

In other words, we try to find the relationship between the number of occurrences of a word and the possibility that it is "pegged" as the best alignment pair. These best alignment pairs are often not accurate; hence, our aim is to reflect the possible errors. For convenience, a group (or class) of target words here means all the words that have the same occurrence frequency in the training data (the Freq column).



Hence, the ANBA index of a class is calculated as the ratio between the number of times the words belonging to that class were chosen as the best word-by-word correspondence of some source word, and the number of words in that class. ANBA therefore also reflects the average number of possible translations of a group of target words. Consider the ANBA tables for both English and Vietnamese words when each side was chosen as the target language in translation.
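As a rough illustration, the following Python sketch shows one way the ANBA index could be computed from a list of pegged best-alignment pairs; the variable names and the input format are our own assumptions, not taken from the paper.

from collections import Counter

def anba_by_frequency(corpus_target_words, best_alignment_pairs, max_freq=10):
    """Average Number of Best Alignments per frequency class (a sketch).

    corpus_target_words: list of all target-side tokens in the training data.
    best_alignment_pairs: list of (source_word, target_word) pairs pegged as
        the best word-by-word alignment by the lexical model.
    Returns {frequency: ANBA} for frequencies 1..max_freq.
    """
    freq = Counter(corpus_target_words)                       # occurrence count of each target word type
    pegged = Counter(tgt for _, tgt in best_alignment_pairs)  # times each type was a best alignment

    anba = {}
    for k in range(1, max_freq + 1):
        word_class = [w for w, c in freq.items() if c == k]   # all types with frequency k
        if not word_class:
            continue
        total_pegged = sum(pegged[w] for w in word_class)
        anba[k] = total_pegged / len(word_class)               # average pegs per type in the class
    return anba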
Freq   Num. of Words   ANBA   Deviation
1      5874            3.31   1.18
2      2350            2.76   0.29
3      1444            2.48   0.07
4      963             2.27   0.00
5      698             2.13   0.01
6      604             2.02   0.04
7      471             1.89   0.11
8      389             1.85   0.14
9      301             1.78   0.20
10     278             1.73   0.24
σ = 2.28

Table 1: The ANBA Statistical Table for Vietnamese Words
Freq   Num. of Words   ANBA   Deviation
1      11815           4.45   3.78
2      4219            3.08   0.32
3      2190            2.69   0.03
4      1350            2.46   0.00
5      932             2.29   0.05
6      681             2.19   0.10
7      524             2.12   0.15
8      412             2.01   0.25
9      322             1.92   0.34
10     305             1.85   0.43
σ = 5.46

Table 2: The ANBA Statistical Table for English Words

Tables 1 and 2 clearly show that we can only "capture" the translations of a word when that word does not appear many times (the diverse translation problem). In addition, the more often a word appears, the more difficult it becomes to capture its diverse target translations. Besides, we see that the ANBA index grows strongly as the frequency of a word decreases: when a word appears more rarely than the others in the training data, it is more often chosen as the best "Viterbi" alignment of a source word.

4. IMPROVING LEXICAL TRANSLATION MODEL

The problem of rare words, or of words which have many diverse translations, poses an interesting challenge. Fortunately, IBM Models 1-2 are simple models, in the sense that their training can be implemented very fast compared to the complexity of training the higher IBM models. Therefore, our idea focuses on improving their training scheme in order to gain a better result.

4.1 Improving Lexical Translation Model

Turning back to the set F_{e_i} which contains n possible lexical translations of the word e_i, we assume that the pair (f_n; e_i) appears many times in the training corpus. From equations (2) and (3), t(f_n|e_i) will obtain a very high value. Consequently, the remaining lexical translation probabilities t(f_j|e_i) with j ∈ {1 . . . (n − 1)} will obtain very small values. This becomes worse when the cardinality of the set F of a word is large, as expected from the diversity of language.

Clearly, t(f_j|e_i) never reaches a "satisfying" translation probability. Therefore, noisy alignment choices occur frequently, because pairs that are not real lexical pairs (in linguistics) easily obtain higher lexical translation probabilities. To overcome this noise problem, in the following we present a bootstrapping alignment algorithm for refining the training scheme of IBM Models 1-2.

In more detail, we divide the training scheme of a translation model into N steps, where N is called the smoothing bootstrapping factor. In other words, N can also be defined as the number of times we re-train our IBM Models. Without loss of generality, we assume that the word f_j does not occur more often than the word f_{j+1}. For convenience, we denote this assumption as: c(f_1) ≤ c(f_2) ≤ . . . ≤ c(f_n). We divide our training work into N consecutive steps, each of which tries to separate f_n whenever the lexical translation probability t(f_n|e_i) is the maximum translation probability at that time, compared with the other target words e_k.

Hence, if we mark and filter out every "marked" pair of f_n and e_i in the bilingual corpus when they are chosen as the best lexical alignment at that time, we obtain a new training corpus with a set of updated "parallel sentences". If we re-train on this new parallel corpus, for each training iteration we have a new normalization factor λ̃_{e_i} of the word e_i:

    \tilde{\lambda}_{e_i} = \sum_{j=1}^{n-1} c(f_j|e_i)    (7)

It is very interesting that each time we separate f_n out of the set F, the new lexical translation probability t̃(f_k|e_i) (for k different from n) increases, as expected, because λ̃_{e_i} decreases compared with the original normalization factor λ_{e_i}. This comes from the fact that there is no longer any need to add the value c(f_n|e_i) to λ̃_{e_i}. Therefore, the possibility that the source lexical word f_j will be automatically aligned to e_i increases as well. This is the main key that allows us to capture the other "diverse" translations of a word.
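As a purely hypothetical numeric illustration (the counts below are invented for exposition and do not come from the paper's data): suppose e_i has three candidate translations with expected counts c(f_1|e_i) = 1, c(f_2|e_i) = 2 and c(f_3|e_i) = 17. Then λ_{e_i} = 20 and, by equation (3), t(f_1|e_i) = 0.05 and t(f_2|e_i) = 0.10. If the dominant pair (f_3; e_i) is pegged and filtered out, the new factor is λ̃_{e_i} = 3, and the same counts now yield t̃(f_1|e_i) ≈ 0.33 and t̃(f_2|e_i) ≈ 0.67, so the rarer translations become visible to the model.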

4.2 The Bootstrapping Word Alignment Algorithm

From the above analysis, we have a simple but very effective way to improve the quality of lexical translation modelling. In this section, we formalize our bootstrapping word alignment algorithm. Put the set:

    N = {∆_1, ∆_2, . . . , ∆_n}  (∆_1 < ∆_2 < · · · < ∆_n)

as the set that defines the covering range of our bootstrapping word alignment algorithm. Each value ∆_i can be understood as an occurrence threshold (or frequency threshold) for separating a group of words according to their level of number of occurrences. The bootstrapping word alignment algorithm is formally described as follows:

The Bootstrapping Word Alignment Algorithm
Input: e_1^S, f_1^S, N = {∆_1, ∆_2, . . . , ∆_n} (∆_1 < ∆_2 < · · · < ∆_n)
Output: Alignment A
1. Start with A = ∅.
2. For each ∆_n as a threshold (taking the thresholds from the largest to the smallest):
     Count the frequency of each word in f_1^S.
     Train the IBM Model.
     For each sentence pair (f^(s), e^(s)), 1 ≤ s ≤ S:
       For each e_i in e^(s):
         If e_i is already marked in A, continue.
         Find the best f_j in f^(s); if that f_j is already marked in A, continue.
         If c(f_j) ≥ ∆_n:
           Mark (f_j; e_i) as a word pair and add (f_j; e_i) to A.
           Change e_i in e^(s).
     n = n − 1; go to step 2.

For a word f_j which is the best alignment of a source word e_i and whose number of occurrences is above the threshold condition, we mark the pair and add (f_j; e_i) to the alignment set A. After that, we need to mark e_i as "pegged". Similar to [1], we change the word e_i by adding a prefix "UNK" to it. Changing the word e_i, as noted previously, serves to "boost" the probability of obtaining another possible translation f_k of e_i in other sentences (by preventing the translation count of t(f_j|e_i) from being added to the total λ_{e_i} value, since e_i in e^(s) has been changed by the added prefix).
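As a rough illustration of the loop above, the following Python sketch reuses the train_model1 function sketched in Section 2; the data structures and helper logic here are our own assumptions, and the real implementation in the paper is built on LGIZA rather than on this code.

from collections import defaultdict

def bootstrap_alignments(corpus, thresholds):
    """Bootstrapping word alignment: a sketch of the algorithm in Section 4.2.

    corpus: list of (f_tokens, e_tokens) sentence pairs; a mutable copy is
        modified and re-trained at every step.
    thresholds: the frequency thresholds Delta_1 < ... < Delta_n, processed
        from the largest down.
    Returns the set A of pegged (f_word, e_word) pairs.
    """
    A = set()
    corpus = [(list(f), list(e)) for f, e in corpus]      # work on a mutable copy
    for delta in sorted(thresholds, reverse=True):
        freq = defaultdict(int)                           # source-word frequencies c(f)
        for f_sent, _ in corpus:
            for w in f_sent:
                freq[w] += 1
        t = train_model1(corpus)                          # re-train the lexical model (Section 2 sketch)
        for f_sent, e_sent in corpus:
            for idx, e in enumerate(e_sent):
                if e.startswith("UNK") or any(e == pe for _, pe in A):
                    continue                              # e_i is already pegged
                best_f = max(f_sent, key=lambda f: t[f][e])   # Viterbi choice for e_i
                if any(best_f == pf for pf, _ in A):
                    continue                              # f_j is already pegged
                if freq[best_f] >= delta:
                    A.add((best_f, e))                    # peg the pair (f_j; e_i)
                    e_sent[idx] = "UNK" + e               # mark e_i so later retraining ignores it
    return A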
In fact, training the lexical translation models does not cost much computing power. However, re-training the system n times, where n is the cardinality of the set N, is computationally expensive. We therefore have an upgraded version of our bootstrapping word alignment algorithm: after pegging, for example, (f_j; e_i) as the best alignment satisfying the count frequency condition, we remove these pairs from our training data and re-train the system with the new threshold ∆_i. This greatly reduces both the computational resources and the processing time.
In addition, we improve the way of choosing each element ∆_i in the set N, which helps us not only cover a larger range of ∆_n but also reduce the computational cost. These improved schemes are described in more detail in the experimental section.

5. EXPERIMENT

Recent research points out that it is difficult to achieve large gains in translation performance from improving word-based alignment results alone. Word alignment quality [4] can be strongly improved, yet it is hard to improve the overall quality of the statistical phrase-based translation system [2, 14]. Therefore, to confirm the influence of the improvements in lexical translation, we directly test the impact of the word alignment extraction component used for "learning" phrases in the phrase-based SMT system.
The experiments are carried out on the English-Vietnamese language pair, and also on the English-French pair for larger training data. The English-Vietnamese training data comes from [5]. The English-French training corpus is the Hansards corpus [10]. We use the MOSES framework [8] as the phrase-based SMT framework. In all evaluations, we translate sentences from Vietnamese to English. We measure performance using the BLEU metric [13], which estimates the accuracy of the translation output with respect to a reference translation. We use 1,000 pairs of parallel sentences for testing the translation quality of the statistical phrase-based translation system.

5.1 Baseline Results

In this work, we use LGIZA1, a lightweight statistical machine translation toolkit, to train IBM Models 1-3. More information about LGIZA can be found in [6]. Different from GIZA++, LGIZA is implemented based on the original IBM Models documentation [1] without applying the later improved techniques that are integrated into GIZA++: determining word classes to give a low translation lexicon perplexity (Och, 1999), various smoothing techniques for the fertility, distortion or alignment parameters, symmetrization [10][11], etc., which are applied in GIZA++ [11]. Applying those improved techniques could therefore make the comparison slightly noisy.
Table 3 presents the BLEU scores for each specific IBM Model trained on a bilingual corpus of 60,000 parallel sentences (0.65 million tokens). A small note here: for English and Vietnamese, as an example of a pair of languages that are quite different linguistically, the results of IBM Model 3 are usually not as good as those of IBM Model 2.
Model         BLEU(%)
IBM Model 1   19.07
IBM Model 2   19.54
IBM Model 3   18.70

Table 3: Using IBM Models as the baseline.

5.2 Evaluation On The Bootstrapping Word Alignment Algorithm

These experimental evaluations focus on the impact of applying our bootstrapping word alignment algorithm. For each evaluation, with the same set N serving as the covering range, we take different smoothing factors as the difference between two thresholds ∆_i and ∆_{i+1} (we assume it is a constant). Tables 4-7 show the results when we apply our bootstrapping word alignment algorithm for each specific smoothing factor and each set N.

1 LGIZA is available on: />
N     Smooth   BLEU Score   Gain
40    1        19.54        +0.47
80    1        19.55        +0.48
120   1        19.71        +0.64
160   1        19.57        +0.50

Table 4: Improving results when setting the smoothing factor value to 1

N     Smooth   BLEU Score   Gain
40    2        19.49        +0.42
80    2        19.55        +0.48
120   2        19.66        +0.59
160   2        19.63        +0.56

Table 5: Improving results when setting the smoothing factor value to 2


N     Smooth   BLEU Score   Gain
40    4        19.41        +0.34
80    4        19.38        +0.31
120   4        19.53        +0.46
160   4        19.53        +0.46

Table 6: Improving results when setting the smoothing factor value to 4
N     Smooth   BLEU Score   Gain
40    8        19.39        +0.32
80    8        19.40        +0.33
120   8        19.48        +0.41
160   8        19.38        +0.31

Table 7: Improving results when setting the smoothing factor value to 8

Clearly, the lesson from our evaluation is that a better quality of statistical machine translation can be gained using our training smoothing scheme. Usually, a more refined smoothing factor (a smaller value), together with a larger smoothing range (N), helps us obtain better translation quality. In fact, when the size of the set N is chosen too large, it is not certain that we will obtain a better result (for example, 160 vs. 120). This comes from the fact that for words which appear more often, the probability of being "pegged" by a wrong alignment pair is smaller than for words which appear fewer times. Therefore, we do not need a constant smoothing factor that treats the different occurrence levels of words in the same way.

5.3 Evaluation On The Upgraded Version

The above experimental evaluations of our bootstrapping word alignment algorithm clearly show that it improves the translation quality of IBM Model 1. In almost all cases, we are able to boost the quality of the translation models by around 0.5% BLEU score. However, re-training too many times becomes too costly. Similarly, using a constant smoothing factor that treats the different occurrence levels of words in the same way is not good, as described above.

In this section, we improve our bootstrapping word alignment algorithm by improving the way of choosing the smoothing factors. That is, we do not choose the same smoothing factor for all bootstrapping iterations. Alternatively, we take the smoothing factors between each ∆_i and ∆_{i+1} as a consecutive sequence:

    S : {0, 1, 2, 3, 4, 5, . . . , n − 1},

and we get the corresponding set N:

    N : {0, 1, 3, 6, 10, 15, . . . , ∆_n}.

This comes from the fact well described by the English and Vietnamese word statistical tables: for low-frequency words, the rarer a word is, the more its ANBA index over-fits, while the effect is much milder for words that occur more often.

We integrate this improved way of choosing the set N corresponding to the set S into our upgraded version of the bootstrapping word alignment algorithm, and we found that it yields similar improvements. In more detail, Tables 8 and 9 present how Model 1 and Model 2 are improved (denoted by the increase in BLEU score) using our refining scheme. In fact, with our upgraded version, the improved choice of the set N not only requires much less computational power but also covers a larger range of word occurrence classes. In practice, we found that we only need to run the bootstrapping word alignment algorithm around 5 times for IBM Models 1-2, compared to our original implementation.
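As a small illustration of this construction, the thresholds in N can be generated as cumulative sums of the gaps listed in S; this snippet is only our reading of the schedule described above.

from itertools import accumulate

def threshold_schedule(num_steps):
    """Build N = {0, 1, 3, 6, 10, ...} from the gap sequence S = {0, 1, ..., num_steps - 1}."""
    return list(accumulate(range(num_steps)))   # cumulative sums of the gaps, i.e. triangular numbers

# Example: threshold_schedule(6) -> [0, 1, 3, 6, 10, 15]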
Size of S   BLEU Score   Gain
14          19.45        +0.39
15          19.44        +0.38
16          19.71        +0.65
17          19.47        +0.41
18          19.53        +0.47

Table 8: Evaluation on the refining scheme - Improving IBM Model 1


Size of S   BLEU Score   Gain
14          20.21        +0.67
15          20.12        +0.58
16          20.01        +0.47
17          20.02        +0.48
18          19.88        +0.34

Table 9: Evaluation on the refining scheme - Improving IBM Model 2



5.4 Evaluation On Improving Fertility Model

In this section, we come to one of the most interesting points that we want to emphasize: the boost in translation quality of the fertility models, based on the improvement in lexical translation modelling. Actually, the 0.5% BLEU improvement is not itself very important. The most important thing is that, by improving lexical translation modelling, the higher fertility models boost their translation quality by around 1% BLEU score compared to the original implementation.
To explain the boosting capacity of the higher translation models, we first consider the new word statistical tables obtained when we apply our bootstrapping word alignment algorithm. These ANBA index tables are derived from the test with the cardinality of S equal to 14 in the evaluation of Table 9. We can see clearly that the ANBA indices are much less divergent than in the results trained by our original method, since the dominant pairs have been filtered out.
Freq   Num. of Words   ANBA   Deviation
1      5874            2.07   0.49
2      2350            1.59   0.05
3      1444            1.46   0.01
4      963             1.35   0.00
5      698             1.27   0.01
6      604             1.23   0.02
7      471             1.21   0.03
8      389             1.21   0.03
9      301             1.19   0.03
10     278             1.11   0.07
σ = 0.73

Table 10: The Upgraded ANBA Statistical Table for Vietnamese Words

Freq   Num. of Words   ANBA   Deviation
1      11815           3.63   2.94
2      4219            2.18   0.07
3      2190            1.93   0.00
4      1350            1.82   0.01
5      932             1.72   0.03
6      681             1.63   0.08
7      524             1.63   0.08
8      412             1.56   0.13
9      322             1.56   0.13
10     305             1.48   0.19
σ = 3.66

Table 11: The Upgraded ANBA Statistical Table for English Words
It appears that we now have a better balance in the ANBA indices. This gives us better initial translation parameters to use for training the fertility models. In addition to the better initial parameters transferred to the fertility model, and more importantly, there is no "trick" that allows us to train IBM Model 3 quickly while considering all the possible alignments of each pair of parallel sentences in our training data when finding the best "Viterbi" alignments. Our strategy is to carry out the sums of translations only over some of the more probable alignments, ignoring the vast sea of much less probable ones. Since we begin with the most probable alignment that we can find and then include all alignments that can be obtained from it by small changes, the improved results in lexical modelling are very important.
Based on the original equations from [9], and following our better "Viterbi" alignments from IBM Model 2, we obtain new translation probabilities and position alignment probabilities derived from the "Viterbi" alignment sequences produced by applying the bootstrapping word alignment algorithm when training IBM Models 1-2:

    t(f|e) = \frac{count(f, e)}{\sum_{f} count(f, e)}    (8)

    a(i|j, l_e, l_f) = \frac{count(i, j, l_e, l_f)}{\sum_{i} count(i, j, l_e, l_f)}    (9)
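The following Python sketch shows one way these relative-frequency estimates could be collected from the pegged Viterbi alignments; the input format (a list of (j, i) index pairs per sentence pair) is our own assumption and is not specified in the paper.

from collections import defaultdict

def init_model2_params(corpus, alignments):
    """Relative-frequency estimates of equations (8) and (9) from Viterbi alignments.

    corpus: list of (f_tokens, e_tokens) sentence pairs.
    alignments: for each sentence pair, a list of (j, i) pairs meaning that
        source position j is aligned to target position i.
    Returns (t, a): t[(f, e)] = t(f|e), a[(i, j, le, lf)] = a(i|j, le, lf).
    """
    t_count = defaultdict(float)
    t_norm = defaultdict(float)       # sums over f for a fixed e, eq. (8) denominator
    a_count = defaultdict(float)
    a_norm = defaultdict(float)       # sums over i for fixed (j, le, lf), eq. (9) denominator

    for (f_sent, e_sent), links in zip(corpus, alignments):
        lf, le = len(f_sent), len(e_sent)
        for j, i in links:
            f, e = f_sent[j], e_sent[i]
            t_count[(f, e)] += 1
            t_norm[e] += 1
            a_count[(i, j, le, lf)] += 1
            a_norm[(j, le, lf)] += 1

    t = {(f, e): c / t_norm[e] for (f, e), c in t_count.items()}
    a = {(i, j, le, lf): c / a_norm[(j, le, lf)] for (i, j, le, lf), c in a_count.items()}
    return t, a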

Hence, we have new initial translation probabilities and new initial alignment probabilities. By using them as the initial parameters transferred to the training of IBM Model 3, we show that we obtain better translation quality, as described in Table 12:
Size of S   BLEU Score   Gain
14          19.76        +1.06
15          19.58        +0.88
16          19.81        +1.11
17          19.68        +0.98
18          19.56        +0.86

Table 12: Improving IBM Model 3
It is clear from the above evaluation results that by improving the IBM Model 1 and 2 training scheme, we obtain a better SMT system, with a BLEU score increase of around 1%. Another interesting point is that in almost all cases we gain a better result, and the improvement is almost always around 1% BLEU score.
Even more appealingly, we can see that by applying our improvement we do not deeply change the monotonicity of the ANBA index. It means that we change it only slightly, yet the resulting improvement is quite impressive. This gives us a strong belief that in the future, by applying other, better training schemes that change the monotonicity of the ANBA index more deeply, we will be able to obtain an even more impressive SMT system quality.

5.5 Evaluation On Larger Training Data

To see whether our improvement can be applied to other language pairs, in this section we deploy our improved training scheme on the English-French language pair. This also allows us to test its influence on larger training data. We apply the improved bootstrapping algorithm to training data containing 100,000 English-French parallel sentences (4 million tokens). The original ANBA index tables obtained when running the original IBM Models 1-2 (the BLEU score is 24.01) are denoted in Table 13 and 14. We choose the size of the set S to be 14. After applying our bootstrapping word alignment algorithm, we obtain a better SMT system with a BLEU score of 25.17.



Freq   Num. of Words   ANBA   Deviation
1      12230           3.72   2.49
2      3880            2.80   0.43
3      2418            2.46   0.10
4      1606            2.16   0.00
5      1140            1.95   0.04
6      858             1.89   0.06
7      670             1.73   0.17
8      483             1.65   0.24
9      432             1.55   0.35
10     402             1.50   0.41
σ = 4.30

Table 13: The ANBA Statistical Table for French Words
Freq   Num. of Words   ANBA   Deviation
1      7910            4.43   3.74
2      3146            3.12   0.39
3      1752            2.63   0.02
4      1223            2.50   0.00
5      898             2.29   0.04
6      676             2.21   0.08
7      540             2.10   0.16
8      463             1.98   0.27
9      390             1.89   0.37
10     339             1.80   0.48
σ = 5.55

Table 14: The ANBA Statistical Table for English Words
The upgraded ANBA indices are denoted in Table 15 and 16. Finally, the boosting of the fertility model is denoted in Table 17.
Freq   Num. of Words   ANBA   Deviation
1      12230           2.48   0.64
2      3880            1.96   0.08
3      2418            1.81   0.02
4      1606            1.66   0.00
5      1140            1.56   0.01
6      858             1.58   0.01
7      670             1.50   0.03
8      483             1.43   0.06
9      432             1.40   0.08
10     402             1.40   0.08
σ = 1.01

Table 15: The Upgraded ANBA Statistical Table for French Words
The original BLEU score of IBM Model 3 for this training data is 24.33. Our improvement obtained a BLEU score of 26.05, a significantly better result compared to the original one. It appears that our bootstrapping algorithm improves the quality of the statistical phrase-based translation system even more for the larger training data.

Freq   Num. of Words   ANBA   Deviation
1      7910            2.95   1.10
2      3146            2.15   0.06
3      1752            2.07   0.02
4      1223            1.87   0.00
5      898             1.76   0.02
6      676             1.72   0.03
7      540             1.67   0.05
8      463             1.65   0.06
9      390             1.57   0.11
10     339             1.59   0.10
σ = 1.57

Table 16: The Upgraded ANBA Statistical Table for English Words

                BLEU Score   Gain
Baseline        24.33
Bootstrapping   26.05        +1.72

Table 17: Improving IBM Model 3

6. CONCLUSION AND FUTURE WORK

Under a word-based approach, frequent words with a consistent translation can be aligned at a high rate of precision. However, words that are less frequent or exhibit diverse translations do not have statistically significant evidence for confident alignment, thereby leading to incomplete or incorrect alignments. We have presented this aspect based on the proposed ANBA index. We have also pointed out that a pure statistical IBM translation model is not enough to reach the state of the art in word alignment modelling.
To overcome this problem, we presented an effective bootstrapping word alignment algorithm. In addition, we found that there are other effective methods, quite simple and easy to apply, for boosting the quality of IBM Models 1-2 and hence improving the overall machine translation quality. Following this scheme and other methods, we believe that these improvements will allow the fertility models to reach the state of the art in statistical alignment quality.

7. ACKNOWLEDGEMENT

This work is partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED), project code 102.99.35.09. This work is also partially supported by the project KC.01.TN04/11-15.

8. REFERENCES

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and
R. L. Mercer. The mathematics of statistical machine
translation: parameter estimation. Comput. Linguist.,
19:263–311, June 1993.
[2] C. Callison-Burch, D. Talbot, and M. Osborne.
Statistical machine translation with word- and
sentence-aligned parallel corpora. In Proceedings of the
42nd Annual Meeting on Association for
Computational Linguistics, ACL ’04, Stroudsburg, PA,
USA, 2004. Association for Computational Linguistics.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[4] A. Fraser and D. Marcu. Measuring word alignment quality for statistical machine translation. Comput. Linguist., 33:293–303, Sept. 2007.
[5] C. Hoang, A. Le, P. Nguyen, and T. Ho. Exploiting non-parallel corpora for statistical machine translation. In Proceedings of The 9th IEEE-RIVF International Conference on Computing and Communication Technologies, pages 97–102. IEEE Computer Society, 2012.
[6] C. Hoang, A. Le, and B. Pham. A systematic comparison of various statistical alignment models for statistical English-Vietnamese phrase-based translation (to appear). In Proceedings of The 4th International Conference on Knowledge and Systems Engineering. IEEE Computer Society, 2012.
[7] S. J. Ker and J. S. Chang. A class-based approach to word alignment. Comput. Linguist., 23:313–343, June 1997.
[8] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics.
[9] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[10] F. J. Och and H. Ney. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 440–447, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.
[11] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Comput. Linguist., 29:19–51, March 2003.
[12] F. J. Och and H. Ney. The alignment template approach to statistical machine translation. Comput. Linguist., 30:19–51, June 2004.
[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[14] D. Vilar, M. Popović, and H. Ney. AER: Do we need to "improve" our alignments? In International Workshop on Spoken Language Translation, pages 205–212, Kyoto, Japan, Nov. 2006.
