
2012 International Conference on Asian Language Processing

Improving the Quality of Word Alignment by Integrating Pearson's Chi-square Test Information
Cuong Hoang (1), Cuong Anh Le (1), Son Bao Pham (1,2)
(1) University of Engineering and Technology, Vietnam National University, Hanoi
(2) Information Technology Institute, Vietnam National University, Hanoi
{cuongh.mi10, cuongla, sonpb}@vnu.edu.vn
Abstract—Previous research mainly focuses on approaches that are essentially inspired by the log-linear model background in machine learning, or on other adaptations. However, few studies focus deeply on improving word-alignment models to enhance the quality of the phrase translation table. This research follows that approach. The experiments show that this scheme improves the quality of the word-alignment component. As a result, the improvement raises the overall quality of the translation system by around 1% on the BLEU score metric.

Keywords—Machine Translation, Pearson's Chi-Square Test, Word-Alignment Model, Log-Linear Model

I. INTRODUCTION
Modern Statistical Machine Translation (SMT) systems are usually built from log-linear models [1][2]. In addition, the best performing systems are based in some way on phrases (or groups of words) [2][3]. The basic idea of phrase-based translation is to learn to break a given source sentence into phrases, then translate each phrase, and finally compose the target sentence from these phrase translations.
The phrase learning step in a statistical phrase-based translation system usually relies on the alignments between words. To find the best alignments between phrases, we first generate word alignments; phrase alignments are then heuristically "extracted" from them [3]. In fact, previous experiments point out that the automatic word-based alignment process is a vital component of an SMT system [3].
There are many studies, inspired by the log-linear model background in machine learning or other adaptations [4][5][6], that focus on improving the quality of the translation system. However, not many works scrutinize how to enhance the quality of the word-alignment component in order to improve the accuracy of the phrase translation table and hence enhance the system overall [7].
Following the second scheme, in this paper we focus on improving the accuracy of word alignment modelling. Indeed, with the accuracy gains yielded by improving the quality of the word translation modelling, we found that the quality of a phrase-based SMT system can obtain quite a good improvement. Besides, this research focuses on the aspect that improving the lexical translation modelling allows us to "boost" the "hidden" merit of the higher IBM alignment models and therefore improves the quality of statistical phrase-based translation (for convenience, we use the term "lexical models" for IBM Models 1-2 and "higher models" for IBM Models 3-5). We found that this is quite an important aspect; however, only a few works concentrate deeply on that point [8].
II. IBM MODELS AND THEIR EFFECTS ON HIGHER ALIGNMENT MODELS

A. IBM Model 1

Model 1 is a probabilistic generative model within a framework that assumes a source sentence $f_1^J$ of length $J$ translates as a target sentence $e_1^I$ of length $I$. It is defined as a particularly simple instance of this framework, by assuming that all possible lengths for $f_1^J$ (less than some arbitrary upper bound) have a uniform probability $\epsilon$. [9] yields the following summarizing equation:

$$\Pr(f \mid e) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i) \qquad (1)$$
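For illustration only, equation (1) can be evaluated directly once a t-table is available. The following minimal Python sketch assumes a dictionary-based t-table and a designated NULL token for $e_0$; these conventions are our own and not part of any toolkit used in the experiments.

```python
def model1_likelihood(f_words, e_words, t, epsilon=1.0):
    """Sketch of equation (1): Pr(f|e) under IBM Model 1.

    f_words: source sentence f_1^J as a list of tokens
    e_words: target sentence e_1^I as a list of tokens
    t:       dict mapping (f, e) pairs to translation probabilities t(f|e)
    """
    e_with_null = ["NULL"] + e_words            # e_0 is the empty (NULL) word
    I, J = len(e_words), len(f_words)
    prob = epsilon / float((I + 1) ** J)        # uniform length term
    for f in f_words:                           # product over source positions j
        prob *= sum(t.get((f, e), 0.0) for e in e_with_null)  # sum over i = 0..I
    return prob
```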

The parameters of Model 1 for a given pair of languages are normally estimated using EM [10]. We call the expected number of times that word $e_i$ connects to $f_j$ in the translation pair $(f_1^J \mid e_1^I)$ the count of $f_j$ given $e_i$, and denote it by $c(f_j \mid e_i; f_1^J, e_1^I)$. Following the mathematical derivation in [9], $c(f_j \mid e_i; f_1^J, e_1^I)$ can be calculated by the following equation:
$$c(f_j \mid e_i; f_1^J, e_1^I) = \frac{t(f_j \mid e_i)}{\sum_{i=0}^{I} t(f_j \mid e_i)} \cdot \sum_{j=1}^{J} \sigma(f_1^J, f_j) \cdot \sum_{i=0}^{I} \sigma(e_1^I, e_i) \qquad (2)$$


In addition, we set $\lambda_e$ as a normalization factor and then iteratively re-estimate the translation probability of a word $f_j$ in $f_1^J$ given a word $e_i$ in $e_1^I$ as:

$$t(f_j \mid e_i) = \lambda_e^{-1} \, c(f_j \mid e_i; f_1^J, e_1^I) \qquad (3)$$
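As an illustrative sketch of how equations (2) and (3) drive the training loop, one EM iteration might be implemented as follows. The corpus representation and the requirement that t be initialized (for example, uniformly) over all co-occurring pairs, including the NULL word, are our assumptions, not a description of the toolkit used later.

```python
from collections import defaultdict

def model1_em_step(corpus, t):
    """One EM iteration for IBM Model 1, following equations (2) and (3).

    corpus: list of (f_words, e_words) sentence pairs
    t:      dict (f, e) -> t(f|e); must be nonzero for all co-occurring pairs
    Returns the re-estimated t-table.
    """
    counts = defaultdict(float)   # expected counts c(f|e; ...) over the corpus
    totals = defaultdict(float)   # the lambda_e normalization factors
    for f_words, e_words in corpus:
        e_with_null = ["NULL"] + e_words
        for f in f_words:
            denom = sum(t[(f, e)] for e in e_with_null)   # denominator of (2)
            for e in e_with_null:
                c = t[(f, e)] / denom     # expected count from equation (2)
                counts[(f, e)] += c
                totals[e] += c
    # equation (3): t(f|e) = c(f|e) / lambda_e
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}
```

Iterating this step to convergence reproduces standard Model 1 training; since the Model 1 likelihood has a single global maximum, the result does not depend on the uniform initialization.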





B. The Effect of IBM Model 1 on Higher Models

For Model 2, we make the same assumptions as in Model 1, except that we assume $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$ depends on $j$, $a_j$, and $J$, as well as on $I$. The following equation gives the Model 2 estimate for the probability of a target sentence, given a source sentence:
$$\Pr(f \mid e) = \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i) \, a(i \mid j, J, I) \qquad (4)$$
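Under the same conventions as the Model 1 sketch above, the only change equation (4) requires is the extra alignment term a(i|j, J, I); the dictionary-based a-table below is a hypothetical stand-in.

```python
def model2_likelihood(f_words, e_words, t, a):
    """Sketch of equation (4): Pr(f|e) under IBM Model 2.

    t: dict (f, e) -> t(f|e), as in the Model 1 sketch above
    a: dict (i, j, J, I) -> alignment probability a(i|j, J, I)
    """
    e_with_null = ["NULL"] + e_words
    I, J = len(e_words), len(f_words)
    prob = 1.0
    for j, f in enumerate(f_words, start=1):              # source positions 1..J
        prob *= sum(t.get((f, e), 0.0) * a.get((i, j, J, I), 0.0)
                    for i, e in enumerate(e_with_null))   # i = 0..I
    return prob
```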

IBM Models 3-5 yield more accurate results than Models 1-2, mainly thanks to their fertility-based scheme. However, consider the original and general equation for IBM Models 3-5 proposed by [9], which describes the "joint likelihood" of a tableau, $\tau$, and a permutation, $\pi$:
$$\begin{aligned}
\Pr(\tau, \pi \mid e) = {} & \prod_{i=1}^{I} \Pr(\phi_i \mid \phi_1^{i-1}, e) \cdot \Pr(\phi_0 \mid \phi_1^{I}, e) \\
& \times \prod_{i=0}^{I} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^I, e) \\
& \times \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^I, \phi_0^I, e) \\
& \times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^I, \tau_0^I, \phi_0^I, e)
\end{aligned} \qquad (5)$$

We can see that the alignment position information and the fertility information are great ideas. However, the problem is quite intuitive here: for Models 1 and 2, we restrict ourselves to alignments that are purely "lexical" connections, where each "cept" is either a single source word or the NULL word. In contrast, as in equations (4) and (5), the higher models use these lexical translation probabilities as initial values for parameterizing $\phi$, for calculating $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)$, and for the other translation probabilities. This fact deeply reduces the value of these higher models, because the lexical translation probabilities are quite error-prone.


III. IMPROVING IBM MODELS

A. The Problem

Basically, in order to estimate the word translation probability $t(f_j \mid e_i)$, we consider the translation probabilities of all possible equivalent words $e_k$ $(k \neq i)$ in $e_1^I$ for $f_j$. In more detail, our focus is on the case in which two words $f_j$ and $e_i$ co-occur many times. The lexical translation probability $t(f_j \mid e_i)$ can then take a high value even though the two words $f_j$ and $e_i$ actually have no "meaning relationship" in linguistics; they simply appear together many times by "chance". This case leads to the following important consequence.
That is, a "correct" translation word of $e_i$, for example $f_k$, never gains as high a translation probability as it should. The translation probability $t(f_k \mid e_i)$ is small for two reasons. First, it is not certain that the two words co-occur many times. Second, it suffers from the impact of the "noisy" probability $t(f_j \mid e_i)$. This is quite important. In addition, [8] points out that $t(f_k \mid e_i)$ will usually be smaller than $t(f_j \mid e_i)$ when $f_j$ simply occurs less frequently than $f_k$.
As [8] points out, a purely statistical method based on co-occurrence alone can hardly abate these "wrong" translation probabilities $t(f_j \mid e_i)$. Previous works mainly focus on integrating syntactic knowledge to improve alignment quality [11]. In this work, we propose a new approach: we combine the traditional IBM models with another statistical method, Pearson's Chi-square test.

B. Adding Mutual Information

In fact, the essence of Pearson's Chi-square test is to compare the observed frequencies in a table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence. In the simplest case, the $X^2$ test is applied to 2-by-2 tables. The $X^2$ statistic sums the differences between observed and expected values over all cells of the table, scaled by the magnitude of the expected values, as follows:

$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (6)$$

where $i$ ranges over the rows of the table (sometimes called a "contingency table"), $j$ ranges over the columns, $O_{ij}$ is the observed value for cell $(i, j)$, and $E_{ij}$ is the expected value. [12] realized that this "independence" information seems to be a particularly good choice. They actually used a measure which they call $\phi^2$, an $X^2$-like statistic whose value is bounded between 0 and 1. For more detail on how to calculate $\phi^2$, please refer to [12]. Admittedly, the performance of identifying word correspondences using the $\phi^2$ method alone is not as good as using IBM Model 1 together with the EM training scheme [13]. However, we believe this information is quite valuable. We adjust the equation for each IBM model to gain more accurate lexical translation results. In more detail, for convenience, let $\varphi_{ij}$ denote the probability $\phi^2(e_i, f_j)$. The count $c(f \mid e; f_1^J, a_1^J, e_1^I)$ can then be calculated by the following equation:


$$c(f \mid e; f_1^J, e_1^I) = \frac{t(f_j \mid e_i) \cdot (\lambda + (1-\lambda) \cdot \varphi_{ij})}{\sum_{i=0}^{I} t(f_j \mid e_i) \cdot (\lambda + (1-\lambda) \cdot \varphi_{ij})} \cdot \sum_{j=1}^{J} \sigma(f_1^J, f_j) \cdot \sum_{i=0}^{I} \sigma(e_1^I, e_i) \qquad (7)$$

The parameter $\lambda$ defines the weight of the Pearson's Chi-square test information. Good values for this parameter are around 0.3.
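To make the proposal concrete, the sketch below first computes the φ² statistic of [12] from a 2-by-2 contingency table and then folds the λ-interpolated weight of equation (7) into the Model 1 count collection. The data structures and the use of Model 1 as the host model are illustrative assumptions, not a description of our actual implementation.

```python
from collections import defaultdict

def phi_squared(a, b, c, d):
    """phi^2 statistic of [12] for a 2-by-2 contingency table.

    a: sentence pairs containing both e and f;  b: pairs with e only
    c: pairs with f only;                       d: pairs with neither
    The value is bounded between 0 and 1.
    """
    denom = float((a + b) * (a + c) * (b + d) * (c + d))
    return ((a * d - b * c) ** 2) / denom if denom else 0.0

def interpolated_em_step(corpus, t, phi2, lam=0.3):
    """EM count collection with the interpolated weight of equation (7).

    phi2: dict (f, e) -> phi^2(e, f), precomputed from contingency tables
    lam:  the interpolation weight lambda; around 0.3, as suggested above
    """
    counts, totals = defaultdict(float), defaultdict(float)
    for f_words, e_words in corpus:
        e_with_null = ["NULL"] + e_words
        for f in f_words:
            # t(f|e) * (lambda + (1 - lambda) * phi^2), as in equation (7)
            weight = {e: t[(f, e)] * (lam + (1.0 - lam) * phi2.get((f, e), 0.0))
                      for e in e_with_null}
            denom = sum(weight.values())
            for e in e_with_null:
                c = weight[e] / denom
                counts[(f, e)] += c
                totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}
```

Note that setting λ = 1 collapses the weight to t(f|e), so the sketch reduces to standard EM training and the φ² information can be switched off for comparison.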



IV. EXPERIMENT




A. The Preparation

This experiment is deployed on two pairs of languages, English-Vietnamese (E-V) and English-French (E-F), in order to obtain accurate and reliable results. The 60,000-pair E-V training data is credited to [14]. Similarly, the 60,000-pair E-F training corpus is drawn from the Hansards corpus [13]. In this work, we directly test our improved method on the phrase-based translation system as a whole. We learn phrase alignments from a corpus that has been word-aligned by the alignment toolkit. Our phrase learning component uses the best "Viterbi" sequences trained from each IBM model. We use LGIZA, a lightweight statistical machine translation toolkit that is available online, to train IBM Models 1-3; more information on LGIZA can be found in [15].
We deploy the training scheme 1^5 2^3 3^3 (five iterations of Model 1, three of Model 2, and three of Model 3) on the training data for the pair E-V; this scheme is suggested for that pair by [15] for the best performance. We also deploy the training scheme 1^5 2^5 3^3, suggested by [13], for the pair E-F. In addition, we use MOSES [16] as the phrase-based SMT framework. We measure performance using the BLEU metric [17], which estimates the accuracy of translation output with respect to a reference translation.

B. The "Boosting" Performance

We concentrate our evaluations on two aspects. The first is the "boosting" performance on each word alignment model. The second is the "boosting" performance of a higher alignment model based on its own improvement plus the improvement inherited from the lower alignment models. Table I describes the "boosting" results of our proposed method for the E-V translation system. Similarly, Table II describes the improvements for the pair E-F.

1) Improving Word Alignment Quality: From Tables I and II, we can see that adding the Pearson's Chi-square information improves the quality of our systems. It boosts the performance by around 0.51% when IBM Model 1 is used as the word-alignment component for the pair E-V, and by 0.44% for the pair E-F. We obtain similar improvements for the other IBM models.
There are two interesting findings here. First, we obtain a larger improvement for the fertility-based alignment models (0.84% for the first pair and 0.61% for the second pair). Second, we obtain a larger improvement for the pair E-V than for the second pair. This is consistent with the previous work of [15], which points out that modelling word alignment for a pair of languages that differ considerably in grammar, such as E-V, is more difficult than for other cases. It means that for such a pair, the t translation parameter is less accurate.

2) The Total Performance: This experimental evaluation focuses on another aspect that we concentrate on. That is, for the training scheme (for example, 1^5 2^3 3^3 for the pair E-V or 1^5 2^5 3^3 for the pair E-F), we examine whether simultaneously improving not only Models 1-2 but also Model 3 yields a better improvement than simply improving Model 3 alone. The experimental result is quite impressive (1.24% vs. 0.84%, and 1.14% vs. 0.61%). This strongly confirms our analysis in Section II: improving the lexical translation modelling allows us to "boost" all the "hidden" power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation.


V. CONCLUSION

Phrase-based models represent the current state of the art in statistical machine translation. Accordingly, the phrase learning step in statistical phrase-based translation is critically important. This work focuses on an approach in which we integrate Pearson's Chi-square test information into the IBM models to obtain better performance. We directly test our improvement on the overall system for better accuracy. Besides, we also point out that improving the lexical translation modelling allows us to "boost" all the "hidden" power of the higher IBM models and therefore deeply improves the quality of statistical phrase-based translation. In summary, we believe attacking lexical translation is a good way to improve the overall quality of statistical phrase-based translation. Hence, with improved phrase learning together with integrated linguistic information, the quality of statistical phrase-based systems could move close to the state of the art.

IBM Model                      Baseline BLEU   Improved BLEU   Delta
Model 1                        19.00           19.51           0.51
Model 2                        19.58           20.05           0.47
Model 3                        18.88           19.72           0.84
Model 2 (+Improved Model 1)    19.58           20.31           0.73
Model 3 (+Improved Model 1-2)  18.88           20.12           1.24

Table I. The "boosting" performance for the English-Vietnamese translation system.

IBM Model                      Baseline BLEU   Improved BLEU   Delta
Model 1                        25.75           26.19           0.44
Model 2                        26.51           26.79           0.28
Model 3                        26.30           26.91           0.61
Model 2 (+Improved Model 1)    26.51           27.05           0.54
Model 3 (+Improved Model 1-2)  26.30           27.44           1.14

Table II. The "boosting" performance for the English-French translation system.

VI. ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University, Hanoi. It is also supported by the project KC.01.TN04/11-15.

REFERENCES

[1] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), Stroudsburg, PA, USA, 2002, pp. 295-302.
[2] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417-449, 2004.
[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), Volume 1, Stroudsburg, PA, USA, 2003, pp. 48-54.
[4] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), Stroudsburg, PA, USA, 2005, pp. 263-270.
[5] G. Sanchis-Trilles and F. Casacuberta, "Log-linear weight optimisation via Bayesian adaptation in statistical machine translation," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10): Posters, Stroudsburg, PA, USA, 2010, pp. 1077-1085.
[6] J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà, "N-gram-based machine translation," Computational Linguistics, vol. 32, no. 4, pp. 527-549, Dec. 2006.
[7] D. Vilar, M. Popović, and H. Ney, "AER: Do we need to "improve" our alignments?" in International Workshop on Spoken Language Translation, Kyoto, Japan, Nov. 2006, pp. 205-212.
[8] C. Hoang, A. Le, and B. Pham, "Refining lexical translation training scheme for improving the quality of statistical phrase-based translation (to appear)," in Proceedings of the 3rd International Symposium on Information and Communication Technology, ACM, 2012.
[9] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, pp. 263-311, June 1993.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[11] H. Wu, H. Wang, and Z. Liu, "Alignment model adaptation for domain-specific word alignment," in ACL, 2005.
[12] W. A. Gale and K. W. Church, "Identifying word correspondence in parallel texts," in Proceedings of the Workshop on Speech and Natural Language (HLT '91), Stroudsburg, PA, USA, 1991, pp. 152-157.
[13] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, pp. 19-51, March 2003.
[14] C. Hoang, A. Le, P. Nguyen, and T. Ho, "Exploiting non-parallel corpora for statistical machine translation," in Proceedings of the 9th IEEE-RIVF International Conference on Computing and Communication Technologies, IEEE Computer Society, 2012, pp. 97-102.
[15] C. Hoang, A. Le, and B. Pham, "A systematic comparison of various statistical alignment models for statistical English-Vietnamese phrase-based translation (to appear)," in Proceedings of the 4th International Conference on Knowledge and Systems Engineering, IEEE Computer Society, 2012.
[16] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL: Interactive Poster and Demonstration Sessions (ACL '07), Stroudsburg, PA, USA, 2007, pp. 177-180.
[17] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), Stroudsburg, PA, USA, 2002, pp. 311-318.

