VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2014
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
HAI-LONG TRIEU
BILINGUAL SENTENCE ALIGNMENT
BASED ON SENTENCE LENGTH AND
WORD TRANSLATION
Major: Computer science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: PhD. Phuong-Thai Nguyen
Hanoi - 2014
ORIGINALITY STATEMENT
"I hereby declare that this submission is my own work and that, to the best of my
knowledge, it contains no material previously published or written by another person, nor
substantial proportions of material which have been accepted for the award of any other
degree or diploma at the University of Engineering and Technology (UET) or any other
educational institution, except where due acknowledgement is made in the thesis. I also
declare that the intellectual content of this thesis is the product of my own work, except to
the extent that assistance from others in the project's design and conception or in style,
presentation and linguistic expression is acknowledged."
Signed
Acknowledgements
I would like to thank my advisor, PhD. Phuong-Thai Nguyen, not only for his
supervision but also for his enthusiastic encouragement, sound suggestions, and the
knowledge he has given me throughout my Master's course. I would also like to express
my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information
Technology - Vietnam Academy of Science and Technology - who provided valuable
data for my evaluation process. I would like to thank PhD. Van-Vinh Nguyen for
examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong
Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh
Nguyen for supporting and checking several issues in my research.
In addition, I would like to express my thanks to the lecturers and professors of the
Faculty of Information Technology, University of Engineering and Technology (UET),
Vietnam National University, Hanoi, who have taught and helped me throughout my
time at UET.
Finally, I would like to thank my family and friends for their support, sharing, and
confidence throughout my study.
Abstract
Sentence alignment plays an important role in machine translation. It is an essential
task in processing parallel corpora, which are ample and substantial resources for natural
language processing. In order to turn these abundant materials into useful applications,
parallel corpora first have to be aligned at the sentence level.
This process maps sentences in texts of the source language to their corresponding units
in texts of the target language. Parallel corpora aligned at the sentence level become a
useful resource for a number of applications in natural language processing, including
statistical machine translation, word sense disambiguation, and cross-language
information retrieval. This task also helps to extract structural information and derive
statistical parameters from bilingual corpora.
A number of algorithms with different approaches have been proposed for
sentence alignment; they may be classified into a few major categories. First of
all, there are methods based on the similarity of sentence lengths, measured in
words or characters. These methods are simple but effective for language pairs
that have a high similarity in sentence lengths. The second set of methods
is based on word correspondences or a lexicon. These methods take into account the
lexical information of the texts, either by matching content across the texts or by using
cognates. An external dictionary may be used, so these methods are more accurate but
slower than the first ones. There are also hybrid methods that combine the first two
approaches and their advantages, and thus obtain alignments of quite high quality.
In this thesis, I summarize general issues related to sentence alignment, evaluate
approaches proposed for this task, and focus on the hybrid method, especially the proposal
of Moore (2002), an effective method with high performance in terms of precision. After
analyzing the limitations of this method, I propose an algorithm using a new feature,
bilingual word clustering, to improve the quality of Moore's method. The baseline method
(Moore, 2002) is introduced through an analysis of its framework, and I describe the
advantages as well as the weaknesses of this approach. In addition, I describe the
background knowledge and the algorithm of bilingual word clustering, and the new
feature used in sentence alignment. Finally, the experiments performed in this research
are presented, along with evaluations that demonstrate the benefits of the proposed method.
Keywords: sentence alignment, parallel corpora, natural language processing, word
clustering.
Table of Contents
ORIGINALITY STATEMENT 3
Acknowledgements 4
Abstract 5
Table of Contents 6
List of Figures 9
List of Tables 10
CHAPTER ONE Introduction 11
1.1. Background 11
1.2. Parallel Corpora 12
1.2.1. Definitions 12
1.2.2. Applications 12
1.2.3. Aligned Parallel Corpora 12
1.3. Sentence Alignment 12
1.3.1. Definition 12
1.3.2. Types of Alignments 12
1.3.3. Applications 15
1.3.4. Challenges 15
1.3.5. Algorithms 16
1.4. Thesis Contents 16
1.4.1. Objectives of the Thesis 16
1.4.2. Contributions 17
1.4.3. Outline 17
1.5. Summary 18
CHAPTER TWO Related Works 19
2.1. Overview 19
2.2. Overview of Approaches 19
2.2.1. Classification 19
2.2.2. Length-based Methods 19
2.2.3. Word Correspondences Methods 21
2.2.4. Hybrid Methods 21
2.3. Some Important Problems 22
2.3.1. Noise of Texts 22
2.3.2. Linguistic Distances 22
2.3.3. Searching 23
2.3.4. Resources 23
2.4. Length-based Proposals 23
2.4.1. Brown et al., 1991 23
2.4.2. Vanilla: Gale and Church, 1993 24
2.4.3. Wu, 1994 27
2.5. Word-based Proposals 27
2.5.1. Kay and Roscheisen, 1993 27
2.5.2. Chen, 1993 27
2.5.3. Melamed, 1996 28
2.5.4. Champollion: Ma, 2006 29
2.6. Hybrid Proposals 30
2.6.1. Microsoft’s Bilingual Sentence Aligner: Moore, 2002 30
2.6.2. Hunalign: Varga et al., 2005 31
2.6.3. Deng et al., 2007 32
2.6.4. Gargantua: Braune and Fraser, 2010 33
2.6.5. Fast-Champollion: Li et al., 2010 34
2.7. Other Proposals 35
2.7.1. Bleu-align: Sennrich and Volk, 2010 35
2.7.2. MSVM and HMM: Fattah, 2012 36
2.8. Summary 37
CHAPTER THREE Our Approach 39
3.1. Overview 39
3.2. Moore's Approach 39
3.2.1. Description 39
3.2.2. The Algorithm 40
3.3. Evaluation of Moore's Approach 42
3.4. Our Approach 42
3.4.1. Framework 42
3.4.2. Word Clustering 43
3.4.3. Proposed Algorithm 45
3.4.4. An Example 49
3.5. Summary 50
CHAPTER FOUR Experiments 51
4.1. Overview 51
4.2. Data 51
4.2.1. Bilingual Corpora 51
4.2.2. Word Clustering Data 53
4.3. Metrics 54
4.4. Discussion of Results 54
4.5. Summary 57
CHAPTER FIVE Conclusion and Future Work 58
5.1. Overview 58
5.2. Summary 58
5.3. Contributions 58
5.4. Future Work 59
5.4.1. Better Word Translation Models 59
5.4.2. Word-Phrase 59
Bibliography 60
List of Figures
Figure 1.1. A sequence of beads (Brown et al., 1991). 13
Figure 2.1. Paragraph length (Gale and Church, 1993). 25
Figure 2.2. Equation in dynamic programming (Gale and Church, 1993) 26
Figure 2.3. A bitext space in Melamed's method (Melamed, 1996). 29
Figure 2.4. The method of Varga et al., 2005 31
Figure 2.5. The method of Braune and Fraser, 2010 33
Figure 2.6. Sentence Alignment Approaches Review. 38
Figure 3.1. Framework of sentence alignment in our algorithm. 43
Figure 3.2. An example of Brown's cluster algorithm 44
Figure 3.3. English word clustering data 44
Figure 3.4. Vietnamese word clustering data 44
Figure 3.5. Bilingual dictionary 46
Figure 3.6. Looking up the probability of a word pair 47
Figure 3.7. Looking up in a word cluster 48
Figure 3.8. Handling in the case: one word is contained in dictionary 48
Figure 4.1. Comparison in Precision 55
Figure 4.2. Comparison in Recall 56
Figure 4.3. Comparison in F-measure 57
List of Tables
Table 1.1. Frequency of alignments (Gale and Church, 1993) 14
Table 1.2. Frequency of beads (Ma, 2006) 14
Table 1.3. Frequency of beads (Moore, 2002) 14
Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993) 15
Table 2.1. Alignment pairs (Sennrich and Volk, 2010) 36
Table 4.1. Training data-1 51
Table 4.2. Topics in Training data-1 52
Table 4.3. Training data-2 52
Table 4.4. Topics in Training data-2 52
Table 4.5. Input data for training clusters 53
Table 4.6. Topics for Vietnamese input data to train clusters 53
Table 4.7. Word clustering data sets. 54
CHAPTER ONE
Introduction
1.1. Background
Parallel corpora play an important role in a number of tasks such as machine
translation, cross-language information retrieval, word sense disambiguation, bilingual
lexicography, automatic translation verification, and automatic acquisition of knowledge
about translation. Building a parallel corpus, therefore, helps connect the languages under
consideration [1, 5, 7, 12-13, 15-16].
Parallel texts, however, are useful only once they have been sentence-aligned. A
parallel corpus is first collected from various resources, and the translated segments
forming it are very large, usually on the order of entire documents, which makes learning
word correspondences highly ambiguous. The solution is to reduce this ambiguity by
first decreasing the size of the segments within each pair, a process known as the
sentence alignment task [7, 12-13, 16].
Sentence alignment is a process that maps sentences in the text of the source language
to their corresponding units in the text of the target language [3, 8, 12, 14, 20]. This task
is the work of constructing a detailed map of the correspondence between a text and its
translation (a bitext map) [14]. This is the first stage for Statistical Machine Translation.
With aligned sentences, we can perform further analyses such as phrase and word
alignment analysis, bilingual terminology, and collocation extraction analysis as well as
other applications [3, 7-9, 17]. Efficient and powerful sentence alignment algorithms,
therefore, become increasingly important.
A number of sentence alignment algorithms have been proposed [1, 7, 9, 12, 15, 17,
20]. Some of these algorithms are based on sentence length [3, 8, 20]; some use word
correspondences [5, 11, 13-14]; some are hybrid of these two approaches [2, 6, 15, 19].
Additionally, there are also some other notable methods for this task [7, 17]. For
details of these sentence alignment algorithms, see Sections 2.4, 2.5, 2.6, and 2.7.
I propose an improvement to an effective hybrid algorithm [15] for
sentence alignment. For details of our approach, see Section 3.4. I also conduct experiments
to illustrate my research. For details of the corpora used in our experiments, see Section
4.2. For results and discussions of experiments, see Sections 4.4, 4.5.
In the rest of this chapter, I describe some issues related to the sentence alignment
task. In addition to this, I introduce objectives of the thesis and our contributions. Finally,
I describe the structure of this thesis.
1.2. Parallel Corpora
1.2.1. Definitions
Parallel corpora are a collection of documents which are translations of each other
[16]. Aligned parallel corpora are collections of pairs of sentences where one sentence is a
translation of the other [1].
1.2.2. Applications
Bilingual corpora are an essential resource in multilingual natural language processing
systems. This resource helps to develop data-driven natural language processing
approaches. This also contributes to applying machine learning to machine translation
[15-16].
1.2.3. Aligned Parallel Corpora
Once the parallel text is sentence-aligned, it provides the maximum utility [13].
This makes aligning parallel corpora a task of considerable interest, and a
number of approaches have been proposed and developed to resolve this issue.
1.3. Sentence Alignment
1.3.1. Definition
Sentence alignment is the task of extracting, from parallel corpora, pairs of sentences
that are translations of one another. Given a pair of texts, this process maps sentences in
the text of the source language to their corresponding units in the text of the target
language [3, 8, 13].
1.3.2. Types of Alignments
Aligning sentences means finding a sequence of alignments. This section provides
some further definitions of "alignment" as well as issues related to it.
Brown et al., 1991, assumed that every parallel corpus can be aligned as a
sequence of minimal alignment segments, which they call "beads", in which sentences
align 1-to-1, 1-to-2, 2-to-1, 1-to-0, or 0-to-1.
Figure 1.1. A sequence of beads (Brown et al., 1991).
Groups of sentence lengths are circled to show the correct alignment. Each of the
groupings is called a bead, and each number shows the length of a sentence in the bead.
In Figure 1.1, "17e" means the sentence length (17 words) of an English sentence, and
"19f" means the sentence length (19 words) of a French sentence. The figure shows a
sequence of beads as follows:
An ef-bead (one English sentence aligned with one French sentence) followed by
An eff-bead (one English sentence aligned with two French sentences) followed by
An e-bead (one English sentence by itself) followed by
A ¶e¶f-bead (one English paragraph marker and one French paragraph marker).
An alignment, then, is simply a sequence of beads that accounts for the observed
sequences of sentence lengths and paragraph markers [3].
There are quite a number of possible beads, but it suffices to consider only some of
them, including 1-to-1 (one source-language sentence aligned with one target-language
sentence), 1-to-2 (one source-language sentence aligned with two target-language
sentences), and so on. Brown et al., 1991 [3] considered the beads 1-to-1, 1-to-0, 0-to-1,
1-to-2, and 2-to-1, together with paragraph beads (¶e, ¶f, ¶e¶f), because their method
aligns by paragraphs. Moore, 2002 [15] considers only five of these beads, 1-to-1,
1-to-0, 0-to-1, 1-to-2, and 2-to-1, each of which is named as follows:
1-to-1 bead (a match)
1-to-0 bead (a deletion)
0-to-1 bead (an insertion)
1-to-2 bead (an expansion)
2-to-1 bead (a contraction)
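The five bead types above can be captured in a small data structure; the following sketch is purely illustrative (the names and sentence counts come from the list above, not from Moore's implementation):

```python
# The five bead types considered by Moore (2002), mapped to the number of
# source (English) and target (e.g. French) sentences each one consumes.
BEAD_TYPES = {
    "match":       (1, 1),  # 1-to-1
    "deletion":    (1, 0),  # 1-to-0
    "insertion":   (0, 1),  # 0-to-1
    "expansion":   (1, 2),  # 1-to-2
    "contraction": (2, 1),  # 2-to-1
}

def consumes(bead_sequence):
    """Total (source, target) sentences consumed by a sequence of beads."""
    src = sum(BEAD_TYPES[b][0] for b in bead_sequence)
    tgt = sum(BEAD_TYPES[b][1] for b in bead_sequence)
    return src, tgt
```

For example, consumes(["match", "expansion", "deletion"]) returns (3, 3): a valid alignment is exactly a bead sequence whose totals account for all sentences in both texts.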
A common statistic reported for corpora is the frequency of each bead type. Table 1.1
shows the frequencies of bead types reported by Gale and Church, 1993 [8].
Table 1.1. Frequency of alignments (Gale and Church, 1993)

Category        Frequency   Prob(match)
1-1             1167        0.89
1-0 or 0-1      13          0.0099
2-1 or 1-2      117         0.089
2-2             15          0.011
Total           1312        1.00
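The Prob(match) column in Table 1.1 is just each frequency normalized by the total, which can be verified in a few lines (frequencies copied from the table):

```python
# Frequencies of alignment categories from Table 1.1 (Gale and Church, 1993).
frequencies = {"1-1": 1167, "1-0 or 0-1": 13, "2-1 or 1-2": 117, "2-2": 15}

total = sum(frequencies.values())                      # 1312
probs = {cat: freq / total for cat, freq in frequencies.items()}
# probs["1-1"] is 1167/1312 ≈ 0.889, which rounds to the 0.89 in the table.
```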
Meanwhile, the frequencies reported by Ma, 2006 [13] are shown in Table 1.2:
Table 1.2. Frequency of beads (Ma, 2006)

Category        Frequency   Percentage
1-1             1306        89.4%
1-0 or 0-1      93          6.4%
1-2 or 2-1      60          4.1%
Others          2           0.1%
Total           1461
Table 1.3 describes the frequencies of bead types reported in Moore, 2002 [15]:
Table 1.3. Frequency of beads (Moore, 2002)

Category   Percentage
1-1        94%
1-2        2%
2-1        2%
1-0        1%
0-1        1%
Total      100%
Generally, the 1-to-1 bead is by far the most frequent type in almost all corpora,
accounting for around 90% of beads, whereas each of the other types accounts for only a
few percent.
1.3.3. Applications
Sentence alignment is an important topic in machine translation. It is the first step
for statistical machine translation, and also the first stage in extracting structural
and semantic information and deriving statistical parameters from bilingual corpora [17,
20]. Moreover, it is the first step toward constructing a probabilistic dictionary (Table 1.4)
for use in word alignment for machine translation, or a bilingual concordance for use
in lexicography.
Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993)

English   French   Prob(French|English)
the       le       0.610
the       la       0.178
the       l'       0.083
the       les      0.023
the       ce       0.013
the       il       0.012
the       de       0.009
the       à        0.007
the       que      0.007
1.3.4. Challenges
Although this process might seem very easy, it has some important challenges which
make the task difficult [9]:
The sentence alignment task is non-trivial because sentences do not always align 1-to-
1. At times a single sentence in one language might be translated as two or more
sentences in the other language. The input text also affects accuracy: the performance of
sentence alignment algorithms decreases significantly when the input data becomes very
noisy, where noisy data means data with more 1-0 and 0-1 alignments. For example, in an
English-French corpus (Gale and Church, 1991) there are 89% 1-1 alignments and only
1.3% 1-0 and 0-1 alignments, whereas in the UN Chinese-English corpus (Ma, 2006)
there are 89% 1-1 alignments but 6.4% 1-0 or 0-1 alignments. Although some methods
work very well on clean data, their performance drops quickly as the data becomes
noisy [13].
In addition, it is difficult to achieve perfectly accurate alignments even when the texts
are easy and "clean". For instance, an alignment program's accuracy may decline
dramatically when applied to a novel or a philosophy text, even though the same program
gives excellent results on a scientific text.
Alignment performance also depends on the languages of the corpus. For example, an
algorithm based on cognates (words in language pairs that resemble each other
phonetically) is likely to work better for English-French than for English-Hindi because
there are fewer cognates for English-Hindi [1].
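To illustrate why such cognate-based methods are language-pair dependent, here is one simple cognate heuristic in the spirit of the common prefix-matching criterion from the literature; both the threshold and the example word pairs are assumptions for illustration only:

```python
def are_cognates(w1, w2, prefix_len=4):
    """Crude cognate test: the two words share their first `prefix_len`
    characters; very short words must match exactly. This is only a
    heuristic sketch, not the criterion of any particular aligner."""
    w1, w2 = w1.lower(), w2.lower()
    if len(w1) < prefix_len or len(w2) < prefix_len:
        return w1 == w2
    return w1[:prefix_len] == w2[:prefix_len]

# English-French shares many such pairs ("parliament"/"parlement"),
# while English-Hindi transliterations rarely match this way.
```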
1.3.5. Algorithms
A sentence alignment program is called “ideal” if it is fast, highly accurate, and
requires no special knowledge about the corpus or the two languages [2, 9, 15]. A
common requirement for sentence alignment approaches is the achievement of both high
accuracy and minimal consumption of computational resources [2, 9]. Furthermore, a
method for sentence alignment should work in an unsupervised fashion and be
language-pair independent, so that it is applicable to parallel corpora in any language
without requiring a separate training set. A method is unsupervised if it learns its
alignment model directly from the data set to be aligned; language-pair independence
means that the approach requires no specific knowledge about the languages of the
parallel texts to align.
1.4. Thesis Contents
This section introduces the organization of contents in this thesis including: objectives,
our contributions, and the outline.
1.4.1. Objectives of the Thesis
In this thesis, I report the results of my study of sentence alignment and of the
approaches proposed for this task. In particular, I focus on Moore's method (2002), an
outstanding method with a number of advantages. I also introduce a new feature, word
clustering, which may be applied to this task to improve alignment accuracy. I examine
this proposal in experiments and compare the results with those of the baseline method to
demonstrate the advantages of my approach.
1.4.2. Contributions
My main contributions are as follows:
Evaluating methods in sentence alignment and introducing an algorithm that
improves Moore's method.
Using a new feature, word clustering, which helps to improve alignment
accuracy. This contributes a complementary strategy for the sentence
alignment problem.
1.4.3. Outline
The rest of the thesis is organized as follows:
Chapter 2 – Related Works
In this chapter I introduce some recent research on sentence alignment. In order to
give a general view of the methods proposed to deal with this problem, an overall
presentation of sentence alignment methods is provided. The methods are classified into
several types, and each method is presented by describing its algorithm along with
evaluations of it.
Chapter 3 – Our Approach
This chapter describes the method we propose to improve Moore's sentence
alignment method. First, an analysis and evaluation of Moore's method are presented.
The major content of this chapter is the framework of the proposed method, an algorithm
using bilingual word clustering. An example is given to illustrate the approach clearly.
Chapter 4 – Experiments
This chapter presents the experiments performed with our approach. The data corpora
used in the experiments are described in full. The results of the experiments, as well as
discussions of them, are clearly presented in order to evaluate our approach against the
baseline method.
Chapter 5 – Conclusions and Future Work
In this last chapter, the advantages and limitations of my work are summarized in a
general conclusion. In addition, some research directions for improving the current
model in the future are mentioned.
Finally, the references list the published research that this work draws on.
1.5. Summary
This chapter has introduced my research work. I have given background information
about parallel corpora and sentence alignment, along with definitions and some initial
problems related to sentence alignment algorithms. The alignment terms used in this task
have been defined, and an outline of the research work in this thesis has been provided,
together with a brief note on the work proposed for the future.
CHAPTER TWO
Related Works
2.1. Overview
This chapter introduces some research on sentence alignment from recent years,
together with evaluations of these approaches. A number of problems related to this task
are also discussed: factors that affect the performance of alignment algorithms, searching,
and the resources each method requires. Evaluations of the algorithms are given to
provide a general view of the advantages as well as the weaknesses of each.
Section 2.2 provides an overview of sentence alignment approaches. Section 2.3
discusses some important problems in this task. Section 2.4 introduces and evaluates the
primary length-based methods. Section 2.5 introduces and evaluates word-
correspondence-based proposals. Hybrid proposals, along with evaluations of each, are
presented in Section 2.6. Some other notable approaches to this task are introduced in
Section 2.7. Section 2.8 concludes this chapter.
2.2. Overview of Approaches
2.2.1. Classification
Since the first approaches proposed in the 1990s, a number of publications on
sentence alignment with different techniques have been reported.
Among the various sentence alignment algorithms that have been proposed, there are
three widespread approaches, based respectively on comparison of sentence lengths, on
lexical correspondence, and on a combination of these first two.
There are also some other techniques, such as methods based on the BLEU score,
support vector machines, and hidden Markov model classifiers.
2.2.2. Length-based Methods
Length-based approaches model the relationship between the lengths of sentences
that are mutual translations, where length is measured in characters or words of a
sentence. These approaches do not consider the semantics of the text; statistical methods
are used instead of the content of the texts. In other words, these methods consider only
the lengths of sentences when making alignment decisions.
These methods are based on the fact that longer sentences in one language tend to be
translated into longer sentences in the other language, and shorter sentences into shorter
ones. A probabilistic score is assigned to each proposed correspondence of sentences,
based on the scaled difference of the lengths of the two sentences (in characters) and the
variance of this difference. The two random variables l1 and l2 are the lengths of the two
sentences under consideration, and it is assumed that these random variables are
independent and identically distributed with a normal distribution [8].
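The standardized length difference at the heart of this model can be sketched as follows; the constants c and s² are the values commonly cited for Gale and Church's character-based model, used here only as assumptions:

```python
import math

C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance parameter of the length difference

def delta(l1, l2, c=C, s2=S2):
    """Standardized length difference delta = (l2 - c*l1) / sqrt(l1*s2),
    which the model treats as (approximately) normally distributed."""
    if l1 == 0:
        l1 = 1  # guard against division by zero for empty segments
    return (l2 - c * l1) / math.sqrt(l1 * s2)

# Sentences of similar character length give a delta near zero,
# which the aligner interprets as a likely 1-to-1 match.
d = delta(100, 103)
```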
Given two parallel texts S (the source text) and T (the target text), the goal of this task
is to find the alignment A with the highest probability:

A* = argmax_A Pr(A | S, T)

In order to estimate this probability, the aligned text is decomposed into a sequence of
aligned sentence beads, where each bead is assumed to be independent of the others.
Algorithms of this type were first proposed by Brown et al., 1991 and Gale and
Church, 1993. These approaches use sentence-length statistics to model the relationship
between groups of sentences that are translations of each other. Wu (Wu, 1994) also uses
the length-based method, applying the algorithm proposed by Gale and Church, and
further uses lexical cues from a corpus-specific bilingual lexicon to improve alignment.
The methods of this type are based solely on the lengths of sentences, so they require
almost no prior knowledge. Furthermore, these methods are highly accurate despite their
simplicity, and they run at high speed. When aligning texts whose languages are similar
or have a high length correlation, such as English, French, and German, these approaches
are especially useful and work remarkably well. They also perform fairly well when the
input text is clean, as in the Canadian Hansards corpus [3]. The Gale and Church
algorithm is still widely used today, for instance to align Europarl (Koehn, 2005).
Nevertheless, these methods are not robust, since they use only sentence length
information, and they are no longer reliable if there is too much noise in the input
bilingual texts. As shown in (Chen, 1993) [5], the accuracy of sentence-length-based
methods decreases drastically when aligning texts containing small deletions or free
translations; they can easily misalign small passages because they ignore word identities.
The algorithm of Brown et al. requires corpus-dependent anchor points, while the method
proposed by Gale and Church depends on a prior alignment of paragraphs to constrain the
search. When aligning texts where the length correlation breaks down, such as the
Chinese-English language pair, the performance of length-based algorithms declines
quickly.
2.2.3. Word Correspondences Methods
The second approach, which tries to overcome the disadvantages of length-based
approaches, is the word-based method, which relies on lexical information from
translation lexicons and/or on the recognition of cognates. These methods take the lexical
information of the texts into account. Most algorithms match content words in one text
with their correspondences in the other text and use these matches as anchor points in the
sentence alignment task; words that are translations of each other tend to have similar
distributions in the source-language and target-language texts. Meanwhile, some methods
use cognates (words in language pairs that resemble each other phonetically) rather than
the content of word pairs to determine sentence beads.
This type of sentence alignment method is illustrated by some notable approaches
such as Kay and Röscheisen, 1993 [11], Chen, 1993 [5], Melamed, 1996 [14], and Ma,
2006 [13]. Kay's work has not proved efficient enough to be suitable for large corpora,
while Chen constructs a word-to-word translation model during alignment to assess the
probability of an alignment. Word correspondence was further developed in IBM
Model-1 (Brown et al., 1993) for statistical machine translation. Meanwhile, word
correspondence of another kind (geometric correspondence) was proposed for sentence
alignment by Melamed, 1996.
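As a sketch of how a word-to-word model such as IBM Model-1 can score a candidate sentence pair, consider the simplified log-probability below; the toy translation table t_prob is invented purely for illustration:

```python
import math

def model1_log_prob(src_words, tgt_words, t_prob, floor=1e-9):
    """Simplified IBM Model-1 score, log P(target | source): every target
    word is generated by averaging translation probabilities over the
    source words plus an implicit NULL word (length terms omitted)."""
    src = ["NULL"] + list(src_words)
    log_p = 0.0
    for t in tgt_words:
        p = sum(t_prob.get((s, t), 0.0) for s in src) / len(src)
        log_p += math.log(max(p, floor))
    return log_p

# Toy table: (source word, target word) -> translation probability.
t_prob = {("house", "maison"): 0.8, ("the", "la"): 0.4}
good = model1_log_prob(["the", "house"], ["la", "maison"], t_prob)
bad = model1_log_prob(["the", "house"], ["chien", "vert"], t_prob)
# `good` is far less negative than `bad`, so the true translation
# pair would be preferred during alignment.
```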
These algorithms have higher accuracy than length-based methods. Because they use
lexical information from source texts and translation lexicons, rather than sentence length
alone, to determine the translation relationship between sentences in the source and target
texts, they are usually more robust than length-based algorithms.
Nevertheless, lexicon-based algorithms are slower than length-based ones because
they require considerably more expensive computation. In addition, they usually depend
on cognates or a bilingual lexicon: the method of Chen requires an initial bilingual
lexicon, while the proposal of Melamed depends on finding cognates in the two languages
to suggest word correspondences.
2.2.4. Hybrid Methods
Sentence length and lexical information have also been combined so that the different
approaches can complement each other and yield more efficient algorithms.
Such approaches are proposed in Moore, 2002; Varga et al., 2005; and Braune and
Fraser, 2010. These approaches have two passes, in which a length-based method
produces a first alignment that subsequently serves as training data for a translation
model, which is then used in a more complex similarity score. Moore, 2002 proposes a
two-phase method that combines sentence length (word count) in the first pass and word
correspondences (IBM Model-1) in the second. Varga et al. (2005) also use the
hybrid technique, combining sentence length with word correspondences (using a
dictionary-based translation model in which the dictionary can be manually expanded).
Braune and Fraser, 2010 propose an algorithm similar to Moore's, except that it includes
a technique for building 1-to-many and many-to-1 alignments rather than focusing only
on 1-to-1 alignments as Moore's method does.
The hybrid approaches achieve relatively high performance, overcoming the limits of
the first two kinds of methods while combining their advantages. The approach of Moore,
2002 obtains high precision (the fraction of retrieved sentence pairs that are in fact
correct) and computational efficiency. Meanwhile, the algorithm proposed by Varga et
al., 2005, which follows the same idea as Moore, 2002, attains a very high recall rate (the
fraction of correct sentence pairs that are retrieved by the algorithm).
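The precision and recall figures discussed here, and the F-measure reported later in the thesis, can be computed from an aligner's output and a gold standard as follows (a generic sketch, not the thesis's own evaluation script):

```python
def evaluate(proposed, gold):
    """Precision, recall and F-measure of proposed alignment pairs
    against gold-standard pairs (both given as sets of index tuples)."""
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Two of the three proposed pairs are correct, so P = R = F = 2/3 here.
p, r, f = evaluate({(0, 0), (1, 1), (2, 3)}, {(0, 0), (1, 1), (2, 2)})
```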
Nonetheless, there are still weaknesses to be addressed in order to obtain a more
efficient sentence alignment algorithm. In Moore's method, the recall rate is rather low,
which is especially problematic when aligning parallel corpora with much noise or sparse
data. The approach of Varga et al., 2005, meanwhile, achieves very high recall but still
has a rather low precision rate.
2.3. Some Important Problems
2.3.1. Noise of Texts
Because texts are often extracted from other formats such as web pages, several
issues arise. In actual corpora, sentences may not be translations at all, or may not even
be part of the actual text, and therefore introduce noise. Among other problems, the
translation of a text can range from a free recreation to a fairly literal rendering, with a
whole spectrum between these two extremes. In addition, sentences and/or paragraphs
may be added or dropped. Sentences can also be split or merged; thus, the source and
target language corpora may have different sizes.
2.3.2. Linguistic Distances
Another parameter that can affect the performance of sentence alignment algorithms
is the linguistic distance between the source and target languages, that is, the extent to
which the languages differ from each other. For example, English is linguistically
"closer" to Western European languages (such as French and German) than to East Asian
languages (such as Korean and Japanese). There are measures for assessing linguistic
distance, such as the number of cognate words and syntactic features. It is important to
recognize that some algorithms may not perform well if they rely on closeness between
languages that are in fact distant. An obvious example is that a cognate-based method is
likely to work better for English-French or English-German than for English-Hindi
because there are fewer cognates for English-Hindi: Hindi belongs to the Indo-Aryan
branch, whereas English and German belong to the Germanic one.
2.3.3. Searching
Dynamic programming is the technique most sentence alignment tools use to search
for the best path of sentence pairs through a parallel text. This also means that the texts
are ordered monotonically, and none of these algorithms is able to extract sentence pairs
in crossing positions. Nevertheless, most of these programs benefit from this technique,
and none of them reports weaknesses with it, because translations characteristically
preserve sentence order: almost all sentences appear in the same order in both the source
and target texts.
In this respect, algorithms may also be confronted with the problem of the size of the
search space. Thus, pruning strategies that restrict the search space are another issue that
algorithms have to resolve.
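As an illustration of one common pruning strategy (a sketch, not taken from any specific aligner discussed here), the search can be restricted to a diagonal band around the expected target position of each source sentence:

```python
def band_candidates(n_src: int, n_tgt: int, width: int = 10):
    """For each source index i, yield the window of target indices
    the aligner is allowed to pair it with: a diagonal band of the
    given half-width around the expected position i * n_tgt / n_src."""
    for i in range(n_src):
        center = round(i * n_tgt / max(n_src, 1))
        lo = max(0, center - width)
        hi = min(n_tgt, center + width + 1)
        yield i, range(lo, hi)

# Instead of filling n_src * n_tgt cells, the dynamic programming
# table now has roughly n_src * (2 * width + 1) cells.
pruned = sum(len(r) for _, r in band_candidates(1000, 1100, width=10))
print(pruned, "cells instead of", 1000 * 1100)
```

The trade-off is the usual one for pruning: a narrower band is faster but risks cutting off the true path when the texts contain large insertions or deletions.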
2.3.4. Resources
All systems learn their respective models from the parallel text itself. Only a few
algorithms support the use of external resources, such as Hunalign (Varga et
al., 2005) with bilingual dictionaries and Bleualign (Sennrich and Volk, 2010) with
existing MT systems.
2.4. Length-based Proposals
2.4.1. Brown et al., 1991
This algorithm uses a statistical technique for aligning sentences based solely on
the number of words in each sentence; the actual identities of the words are ignored.
It relies on the general idea that the closer two sentences are in length, the more likely
they are to align. Brown et al. use the number of tokens in each sentence to score
alignments, and certain anchor points available in the data to restrict the alignment
search.
To search for the best alignment, Brown et al. use dynamic programming.
This technique requires time quadratic in the length of the texts being aligned, so it is
not practical to align a large corpus as a single unit. The computation can be
reduced dramatically if the bilingual corpus is subdivided into smaller chunks. In this
algorithm, the subdivision is performed using anchors: an anchor is a piece of text
likely to appear at the same location in both halves of the bilingual
corpus. Dynamic programming is first used to align the anchors, and then it is
applied again to align the text between anchors.
The alignment computation of this algorithm is fast since it makes no use of the
lexical details of the sentences. It is therefore practical to apply this method to very large
collections of text, especially for highly correlated language pairs.
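A minimal sketch of a word-count length score in this spirit; the Gaussian model of the log length ratio and the parameter values below are illustrative assumptions, not the exact model of Brown et al.:

```python
import math

# Hypothetical parameters: mean and standard deviation of
# log(target_words / source_words), as might be estimated from a corpus.
MU, SIGMA = 0.0, 0.4

def length_log_prob(src_words: int, tgt_words: int) -> float:
    """Log probability that two sentences align, judged only by their
    token counts, under a Gaussian model of the log length ratio."""
    r = math.log(tgt_words / src_words)
    z = (r - MU) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))

# Sentences of similar length score much higher than very different ones.
print(length_log_prob(20, 21) > length_log_prob(20, 40))  # → True
```

Because the score depends only on two integers per sentence pair, evaluating a candidate bead is essentially free, which is what makes length-based methods practical for very large corpora.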
2.4.2. Vanilla: Gale and Church, 1993
This algorithm performs sentence alignment based on a statistical model of sentence
lengths measured by characters. It uses the fact that longer sentences in one language tend
to be translated into longer sentences in another language.
This algorithm is similar to the proposal of Brown et al., except that Brown et al.
measure sentence length by the number of words whereas Gale and Church measure it by
the number of characters. In addition, Brown et al. align only a subset of the corpus for
further research instead of focusing on entire articles; the work of Gale and Church
(1991) supports the promise of wider applicability.
This sentence alignment program has two steps: first, paragraphs are aligned, and then
sentences within each paragraph are aligned. Gale and Church report that paragraph
lengths in the two languages are highly correlated. Figure 2.1 illustrates this correlation
for the English-German language pair.
Figure 2.1. Paragraph length (Gale and Church, 1993).
A probabilistic score is assigned to each proposed correspondence of sentences, based
on the scaled difference of the lengths of the two sentences and the variance of this
difference. This score is used in a dynamic programming framework to find the maximum
likelihood alignment of sentences. The use of dynamic programming allows the system to
consider all possible alignments and find the minimum-cost alignment efficiently.
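A minimal sketch of such a length-based score, using the parameter values commonly cited for Gale and Church (roughly one target character per source character on average, variance s² ≈ 6.8); the `len1 + 1` guard against empty chunks is an added simplification, not part of the original model:

```python
import math

def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# c: expected target characters per source character; S2: its variance.
C, S2 = 1.0, 6.8

def match_cost(len1: int, len2: int) -> float:
    """-log probability that two chunks of len1 and len2 characters
    are mutual translations, judged by length alone."""
    if len1 == 0 and len2 == 0:
        return 0.0
    delta = (len2 - len1 * C) / math.sqrt((len1 + 1) * S2)
    # Two-sided tail probability of the standardized length difference.
    p = max(2 * (1 - norm_cdf(abs(delta))), 1e-12)
    return -math.log(p)

# Similar lengths are cheap; very different lengths are expensive.
print(match_cost(100, 104) < match_cost(100, 180))  # → True
```

In the full model this cost is further combined with a prior over the match category (1-1, 1-0, 2-1, and so on) before it enters the dynamic programming search.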
A distance function d is defined in a general way to allow for insertions, deletions,
substitutions, etc. The function takes four arguments: x1, y1, x2, y2. Let
d(x1, y1; 0, 0) be the cost of substituting x1 with y1,
d(x1, 0; 0, 0) be the cost of deleting x1,
d(0, y1; 0, 0) be the cost of inserting y1,
d(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
d(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
d(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching them with y1 and y2.
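The dynamic programming recursion over these six cost categories can be sketched as follows; here `d(x1, y1, x2, y2)` stands in for the distance function above (taking chunk lengths in characters, with 0 meaning an absent chunk), and the toy cost in the example is purely illustrative:

```python
def align_cost(src_lens, tgt_lens, d):
    """Minimum total cost of aligning the source sentences (given by
    their lengths) to the target sentences, allowing 1-1, 1-0, 0-1,
    2-1, 1-2 and 2-2 beads."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i >= 1 and j >= 1:    # 1-1 substitution
                best = min(best, D[i-1][j-1] + d(src_lens[i-1], tgt_lens[j-1], 0, 0))
            if i >= 1:               # 1-0 deletion
                best = min(best, D[i-1][j] + d(src_lens[i-1], 0, 0, 0))
            if j >= 1:               # 0-1 insertion
                best = min(best, D[i][j-1] + d(0, tgt_lens[j-1], 0, 0))
            if i >= 2 and j >= 1:    # 2-1 contraction
                best = min(best, D[i-2][j-1] + d(src_lens[i-2], tgt_lens[j-1], src_lens[i-1], 0))
            if i >= 1 and j >= 2:    # 1-2 expansion
                best = min(best, D[i-1][j-2] + d(src_lens[i-1], tgt_lens[j-2], 0, tgt_lens[j-1]))
            if i >= 2 and j >= 2:    # 2-2 merge
                best = min(best, D[i-2][j-2] + d(src_lens[i-2], tgt_lens[j-2], src_lens[i-1], tgt_lens[j-1]))
            D[i][j] = best
    return D[n][m]

# Toy cost: absolute difference of the total chunk lengths on each side.
toy_d = lambda x1, y1, x2, y2: abs((x1 + x2) - (y1 + y2))
print(align_cost([10, 12], [11, 11], toy_d))  # → 0.0 (the 2-2 bead matches exactly)
```

A real aligner would also record which of the six cases achieved the minimum in each cell, so that the best bead sequence can be recovered by backtracking from D[n][m].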
The dynamic programming algorithm is summarized in the following recursion
equation.