
Enhancing the quality of Machine
Translation System Using Cross-Lingual
Word Embedding Models

Nguyen Minh Thuan
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Associate Professor Nguyen Phuong Thai

A thesis submitted in fulfillment of the requirements
for the degree of
Master of Science in Computer Science
November 2018




ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge
it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree
or diploma at University of Engineering and Technology (UET/Coltech) or any other
educational institution, except where due acknowledgement is made in the thesis. Any
contribution made to the research by others, with whom I have worked at UET/Coltech
or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual
content of this thesis is the product of my own work, except to the extent that assistance
from others in the project’s design and conception or in style, presentation and linguistic
expression is acknowledged.’


Hanoi, November 15th, 2018
Signed ........................................................................




ABSTRACT
In recent years, Machine Translation has shown promising results and received much
interest from researchers. Two approaches that have been widely used for machine translation are Phrase-based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). During translation, both approaches rely heavily on large
amounts of bilingual corpora, which require much effort and financial support. The
lack of bilingual data leads to a poor phrase-table, which is one of the main components of PBSMT, and to the unknown word problem in NMT. In contrast, monolingual
data are available for most languages. Thanks to this advantage, many word embedding and cross-lingual word embedding models have appeared to improve
the quality of various tasks in natural language processing. The purpose of this thesis
is to propose two models that use cross-lingual word embedding models to address
the above impediments. The first model enhances the quality of the phrase-table in
SMT, and the second model tackles the unknown word problem in NMT.
Publications:
Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen and Chi-Mai Luong.
Enhancing the quality of Phrase-table in Statistical Machine Translation for Less-Common and
Low-Resource Languages. In the 2018 International Conference on Asian Language Processing
(IALP 2018).



ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my lecturers at the university, and
especially to my supervisors - Assoc. Prof. Nguyen Phuong Thai, Dr. Nguyen Van
Vinh and MSc. Vu Huy Hien. They are my inspiration, guiding me to overcome
many obstacles in the completion of this thesis.
I am grateful to my family. They always encourage, motivate and create the
best conditions for me to accomplish this thesis.
I would also like to thank my brother, Nguyen Minh Thong, and my friends,
Tran Minh Luyen and Hoang Cong Tuan Anh, for giving me much useful advice and
for supporting my thesis, my studies and my life.
Finally, I sincerely acknowledge the Vietnam National University, Hanoi and
especially the TC.02-2016-03 project, named “Building a machine translation system
to support translation of documents between Vietnamese and Japanese to help
managers and businesses in Hanoi approach the Japanese market”, for financially
supporting my master's study.


To my family ♥



Table of Contents
1 Introduction . . . 1

2 Literature review . . . 4
2.1 Machine Translation . . . 4
2.1.1 History . . . 4
2.1.2 Approaches . . . 5
2.1.3 Evaluation . . . 7
2.1.4 Open-Source Machine Translation . . . 8
2.1.4.1 Moses - an Open Statistical Machine Translation System . . . 9
2.1.4.2 OpenNMT - an Open Neural Machine Translation System . . . 10
2.2 Word Embedding . . . 11
2.2.1 Monolingual Word Embedding Models . . . 12
2.2.2 Cross-Lingual Word Embedding Models . . . 13

3 Using Cross-Lingual Word Embedding Models for Machine Translation Systems . . . 17
3.1 Enhancing the quality of Phrase-table in SMT Using Cross-Lingual Word Embedding . . . 17
3.1.1 Recomputing Phrase-table weights . . . 18
3.1.2 Generating new phrase pairs . . . 19
3.2 Addressing the Unknown Word Problem in NMT Using Cross-Lingual Word Embedding Models . . . 21

4 Experiments and Results . . . 27
4.1 Settings . . . 27
4.2 Results . . . 31
4.2.1 Word Translation Task . . . 31
4.2.2 Impact of Enriching the Phrase-table on SMT system . . . 32
4.2.3 Impact of Removing the Unknown Words on NMT system . . . 35

5 Conclusion . . . 38


List of Figures
2.1 The CBOW model predicts the current word based on the context, and the Skip-gram predicts surrounding words based on the current word. . . . 13
2.2 Toy illustration of the cross-lingual embedding model. . . . 14
3.1 Flow of the training phase. . . . 22
3.2 Flow of the testing phase. . . . 23
3.3 Example in the testing phase. . . . 25



List of Tables
3.1 The sample of new phrase pairs generated by using projections of word vector representations . . . 21
4.1 Monolingual corpora . . . 28
4.2 Bilingual corpora . . . 28
4.3 Bilingual dictionaries . . . 29
4.4 The precision of word translation retrieval top-k nearest neighbors in Vietnamese-English and Japanese-Vietnamese language pairs . . . 32
4.5 Results on UET and TED dataset in the PBSMT system for Vietnamese-English and Japanese-Vietnamese respectively . . . 33
4.6 Translation examples of the PBSMT in Vietnamese-English . . . 34
4.7 Results of removing unknown words on UET and TED dataset in the NMT system for Vietnamese-English and Japanese-Vietnamese respectively . . . 35
4.8 Translation examples of the NMT system in Vietnamese-English . . . 37


List of Abbreviations
MT      Machine Translation
SMT     Statistical Machine Translation
PBSMT   Phrase-based Statistical Machine Translation
NMT     Neural Machine Translation
NLP     Natural Language Processing
RNN     Recurrent Neural Network
CNN     Convolutional Neural Network
UNMT    Unsupervised Neural Machine Translation


Chapter 1
Introduction
Machine Translation (MT) is a sub-field of computational linguistics. It is automated translation, which translates text or speech from one natural language to
another by using computer software. Nowadays, machine translation systems attain
much success in practice, and two approaches that have been widely used for MT are
Phrase-based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). In the PBSMT system, the core component is the phrase-table,
which contains the words and phrases the system uses to translate. In the translation
process, sentences are split into distinct parts as shown in (Koehn et al., 2007)
(Koehn, 2010). At each step, for a given source phrase, the system tries to find
the best candidate among many target phrases as its translation, based mainly on the
phrase-table. Hence, a good phrase-table can improve the quality of translation.
However, attaining a rich phrase-table is a challenge since the phrase-table is extracted and trained from large amounts of bilingual
corpora, which require much effort and financial support, especially for less-common
languages such as Vietnamese, Laos, etc. In the NMT system, the two main components
are the encoder and the decoder. The encoder component uses a neural network, such as a
recurrent neural network (RNN), to encode the source sentence, and the decoder
component also uses a neural network to predict words in the target language. Some
NMT models incorporate attention mechanisms to improve the translation quality.
To reduce the computational complexity, conventional NMT systems often limit
their vocabularies to the top 30K-80K most frequent words in the source and
target language, and all words outside the vocabulary, called unknown words, are
replaced with a single unk symbol. This approach leads to the inability to generate
the proper translation for these unknown words during testing, as shown in (Luong
et al., 2015b) (Li et al., 2016).
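To make the vocabulary limitation concrete, the short Python sketch below keeps only the most frequent words of a corpus and maps every out-of-vocabulary token to a single unk symbol; the corpus, the vocabulary size and the <unk> token are hypothetical choices, used only to mirror the preprocessing that gives rise to the unknown word problem.

```python
from collections import Counter

def build_vocab(sentences, k=30000):
    """Keep the k most frequent tokens; everything else will be unknown."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {w for w, _ in counts.most_common(k)}

def replace_unknowns(sentence, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with a single unk symbol."""
    return " ".join(tok if tok in vocab else unk for tok in sentence.split())

# Toy usage with a hypothetical corpus and vocabulary size.
corpus = ["tôi yêu hà nội", "tôi thích dịch máy", "tôi yêu dịch thuật"]
vocab = build_vocab(corpus, k=5)
print(replace_unknowns("tôi yêu dịch máy mới", vocab))  # rare words become <unk>
```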
Recently, several approaches have been proposed to address the above impediments. For
the problem in the PBSMT system, (Passban et al., 2016) proposed a method that
uses new scores generated by a Convolutional Neural Network to indicate the semantic relatedness of phrase pairs. They attained an improvement of approximately
0.55 BLEU score. However, their method is suitable for medium-size corpora and
adds more scores to the phrase-table, which can increase the computational complexity
of all translation systems.
(Cui et al., 2013) utilized techniques of pivot languages to enrich their phrase-table.
Their phrase-table is built from source-pivot and pivot-target phrase-tables. As a
result of this combination, they attained a significant improvement in translation.
Similarly, (Zhu et al., 2014) used a method based on pivot languages to calculate the
translation probabilities of source-target phrase pairs and achieved a slight enhancement. Unfortunately, the methods based on pivot languages cannot be applied
to the Vietnamese language because of the less-common nature of this language.
(Vogel and Monson, 2004) improved the translation quality by using phrase pairs
from an augmented dictionary. They first augmented the dictionary using simple
morphological variations and then assigned probabilities to entries of this dictionary
by using co-occurrence frequencies collected from bilingual data. However, their
method needs a lot of bilingual corpora to accurately estimate the probabilities for
dictionary entries, which are not available for low-resource languages.
To address the unknown word problem in the NMT system, (Luong et al.,
2015b) annotated the training bilingual corpus with explicit alignment information
that allows the NMT system to emit, for each unknown word in the target sentence,
the position of its corresponding word in the source sentence. This information is
then used in a post-processing step to translate every unknown word by using a
bilingual dictionary. The method showed a substantial improvement of up to 2.8
BLEU points over various NMT systems on the WMT'14 English-French translation
task. However, obtaining a good dictionary for the post-processing
step is also costly and time-consuming.
(Sennrich et al., 2016) introduced a simple approach to handle the translation of
unknown words in NMT by encoding unknown words as a sequence of subword units.
This method is based on the intuition that a variety of word classes are translated via
units smaller than words. For example, names are translated by character copying or
transliteration, compounds are translated via compositional translation, etc. The
approach showed an improvement of up to 1.3 BLEU over a back-off dictionary
baseline model on the WMT'15 English-Russian translation task.
(Li et al., 2016) proposed a novel substitution-translation-restoration method to
tackle the unknown word problem in NMT. In this method, the substitution
step replaces the unknown words in a testing sentence with similar in-vocabulary
words based on a similarity model learned from monolingual data. The translation
step then translates the testing sentence with a model trained on bilingual data in which
the unknown words have been replaced. Finally, the restoration step substitutes the translations of
the replaced words with those of the original ones. This method demonstrated a significant
improvement of up to 4 BLEU points over the attention-based NMT on Chinese-to-English translation.
Recently, techniques using word embedding have received much interest from the natural
language processing community. Word embedding is a vector representation of
words which preserves semantic information about words and their contexts. Additionally,
we can exploit the advantage of embeddings to represent words in different vector
spaces, as shown in (Mikolov et al., 2013b). Besides, cross-lingual word embedding
models, which learn cross-lingual representations of words in a joint embedding space
to represent meaning and transfer knowledge in cross-lingual scenarios, are also
receiving a lot of interest. Inspired by the advantages of cross-lingual embedding
models and the work of (Mikolov et al., 2013b) and (Li et al., 2016), we propose a
model to enhance the quality of a phrase-table by recomputing the phrase weights
and generating new phrase pairs for the phrase-table, and a model to address the
unknown word problem in the NMT system by replacing the unknown words with
the most appropriate in-vocabulary words.
The rest of this thesis is organized as follows: Chapter 2 gives an overview of
related background. In Chapter 3, we describe our two proposed models: one model
enhances the quality of the phrase-table in SMT, and the other tackles the
unknown word problem in NMT. The settings and results of our experiments are shown
in Chapter 4. We present our conclusions and future work in Chapter 5.


Chapter 2
Literature review
In this chapter, we give an overview of Machine Translation (MT) research and
Word Embedding models in Sections 2.1 and 2.2 respectively. Section 2.1 covers the
history, approaches, evaluation and open-source toolkits in MT. In Section 2.2, we introduce
an overview of Word Embedding, including Monolingual and Cross-Lingual Word
Embedding models.

2.1 Machine Translation

2.1.1 History

Machine Translation is a sub-field of computational linguistics. It is automated
translation, which translates text or speech from one natural language to another
by using computer software. The first ideas of machine translation may have appeared
in the seventeenth century, when Descartes and Leibniz proposed theories of how to
create dictionaries by using universal numerical codes.
In the mid-1930s, Georges Artsrouni attempted to build “translation machines” by
using paper tape to create an automatic dictionary. After that, Peter Troyanskii
proposed a model including a bilingual dictionary and a method for handling grammatical
issues between languages based on Esperanto's grammatical system.
On January 7th, 1954, at the head office of IBM in New York, the first machine
translation system was demonstrated in the Georgetown-IBM experiment. It automatically
translated 60 sentences from Russian to English for the first time and opened a race
for machine translation in many countries, such as Canada, Germany, and Japan.
However, in 1966, the Automatic Language Processing Advisory Committee (ALPAC)
reported that the ten-year-long research had failed to fulfill expectations, as discussed
in (Vogel et al., 1996). During the 1980s, a lot of activities in MT were carried out, especially
in Japan. At that time, research in MT typically depended on translation through a
variety of intermediary linguistic representations, including syntactic, morphological,
and semantic analysis. At the end of the 1980s, as computational power increased
and became less expensive, more research was attempted in the statistical approach
to MT.
During the 2000s, research in MT saw major changes. A lot of research
focused on example-based machine translation and statistical machine translation
(SMT). Besides, researchers also showed more interest in hybridization, combining
morphological and syntactic knowledge into statistical systems, as well as combining
statistics with existing rule-based systems. Recently, the hot trend in MT has been the use of
large artificial neural networks, called Neural Machine Translation (NMT).
In 2014, (Cho et al., 2014) published the first paper on using neural networks in MT,
followed by a lot of research in the following few years. Apart from the research
on bilingual machine translation systems, in 2018, researchers paid much attention
to unsupervised neural machine translation (UNMT), which uses only monolingual
data to train the MT system.


2.1.2 Approaches

In this section, we present typical approaches to MT based on linguistic rules,
statistics, and neural networks.
Rule-based
Rule-based Machine Translation (RBMT) is the earliest approach to MT. It relies on linguistic information about the source and target languages, such as morphological and syntactic rules and semantic analysis. The basic approach involves parsing
and analyzing the structure of the source sentence and then converting it into the
target language based on a manually determined set of rules created by linguistic
experts. The key advantage of RBMT is that this approach can translate a wide
range of texts without requiring a bilingual corpus. However, creating rules for an
RBMT system is costly and time-consuming. Additionally, when translating real
texts, the rules are unable to cover all possible linguistic phenomena and they can
conflict with each other. Therefore, RBMT has mostly been replaced by SMT or
hybrid systems.



Statistical
A Statistical Machine Translation (SMT) system uses statistical models to generate
translations based on bilingual and monolingual corpora. The basic idea of SMT
comes from information theory. A sentence f in the source language is translated into
a sentence e in the target language based on the probability distribution p(e|f).
A simple way to model the probability distribution p(e|f) is to apply Bayes' theorem:

$$p(e|f) \propto p(f|e)\,p(e)$$

where p(f|e) is the translation model, which estimates the probability of the source
sentence f given the target sentence e, and p(e) is the language model, which is the
probability of seeing sentence e in the target language. Therefore, finding the best
translation ê is performed by maximizing the product p(f|e)p(e):
$$\hat{e} = \arg\max_{e \in e^*} p(e|f) = \arg\max_{e \in e^*} p(f|e)\,p(e)$$

In order to perform the search efficiently in the huge search space e*, machine
translation decoders trade off quality and time usage by using the foreign string,
heuristics and other methods to limit the search space. Some efficient search
algorithms currently used in decoders are Viterbi beam search, A* stack decoding,
graph models, etc. SMT has been used as the core of systems such as Google Translate
and Bing Translator.
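As a minimal illustration of the decision rule above, the sketch below scores each candidate translation with the sum of hypothetical translation-model and language-model log-probabilities and keeps the argmax; the candidate strings and their scores are toy values, not the output of a real decoder.

```python
import math

def best_translation(candidates, tm_logprob, lm_logprob):
    """Pick argmax_e [ log p(f|e) + log p(e) ] over a finite candidate list."""
    return max(candidates, key=lambda e: tm_logprob(e) + lm_logprob(e))

# Toy log-probabilities standing in for the translation and language models.
tm = {"i love hanoi": math.log(0.6), "me love hanoi": math.log(0.7)}
lm = {"i love hanoi": math.log(0.05), "me love hanoi": math.log(0.001)}

print(best_translation(list(tm), tm.get, lm.get))  # -> "i love hanoi"
```

Even though the second candidate has a higher translation-model score, the language model penalizes its disfluency, which is exactly the role the noisy-channel factorization assigns to p(e).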
Example-based
In an Example-based Machine Translation (EBMT) system, a sentence is translated
by using the idea of analogy. In this approach, the corpus used is a large set of
existing translation pairs of source and target sentences. Given a new source sentence
to be translated, the corpus is searched to select the sentences that contain
similar sub-sentential parts. The similar sentences are then used to translate the
sub-sentential parts of the original source sentence into the target language, and
these parts are put together to generate a complete translation.
Neural Machine Translation
Neural Machine Translation (NMT) is the newest approach to MT and is based on
machine learning. This approach uses a large artificial neural network to
predict the likelihood of a sequence of words, typically encoding whole sentences in
a single integrated model. The structure of NMT models is simpler than that
of SMT models; NMT uses vector representations (“embeddings”, “continuous space
representations”) for words and internal states. NMT contains a single sequence
model that predicts one word at a time; there is no separate translation model,
language model, or reordering model. The first NMT models use a recurrent
neural network (RNN): a bidirectional RNN, known as the encoder, encodes
the source sentence, and a second RNN, known as the decoder, predicts words
in the target language. NMT systems can continuously learn and be adjusted to
generate the best output, but they require a lot of computing power. This is why these
models have only been developed strongly in recent years.

2.1.3 Evaluation

Machine Translation evaluation is essential to examine the quality of an MT system
or to compare different MT systems. The simplest method to evaluate MT output is
to use human judges. However, human evaluation is costly and time-consuming, and
thus unsuitable for the frequent evaluation needed when developing and researching an MT system. Therefore,
various automatic methods have been studied to evaluate the quality of translation,
such as Word Error Rate (WER), Position-independent word Error Rate (PER),
the NIST score (Doddington, 2002), the BLEU score (Papineni et al., 2002), etc. In
our work, we use BLEU for automatically evaluating our MT system configurations.
BLEU is a popular method for automatically evaluating MT output that is quick,
inexpensive, and language-independent, as shown in (Papineni et al., 2002). The
basic idea of this method is to compare n-grams of the MT output with n-grams of
the standard translation and count the number of matches. The more matches,
the better the MT output is. The BLEU formulation is as follows:
The BLEU n-gram precisions pn are computed by summing the n-gram matches over
all the candidate sentences in the test corpus:

$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{ngram \in C} \text{Count}_{\text{matched}}(ngram)}{\sum_{C \in \{\text{Candidates}\}} \sum_{ngram \in C} \text{Count}(ngram)} \qquad (2.1)$$

Next, the brevity penalty (BP) is calculated as:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (2.2)$$



where c and r are the lengths of the candidate translation and the standard translation
respectively.
Then, the BLEU score is computed as follows:
$$\mathrm{BLEU} = BP \times \exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr) \qquad (2.3)$$

where N is the maximum order of the n-grams considered for pn and wn are the weights assigned to the n-gram precisions. In the baseline, N = 4 and the weights are uniformly
distributed.
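The sketch below illustrates Equations 2.1-2.3 for a single candidate/reference pair with uniform weights and N = 4; it also applies the usual clipping of n-gram matches against the reference counts. It is only meant to make the computation concrete and is not a replacement for standard BLEU evaluation scripts.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram is only counted as often as it appears in the reference.
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(matched, 1e-9) / total))
    # Brevity penalty, Eq. 2.2, with c = candidate length and r = reference length.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    # Eq. 2.3 with uniform weights w_n = 1/N.
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on the mat", "the cat is on the mat"))  # 1.0 for a perfect match
```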

2.1.4 Open-Source Machine Translation


In order to stimulate the development of the MT research community, a variety of
free and complete toolkits for MT are provided. With the statistical (or data-driven)
approach to MT, we can consider some systems as follows:
• Moses: a complete SMT system.
• UCAM-SMT: the Cambridge SMT system.
• Phrasal: a toolkit for phrase-based SMT.
• Joshua: a decoder for syntax-based SMT.
• Pharaoh: a decoder for IBM Model 4.

Besides, because of the superiority of NMT over SMT, NMT has received much
attention from researchers and companies. The following state-of-the-art NMT systems are totally free and easy to set up:
• OpenNMT: a system designed to be simple to use and easy to extend, developed by Harvard University and SYSTRAN.
• Google-GNMT: a competitive sequence-to-sequence model developed by Google.



• Facebook-fairseq: a system implemented with Convolutional Neural Networks (CNN), which can achieve performance similar to RNN-based NMT while running nine times faster, developed by Facebook AI Research.
• Amazon-Sockeye: a sequence-to-sequence framework based on Apache MXNet, developed by Amazon.
In this part, we introduce the two MT systems used in our work. The first
system is Moses - an open system for SMT, and the second is OpenNMT
- an open system for NMT.
2.1.4.1 Moses - an Open Statistical Machine Translation System

Moses, which was introduced by (Koehn et al., 2007), is a complete open-source
toolkit for statistical machine translation. It can automatically train translation
models for any language pair from a collection of translated sentences (parallel data).
Given the trained model, an efficient search algorithm is used to quickly find the
highest-probability translation among an exponential number of candidates.
There are two main components in Moses: the training pipeline and the decoder. The training pipeline contains a variety of tools which take the parallel
data and train it into a translation model. Firstly, the data needs to be cleaned by
inserting spaces between words and punctuation (tokenisation), removing long and empty
sentences, etc. Secondly, external tools such as GIZA++ in (Och and Ney, 2003) or MGIZA++
are then used for word alignment. These word alignments are then
used to extract phrase translation pairs or hierarchical rules. These phrase pairs or
rules are then scored by using corpus-wide statistics. Finally, the weights of the different
statistical models are tuned to generate the best possible translations; MERT in
(Och, 2003) is used to tune weights in Moses. In the decoding process, Moses uses
the trained translation model to translate the source sentence into the target sentence. To overcome the huge search problem in decoding, Moses implements several
different search algorithms such as stack-based decoding, cube pruning, chart parsing, etc. Besides, an important part of the decoder is the language model, which is
trained from monolingual data in the target language to ensure the fluency of
the output. Moses supports many kinds of language model tools such as KENLM in
(Heafield, 2011), SRILM in (Stolcke, 2002), IRSTLM in (Federico et al., 2008), etc.


Currently, Moses supports several effective translation models such as phrase-based,
hierarchical phrase-based, factored, syntax-based and tree-based models.
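To illustrate the phrase-scoring step of the training pipeline, the sketch below estimates the direct phrase translation probability p(e|f) by relative frequency over a toy list of extracted phrase pairs; the pairs are hypothetical, and real phrase-tables additionally store the inverse probability and lexical weights.

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """Relative-frequency estimate p(e|f) = count(f, e) / count(f)."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(f for f, _ in phrase_pairs)
    return {(f, e): c / src_counts[f] for (f, e), c in pair_counts.items()}

# Hypothetical extracted (source, target) phrase pairs.
pairs = [("dịch máy", "machine translation"),
         ("dịch máy", "machine translation"),
         ("dịch máy", "automatic translation")]
for (f, e), p in phrase_translation_probs(pairs).items():
    print(f"{f} ||| {e} ||| {p:.2f}")   # e.g. dịch máy ||| machine translation ||| 0.67
```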
2.1.4.2 OpenNMT - an Open Neural Machine Translation System

OpenNMT is a full-featured deep learning system specialized in sequence-to-sequence models, supporting many tasks such as machine translation, summarization, image-to-text, etc. It is designed for complete training and deployment of
NMT models. The system was rewritten from seq2seq-attn, developed at Harvard, for ease of readability, efficiency, and generalizability. It contains a variety of
easy-to-reuse modules for state-of-the-art performance such as encoders, decoders,
embedding layers, attention layers, input feeding, regularization, beam search, etc.
Currently, OpenNMT has three main implementations:
• OpenNMT-lua: the original project, developed with LuaTorch, ready for quick experiments and production.
• OpenNMT-py: a clone of OpenNMT-lua which uses the more modern PyTorch; it is easy to extend and especially suited for research.
• OpenNMT-tf: a general-purpose sequence modeling tool in TensorFlow focusing on large-scale experiments and high-performance models.
The structure of the Neural Machine Translation system in OpenNMT is typically implemented as an encoder-decoder architecture (Bahdanau et al., 2014). The
encoder is a recurrent neural network (RNN) or a bidirectional recurrent neural
network that encodes a source sentence x = {x1, ..., xTc} into a sequence of hidden
states h = {h1, ..., hTc}:

$$h_t = f_{enc}(e(x_t), h_{t-1}) \qquad (2.4)$$
where ht is the hidden state at time step t, e(xt ) is the embedding of xt , Tc is
the number of symbols in the source sentence, and the function fenc is the recurrent
unit such as the gated recurrent unit (GRU) or the long short-term memory (LSTM)
unit. The decoder is also a recurrent neural network which is trained to predict the
conditional probability of each symbol yt given its preceding symbols y<t and the context vector ct:

$$P(y_t \mid y_{<t}, x) = g(e(y_{t-1}), r_t, c_t) \qquad (2.5)$$

$$r_t = f_{dec}(e(y_t), r_{t-1}, c_t) \qquad (2.6)$$

where rt is the hidden state of the decoder at time step t, updated by fdec, e(yt) is the embedding of the target symbol yt, and g is a nonlinear function that computes the probability of yt. In each decoding step, the context vector ct is computed as a weighted sum of the source hidden states:

$$c_t = \sum_{i=1}^{T_c} \alpha_i h_i \qquad (2.7)$$

$$\alpha_i = \frac{\exp(\mathrm{score}(r_{t-1}, h_i))}{\sum_{j=1}^{T_c} \exp(\mathrm{score}(r_{t-1}, h_j))} \qquad (2.8)$$

where score is used to compare the target hidden state rt−1 with each of the source hidden states. The score function is defined as follows:

$$\mathrm{score}(r_{t-1}, h_i) = v_e^{T} \tanh(W_r r_{t-1} + W_h h_i) \qquad (2.9)$$

where ve, Wr, Wh are trainable parameters.
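The attention computation of Equations 2.7-2.9 can be made concrete with the NumPy sketch below, where the hidden sizes and the randomly initialized parameters ve, Wr, Wh are purely illustrative.

```python
import numpy as np

def additive_attention(r_prev, H, W_r, W_h, v_e):
    """Eqs. 2.7-2.9: additive scores, softmax weights alpha_i, context vector c_t."""
    scores = np.array([v_e @ np.tanh(W_r @ r_prev + W_h @ h_i) for h_i in H])  # Eq. 2.9
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                        # Eq. 2.8 (softmax over source positions)
    context = (alphas[:, None] * H).sum(axis=0)   # Eq. 2.7 (weighted sum of encoder states)
    return context, alphas

# Illustrative sizes: 5 source positions, hidden size 8.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))        # encoder hidden states h_1..h_Tc
r_prev = rng.normal(size=8)        # previous decoder state r_{t-1}
W_r, W_h, v_e = rng.normal(size=(8, 8)), rng.normal(size=(8, 8)), rng.normal(size=8)
c_t, alpha = additive_attention(r_prev, H, W_r, W_h, v_e)
print(alpha.sum())  # the attention weights sum to 1.0
```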

2.2 Word Embedding


In recent years, techniques using word embedding receive much interest from natural language processing communities. Word embedding is a vector representation
of words which conserves semantic information and their contexts words in (Huang
et al., 2012) (Mikolov et al., 2013a) (Mikolov et al., 2013b). Additionally, we can
exploit the advantage of embedding to represent words in diverse distinction spaces
as shown in (Mikolov et al., 2013b). Besides, applying word embedding to multilingual applications is also receiving a lot of interest. Therefore, learning cross-lingual
embedding models, which learn cross-lingual representations of words in a joint
embedding space, to represent meaning and transfer knowledge in cross-lingual scenarios is necessary. In this section, we introduce models about monolingual and
cross-lingual word embedding.


2.2.1 Monolingual Word Embedding Models

During the 1990s, vector space models were applied to distributional semantics. A variety of models were then developed for estimating continuous representations of words, such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis
(LSA), etc. The term word embeddings was first used by (Bengio et al., 2003),
who learned word representations by using a feed-forward neural network. Recently,
(Mikolov et al., 2013a) proposed new models, known as word2vec, for effectively learning distributed representations of words by using a feed-forward neural network.
They provided two neural network architectures for learning word vectors: Continuous Skip-gram
and Continuous Bag-of-Words (CBOW). In CBOW, a feed-forward neural network
with an input layer, a projection layer, and an output layer is used to predict the
current word based on its context words, as shown in Figure 2.1. In this architecture, the
projection layer is shared among all words, and the input is a window of n future words
and n history words of the current word. All the input words are projected to a
common space, and the current word is then predicted by averaging these input vectors. In contrast to CBOW, the Skip-gram model uses the current word to predict the
surrounding words, as shown in Figure 2.1. The input of this model is a center word,
which is fed into the projection layer, and the output is 2*n vectors for the n history
and n future words. In practice, in the case of limited monolingual data, Skip-gram
yields better word representations than CBOW. However, CBOW is faster and
is suggested for larger datasets.
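A minimal sketch of training both architectures with the gensim implementation of word2vec is shown below; the parameter names follow gensim 4.x (vector_size, window, sg), and the tiny corpus is illustrative only, not the monolingual data used later in this thesis.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: one tokenized sentence per list.
sentences = [["tôi", "yêu", "hà", "nội"],
             ["tôi", "thích", "dịch", "máy"],
             ["hà", "nội", "là", "thủ", "đô"]]

# sg=1 -> Skip-gram (often better with limited data); sg=0 -> CBOW (faster on large data).
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["hà"].shape)             # a 50-dimensional word vector
print(cbow.wv.most_similar("tôi", topn=2)) # nearest neighbors in the CBOW space
```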
A year later, (Pennington et al., 2014) introduced Global Vectors (GloVe), a competitive set of pre-trained embeddings. GloVe learns representations of words through
matrix factorization. It proposes a weighted least squares objective LGloVe,
which minimizes the difference between the dot product of the embedding of a word
wi and its context word wj and the logarithm of their number of co-occurrences:

$$L_{GloVe} = \sum_{i,j=1}^{|V|} f(C_{ij})\,\bigl(w_i^{T} w_j + b_i + b_j - \log C_{ij}\bigr)^{2} \qquad (2.10)$$

where wi and bi are the word vector and bias of word i, wj and bj are the context
word vector and bias, Cij captures the number of times word i occurs in the context
of word j, and f is a weighting function that assigns relatively lower weight to rare
and frequent co-occurrences.
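The objective in Equation 2.10 can be written directly as a short NumPy function, as sketched below; the co-occurrence matrix and embeddings are random toy data, and the weighting function uses the cut-off x_max = 100 and exponent 0.75 reported in the GloVe paper.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, C, x_max=100, alpha=0.75):
    """Eq. 2.10: sum over observed pairs of f(C_ij) * (w_i . w_j + b_i + b_j - log C_ij)^2."""
    loss = 0.0
    for i, j in zip(*np.nonzero(C)):                 # only co-occurring pairs contribute
        f = min((C[i, j] / x_max) ** alpha, 1.0)     # weighting function f
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(C[i, j])
        loss += f * diff ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 4                                          # toy vocabulary size and dimension
C = rng.integers(0, 5, size=(V, V)).astype(float)    # toy co-occurrence counts
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(W, W_ctx, b, b_ctx, C))
```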


Figure 2.1: The CBOW model predicts the current word based on the context, and the Skip-gram predicts surrounding words based on the current word.

2.2.2 Cross-Lingual Word Embedding Models

Cross-lingual word embedding models learn cross-lingual representations of words
in a joint embedding space to represent meaning and transfer knowledge in cross-lingual applications. Recently, many models for learning cross-lingual embeddings
have been proposed, as shown in (Ruder et al., 2017) - a survey of cross-lingual word
embedding models. In this section, we introduce three models, from (Mikolov et al.,
2013b), (Xing et al., 2015) and (Conneau et al., 2017), which are used in our experiments to enhance the quality of the MT system. These models always assume
that two sets of embeddings have been trained independently on monolingual data.
Their work focuses on learning a mapping between the two sets such that translations
are close in the shared space.
Cross-lingual embedding model in (Mikolov et al., 2013b)
(Mikolov et al., 2013b) show that they can exploit the similarities of monolingual
embedding spaces by learning a linear projection between the vector spaces representing
each language. They first build vector representation models of the languages using
large amounts of monolingual data. Next, they use a small bilingual dictionary to
learn a linear projection between the languages. For this purpose, they use a dictionary of n = 5000 word pairs {xi, zi}, i = 1, ..., n, to find a transformation matrix W such
that W xi approximates zi. In practice, learning the transformation matrix W can



be considered as an optimization problem and it can be solved by minimizing the
following error function using a gradient descent method:
$$\min_{W} \sum_{i=1}^{n} \lVert W x_i - z_i \rVert^{2} \qquad (2.11)$$
At test time, given a new word and its continuous vector representation x, they
can map it to the other language space by computing z = Wx. The word whose
representation is closest to z in the target language space is then retrieved by using
cosine similarity as the distance metric.
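The sketch below illustrates this procedure in NumPy: the mapping W is fitted on dictionary pairs by ordinary least squares (a closed-form alternative to the gradient descent mentioned above), and a query vector is then translated by retrieving the cosine-nearest word in the target space. The dictionary, vocabulary and vectors are all toy data.

```python
import numpy as np

def fit_linear_mapping(X, Z):
    """Least-squares solution of min_W sum_i ||W x_i - z_i||^2 (Eq. 2.11)."""
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)  # rows of X, Z are dictionary pairs
    return W.T                                  # so that z ≈ W @ x

def translate(x, W, target_vocab, target_vecs):
    """Map x into the target space and return the cosine-nearest target word."""
    z = W @ x
    sims = target_vecs @ z / (np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(z))
    return target_vocab[int(np.argmax(sims))]

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))      # source vectors of the seed dictionary
true_W = rng.normal(size=(100, 100))
Z = X @ true_W.T                      # matching target vectors (toy, noise-free)
W = fit_linear_mapping(X, Z)

target_vocab = ["house", "cat", "river"]
target_vecs = rng.normal(size=(3, 100))
print(translate(rng.normal(size=100), W, target_vocab, target_vecs))
```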
Cross-lingual embedding model in (Xing et al., 2015)
Inspired by the work of (Mikolov et al., 2013b), (Xing et al., 2015) pointed out that the
Euclidean distance in the objective function shown in Equation 2.11 is fundamentally
different from the cosine distance, which is used to measure the ‘closeness’ of words
in the projection space, and hence causes an inconsistency. They solved this problem
by enforcing an orthogonality constraint on W. Equation 2.11 then becomes
the Procrustes problem (Schönemann, 1966), whose solution is obtained from the
singular value decomposition (SVD) of ZX^T, where X and Z are two matrices of size d × n containing the embeddings of the words in the
bilingual dictionary. The formula is as follows:
$$W^{*} = \arg\min_{W} \lVert WX - Z \rVert^{2} = UV^{T}, \quad \text{with } U\Sigma V^{T} = \mathrm{SVD}(ZX^{T}) \qquad (2.12)$$
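With the same notation (X and Z are d × n matrices of dictionary embeddings), the closed-form solution of Equation 2.12 takes only a few lines of NumPy, as sketched below with random stand-ins for the real embeddings.

```python
import numpy as np

def procrustes_mapping(X, Z):
    """Orthogonal W* = U V^T where U S V^T = SVD(Z X^T), Eq. 2.12."""
    U, _, Vt = np.linalg.svd(Z @ X.T)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 100, 5000
X = rng.normal(size=(d, n))                    # source dictionary embeddings (columns)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a "true" orthogonal mapping
Z = Q @ X                                      # target embeddings aligned by Q
W = procrustes_mapping(X, Z)
print(np.allclose(W, Q), np.allclose(W.T @ W, np.eye(d)))  # recovers Q; W is orthogonal
```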


Cross-lingual embedding model in (Conneau et al., 2017)
The two models above reported good performance on the word translation task by
using a small bilingual dictionary to learn the linear mapping. In this model, the
authors show how to learn the mapping W without using any bilingual data; their
model even outperforms existing supervised methods on cross-lingual tasks for
some pairs of languages. An illustration of this model is shown in Figure 2.2. In

Figure 2.2: Toy illustration of the cross-lingual embedding model.

