Parallel Texts Extraction
from the Web
by
Le Quang Hung
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Dr. Le Anh Cuong
A thesis submitted in fulfillment of the requirements for
the degree of Master of Information Technology
December, 2010
Contents
ORIGINALITY STATEMENT
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Parallel corpus and its role
  1.2 Current studies on automatically extracting parallel corpus
  1.3 Objectives of the thesis
  1.4 Contributions
  1.5 Thesis’ structure
2 Related works
  2.1 The general framework
  2.2 Structure-based methods
  2.3 Content-based methods
  2.4 Hybrid methods
  2.5 Summary
3 The proposed approach
  3.1 The proposed model
    3.1.1 Host crawling
    3.1.2 Content-based filtering module
      3.1.2.1 The method based on cognation
      3.1.2.2 The method based on identifying translation segments
    3.1.3 Structure analysis module
    3.1.4 Classification modeling
  3.2 Summary
4 Experiment
  4.1 Evaluation measures
  4.2 Experimental setup
  4.3 Experimental results
  4.4 Discussion
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future works
A LIBSVM tool
B Relevant publications
Bibliography
List of Figures
1.1 An example of English-Vietnamese parallel texts.
2.1 General architecture in building parallel corpus.
2.2 The STRAND architecture [1].
2.3 An example of aligning two documents.
2.4 The workflow of the PTMiner system [2].
2.5 The algorithm of translation pairs finder [3].
2.6 Architecture of the PTI system [4].
2.7 An example of the two links in the text.
3.1 Architecture of the Parallel Text Mining system.
3.2 Architecture of a standard Web crawler.
3.3 An example of a candidate pair.
3.4 Description of the process content-based filtering module.
3.5 An example of two corresponding texts of English and Vietnamese.
3.6 The algorithm measures similarity of cognates between a texts pair (Etext, Vtext).
3.7 Relationships between bilingual web pages.
3.8 The paragraphs can be denoted from HTML pages based on the tag <p>.
3.9 Identifying translation paragraphs.
3.10 A sample code written in Java to perform translation from English into Vietnamese via Google AJAX API.
3.11 Web documents and the source HTML code for two parallel translated texts.
3.12 An example of the publication date feature is extracted from a HTML page.
3.13 Classification model.
4.1 Figure for precision and recall measures.
4.2 The format of training and testing data.
4.3 Performance of identifying translation segments method.
4.4 Comparison of the methods.
List of Tables
1.1 Europarl parallel corpus: 10 aligned language pairs all of which include English.
3.1 Symbols and descriptions
4.1 URLs from three sites: BBC, VOA News and VietnamPlus
4.2 No. pages downloaded and No. candidate pairs.
4.3 Structure-based method.
4.4 Content-based method.
4.5 Method based on cognation.
4.6 Combining structural features and cognate information.
4.7 Identifying translation at document level.
4.8 Identifying translation at paragraph level.
4.9 Identifying translation at sentence level.
4.10 Overall results of each method (P: Precision, R: Recall, F: F-score).
Chapter 1
Introduction
In this chapter, we first introduce parallel corpora and their role in NLP
applications. Current studies, the objectives of the thesis, and its contributions
are then presented. Finally, the structure of the thesis is briefly described.
1.1 Parallel corpus and its role
Parallel text
Different definitions of the term “parallel text” (also known as bitext) can be
found in the literature. In the common understanding, a parallel text is a text in
one language together with its translation in another language. Dan Tufis [5]
gives the following definition: “parallel text is an association between two texts
in different languages that represent translations of each other”. Figure 1.1 shows
an example of English-Vietnamese parallel texts.
Parallel corpus
A parallel corpus is a collection of parallel texts. According to [6], the simplest
case is where only two languages are involved and one of the corpora is an exact
translation of the other (e.g., the COMPARA corpus [7]). However, some parallel
corpora exist in several languages. For instance, the Europarl parallel corpus [8]
includes versions in 11 European languages, as reported in Table 1.1. In addition,
the direction of the translation need not be constant, so that some texts in a
parallel corpus may have been translated from language L1 to language L2 and
others the other way around. The direction of the translation may not even be known.
Figure 1.1: An example of English-Vietnamese parallel texts.
Parallel corpora exist in several formats. They can be raw parallel texts or
aligned texts. The texts can be aligned at the paragraph level, the sentence level,
or even the phrase and word level. The alignment of the texts is useful for
different NLP tasks. Statistical machine translation [9, 10] uses parallel sentences
as the input for the alignment module, which produces word translation
probabilities. Cross-language information retrieval [11–13] uses parallel texts to
determine corresponding information in both questioning and answering. Extracting
semantically equivalent components of parallel texts, such as words, phrases, and
sentences, is useful for bilingual dictionary construction [14, 15]. Parallel texts
are also used for the acquisition of lexical translations [16] and for word sense
disambiguation [17]. In most of the tasks mentioned above, parallel corpora play
a crucial role in NLP applications.
Table 1.1: Europarl parallel corpus: 10 aligned language pairs all of which
include English.
Parallel Corpus (L1-L2)   Sentences    L1 Words     English Words
Danish-English            1,684,664    43,692,760   46,282,519
German-English            1,581,107    41,587,670   43,848,958
Greek-English               960,356    -            27,468,389
Spanish-English           1,689,850    48,860,242   46,843,295
Finnish-English           1,646,143    32,355,142   45,136,552
French-English            1,723,705    51,708,806   47,915,991
Italian-English           1,635,140    46,380,851   47,236,441
Dutch-English             1,715,710    47,477,378   47,166,762
Portuguese-English        1,681,991    47,621,552   47,000,805
Swedish-English           1,570,411    38,537,243   42,810,628
1.2 Current studies on automatically extracting parallel corpus
With the development of the Internet, the Web has become a huge database of
multilingual documents, which makes it useful for bilingual text processing.
For that reason, many studies [1–4, 18–22] have focused on mining parallel
corpora from the Web. Broadly, these studies fall into three groups:
content-based (CB) [3, 4, 22], structure-based (SB) [1, 2, 18], and hybrid
(a combination of both methods) [19–21].
The CB approach uses the textual content of the document pairs being evaluated.
It usually relies on lexical translations taken from a bilingual dictionary to
measure the similarity of the content of two texts. When a bilingual dictionary
is available, documents are translated word by word into the target language.
The translated documents are then used to find the best-matching parallel
documents by applying similarity score functions such as cosine, Jaccard, and
Dice. However, using a bilingual dictionary can be difficult because a word
usually has many translations.
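As a minimal sketch of the similarity score functions named above, the following Python code computes cosine, Jaccard, and Dice scores between a word-by-word translated document and a target-language document. The function names and the token-list representation are illustrative assumptions and are not taken from any of the cited systems.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
    norm_a = math.sqrt(sum(v * v for v in fa.values()))
    norm_b = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(tokens_a, tokens_b):
    """Twice the shared vocabulary size divided by the sum of both vocabulary sizes."""
    a, b = set(tokens_a), set(tokens_b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0
```

All three scores lie between 0 and 1. Cosine takes word frequencies into account, while Jaccard and Dice only compare vocabularies.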
Meanwhile, the SB approach relies on analysis of the HTML structure of pages.
This approach uses the hypothesis that parallel web pages are presented in similar
structures. The similarity of web pages is estimated from their HTML structure.
Note that this approach does not require linguistic knowledge. In addition, it is
very effective in filtering out a large number of unmatched documents, as it is
quite fast and accurate. Nevertheless, it has the drawback of requiring that two
pages with similar content be presented in the same structure. In our observation,
many sites use the same template for their pages, so the structure of pages is
similar while their content is different. For that reason, the HTML
structure-based approach is not applicable in some cases.
1.3 Objectives of the thesis
As introduced above, parallel corpora are a valuable resource for many NLP tasks.
Unfortunately, the available parallel corpora are not only relatively small but
also unbalanced, even for the major languages [3]. Where resources are available,
such as for English-French, the data are usually restricted to government
documents (e.g., the Hansard corpus) or newswire texts. Others have limited
availability due to licensing restrictions, as in [23]. According to [24], there are
now some reliable parallel corpora: the Hansard Corpus, the JRC-Acquis Parallel
Corpus, Europarl, and COMPARA. However, these resources exist only for some
language pairs.
In Vietnam, NLP is at an early stage, and the lack of parallel corpora is even
more severe. The lack of this kind of resource has been an obstacle to the
development of data-driven NLP technologies. There are few studies on mining
parallel corpora from the Web; one of them is presented in [22] (for the
English-Vietnamese language pair). Moreover, the current studies [1–4, 18–21],
while extremely useful, have the drawbacks mentioned in Section 1.2. Obtaining a
high-quality parallel corpus therefore remains a challenge, and it continues to
motivate many studies on this problem.
The objective of this research is to extract parallel texts from bilingual web
sites for the English-Vietnamese language pair. We first propose two new methods
of designing content-based features: (1) based on cognation, (2) based on
identifying translation segments. Then, we combine content-based features with
structural features under a machine learning framework.
1.4 Contributions
In our work, we aim to automatically extract English-Vietnamese parallel texts.
Encouraged by [20], we formulate this problem as a classification problem in order
to use as much as possible of the knowledge from structural information and
content similarity. The most important contribution of our work is that we
propose two new methods of designing content-based features and combine them
with structure-based features to extract parallel texts from bilingual web sites.
• The first method is based on cognation. It is worth emphasizing that, unlike
previous studies [2, 20], we use cognate information in place of word-by-word
translation. In our observation, when a text is translated from one language
to another, some special parts are kept unchanged or changed only slightly.
These parts are usually abbreviations, proper nouns, and numbers. We also
use other content-based features such as the length of tokens and the length
of paragraphs, which likewise do not require any linguistic analysis. It is
worth noting that this approach does not need any dictionary, so we think
it can be applied to other language pairs.
• The second method is based on identifying translation segments and is used
to match translated paragraphs. This helps us extract proper translation
units from bilingual web pages. Previous studies usually use lexical
translations taken from a bilingual dictionary to measure the similarity of
content of two texts, as in [4, 20]. This approach may face difficulty because
a word usually has many translations. In contrast, we use the Google
translator, which lets us exploit the advantages of statistical machine
translation: it helps with resolving lexical ambiguity, translating phrases,
and reordering.
1.5 Thesis’ structure
Given below is a brief outline of the topics discussed in the following chapters
of this thesis:
Chapter 2 - Related works
The studies closely related to our work are introduced in this chapter.
Chapter 3 - The proposed approach
We present our proposed model, including the general architecture of the model
and how structural features and content-based features are designed and estimated.
Chapter 4 - Experiment
This chapter evaluates the quality and effectiveness of our proposed method for
extracting parallel texts from the Web. The performance of our proposed method
and of the baseline is presented here.
Chapter 5 - Conclusion and Future works
Final conclusions about our work as a whole and the evaluation of the results in
particular are presented, followed by suggestions for possible future work.
Finally, the references list the research that is closely related to our work.
Chapter 2
Related works
In this chapter, we outline the general framework for building a parallel corpus.
Then, we review the studies that are closely related to our work.
2.1 The general framework
Figure 2.1: General architecture in building parallel corpus.
In general, there are two approaches to building a parallel corpus (illustrated in
Figure 2.1). The first is to automatically collect bilingual documents from the
Web. The process of identifying parallel texts is a simple step-by-step procedure:
(1) locate bilingual web sites, (2) crawl for URLs of possible parallel web pages,
and (3) match parallel pages. Content features and structural features are used
to extract parallel texts (the details of this task are presented in the next
sections).
The other approach is based on monolingual corpora [25]. As seen from the diagram,
starting with two large monolingual corpora (a non-parallel corpus) divided into
documents, this approach consists of three steps: (1) select pairs of similar
documents, (2) from each such pair, generate all possible sentence pairs and
pass them through a simple word-overlap-based filter, thus obtaining candidate
sentence pairs, and (3) present the candidates to a maximum entropy (ME)
classifier that decides whether the sentences in each pair are translations
of each other.
The next section presents related work on mining parallel corpora from the Web.
2.2 Structure-based methods
Generally speaking, parallel web pages within a site have comparable structures
and contents. Therefore, a large number of studies focus on finding
characteristics of HTML structures such as URL links, filenames, and HTML tags.
Recently, several systems have been developed to find parallel web pages on the
Web. In this section, we describe two of these systems: the original STRAND
[1, 18] and PTMiner [2].
The original STRAND is an architecture for structural translation recognition,
acquiring natural data. Its goal is to identify pairs of web pages that are
mutual translations. In order to do this, it exploits an observation about the way
that web page authors disseminate information in multiple languages: when
presenting the same content in two different languages, authors exhibit a very
strong tendency to use the same document structure. STRAND therefore locates
pages that might be translations of each other, via a number of different
strategies, and filters out page pairs whose page structures diverge by too much.
The STRAND architecture has three basic steps (illustrated in Figure 2.2):
• Location of pages that might have parallel translations,
• Generation of candidate pairs that might be translations, and
• Structural filtering out of nontranslation candidate pairs.
Figure 2.2: The STRAND architecture [1].
The heart of STRAND is a structural filtering process that relies on analysis of
the pages’ underlying HTML to determine a set of pair-specific structural values,
and then uses those values to decide whether the pages are translations of one
another. The first step in this process is to linearize the HTML structure and
ignore the actual linguistic content of the documents.
Both documents in the candidate pair are run through a markup analyzer that
acts as a transducer, producing a linear sequence containing three kinds of token:
[START:element label]    e.g., [START:H3]
[END:element label]      e.g., [END:H3]
[Chunk:length]           e.g., [Chunk:250]
The chunk length is measured in nonwhitespace bytes, and the HTML tags are normalized for case. Attribute-value pairs within the tags are treated as non-markup
text (e.g., <FONT COLOR=“BLUE”> produces [START:FONT] followed by
[Chunk:12]).
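The linearization step described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions, not the original STRAND markup analyzer: the class name `Linearizer` and the helper `nonspace_bytes` are hypothetical, and the length of an attribute string is measured on the raw start-tag text after the tag name.

```python
from html.parser import HTMLParser

def nonspace_bytes(text):
    """Length of the text in non-whitespace bytes (UTF-8 is an assumption)."""
    return len("".join(text.split()).encode("utf-8"))

class Linearizer(HTMLParser):
    """Turns an HTML page into a flat sequence of START/END/Chunk tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def _chunk(self, text):
        n = nonspace_bytes(text)
        if n:
            self.tokens.append(f"[Chunk:{n}]")

    def handle_starttag(self, tag, attrs):
        # Tag names are normalized for case.
        self.tokens.append(f"[START:{tag.upper()}]")
        # Attribute-value pairs are counted as non-markup text.
        raw = self.get_starttag_text() or ""
        self._chunk(raw[1 + len(tag):].rstrip("/>"))

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        self._chunk(data)

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For the input `<FONT COLOR="BLUE">` this sketch emits `[START:FONT]` followed by `[Chunk:12]`, which agrees with the example given in the text.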
The second step is to align the linearized sequences using a standard dynamic
programming technique. For example, consider two documents that begin as shown
in Figure 2.3.
Figure 2.3: An example of aligning two documents.
Using this alignment, the authors compute four values from the aligned structures:
the amount of non-shared material, the number of aligned non-markup text chunks
of unequal length, the correlation of the lengths of the aligned non-markup
chunks, and the significance level of the correlation. Machine learning, namely
a decision tree, is then used for filtering, based on these four values.
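To make the first two of these values concrete, the sketch below aligns two linearized token sequences and counts the non-shared material and the aligned chunks of unequal length. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the dynamic programming alignment, so it is an approximation and not the alignment procedure used by STRAND. The function names are assumptions.

```python
import re
from difflib import SequenceMatcher

CHUNK = re.compile(r"\[Chunk:(\d+)\]")

def align_key(token):
    # All chunks share one key so that they can align regardless of length.
    return "[Chunk]" if CHUNK.fullmatch(token) else token

def structural_values(seq_a, seq_b):
    """Return (proportion of non-shared tokens, aligned chunks of unequal length)."""
    keys_a = [align_key(t) for t in seq_a]
    keys_b = [align_key(t) for t in seq_b]
    matcher = SequenceMatcher(None, keys_a, keys_b, autojunk=False)
    shared = 0
    unequal_chunks = 0
    for block in matcher.get_matching_blocks():
        shared += block.size
        for k in range(block.size):
            ma = CHUNK.fullmatch(seq_a[block.a + k])
            mb = CHUNK.fullmatch(seq_b[block.b + k])
            if ma and mb and ma.group(1) != mb.group(1):
                unequal_chunks += 1
    total = len(seq_a) + len(seq_b)
    non_shared = (total - 2 * shared) / total if total else 0.0
    return non_shared, unequal_chunks
```

The remaining two values (the correlation of aligned chunk lengths and its significance level) would be computed from the same aligned chunk pairs and are omitted here.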
The PTMiner system [2] extracts bilingual English-Chinese documents. This system
uses a search engine to locate hosts containing parallel web pages. To generate
candidate pairs, PTMiner uses a URL-matching process (e.g., matching a URL such
as "../eng/..e.html" against the URL of its Chinese translation) and other
features such as size, date, etc. Note that the URLs of parallel pages do not
match in this way on most bilingual English-Vietnamese web sites.
Figure 2.4: The workflow of the PTMiner system [2].
The PTMiner implements the following steps (illustrated in Figure 2.4):
1. Search for candidate sites - Using existing Web search engines, search for the
candidate sites that may contain parallel pages.
2. Filename fetching - For each candidate site, fetch the URLs of Web pages
that are indexed by the search engines.
3. Host crawling - Starting from the URLs collected in the previous step, search
through each candidate site separately for more URLs.
4. Pair scan - From the obtained URLs of each site, scan for possible parallel
pairs.
5. Download and verify - Download the parallel pages, determine file size,
language, and character set of each page, and filter out non-parallel pairs.
In the experiment, several hundred selected pairs were evaluated manually. The
results were quite promising: on a corpus of 250 MB of English-Chinese text,
statistical evaluation showed that 90% of the identified pairs were correct.
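The pair scan step (step 4) can be illustrated with a small sketch that looks for URL pairs related by a substitution rule. The rule format, the function name `scan_pairs`, and the `example.org` URLs are hypothetical and only serve to show the idea of URL matching. They are not the actual patterns used by PTMiner.

```python
def scan_pairs(urls, substitutions):
    """Return (source_url, target_url) pairs linked by a substitution rule.

    `substitutions` is a list of (source_marker, target_marker) tuples,
    e.g. ("/en/", "/zh/"); these markers are illustrative assumptions.
    """
    url_set = set(urls)
    pairs = []
    for url in sorted(url_set):
        for src, dst in substitutions:
            if src in url:
                candidate = url.replace(src, dst)
                # Keep the pair only if the rewritten URL was actually crawled.
                if candidate != url and candidate in url_set:
                    pairs.append((url, candidate))
    return pairs
```

URLs without a counterpart under any rule are simply left unpaired, which is why this strategy fails on sites whose language versions use unrelated URLs.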
2.3 Content-based methods
The approach discussed thus far relies heavily on document structure. However,
as Ma and Liberman [3] point out, not all translators create translated pages that
look like the original page. Moreover, structure-based matching is applicable only
in corpora that include markup, and there are certainly multilingual collections on
the Web and elsewhere that contain parallel text without structural tags. All these
considerations motivate an approach to matching translations that pays attention
to similarity of content, whether or not similarities of structure exist. In this
section, we describe three systems: Bilingual Internet Text Search (BITS) [3],
Parallel Text Identification (PTI) [4], and Dang’s system [22].
The BITS system starts with a given list of domains to search for parallel
text. In this system, a translation lexicon (each entry of which lists a word in
language L1 and its translation in language L2) is used to find translation
token pairs. For a given text A in language L1, they first tokenize A and every
text B in language L2. The similarity between A and every text B in language L2
is then measured by the algorithm in Figure 2.5. Finally, they find the B that is
most similar to A; if the similarity between A and B is greater than a given
threshold t, then A and B are declared a translation pair. The similarity between
A and B is defined as
\[
\mathrm{sim}(A, B) = \frac{\text{Number of translation token pairs}}{\text{Number of tokens in text } A} \tag{2.1}
\]
In their experiment, Ma and Liberman use an English-German bilingual lexicon of
117,793 entries. The authors report 99.1% precision and 97.1% recall on a
hand-picked set of 600 documents (half in each language) containing 240
translation pairs (as judged by humans).
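Equation (2.1) and the threshold test can be sketched as follows. This is a simplified illustration and not the algorithm of Figure 2.5: it assumes that the lexicon is a dictionary from an L1 word to a list of L2 translations, and that each token of B can take part in at most one translation token pair. The names `bits_similarity` and `best_translation` are hypothetical.

```python
from collections import Counter

def bits_similarity(tokens_a, tokens_b, lexicon):
    """sim(A, B) = translation token pairs / number of tokens in A."""
    remaining = Counter(tokens_b)  # tokens of B not yet paired
    pairs = 0
    for token in tokens_a:
        for translation in lexicon.get(token, ()):
            if remaining[translation] > 0:
                remaining[translation] -= 1
                pairs += 1
                break
    return pairs / len(tokens_a) if tokens_a else 0.0

def best_translation(tokens_a, candidates, lexicon, threshold):
    """Index of the candidate most similar to A, or None if it is not above the threshold."""
    scored = [(bits_similarity(tokens_a, b, lexicon), i) for i, b in enumerate(candidates)]
    if not scored:
        return None
    score, index = max(scored)
    return index if score > threshold else None
```

Because the denominator is the length of A only, the measure is not symmetric: sim(A, B) and sim(B, A) generally differ.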
Figure 2.5: The algorithm of translation pairs finder [3].
The PTI system (illustrated in Figure 2.6) crawls the Web to fetch parallel
multilingual Web documents using a Web spider. To determine the parallelism
between potential bilingual document pairs, two different modules are developed.
A filename comparison module is used to check filename resemblance. A content
analysis module is used to measure the degree of semantic similarity. It
incorporates a novel content-based similarity scoring method that measures the
degree of parallelism of every potential document pair based on their semantic
content using a bilingual wordlist. The results showed that the PTI system
achieves a precision of 93% and a recall of 96% (180 instances are correct among
a total of 193 pairs extracted).
Figure 2.6: Architecture of the PTI system [4].
To our knowledge, there are few studies in this field related to Vietnamese.
Dang's system [22] built an English-Vietnamese parallel corpus based on
content-based matching. First, candidate web page pairs are found using the
features of sentence length and date. Then, the similarity of content is measured
using a bilingual English-Vietnamese dictionary, and the decision whether two
pages are parallel is made based on thresholds of this measure. Note that this
system only searches for parallel pages that are good translations of each other
and that are written in the same style. Moreover, word-by-word translation causes
much ambiguity. Therefore, this approach is difficult to extend as the data grows
or when it is applied to bilingual web sites with various styles.
Another instance of this approach uses, instead of a bilingual dictionary, a
simple word-based statistical machine translation system to translate texts from
one language to the other. [26] uses this method to build an English-Chinese
parallel corpus from a huge text collection of Xinhua Web bilingual news corpora
collected by the Linguistic Data Consortium (LDC). By adding the newly built
parallel corpus to their existing corpus, they reported an increase in the
translation quality of their word-based statistical machine translation in terms
of word alignment. A bootstrapping approach [27] can also be applied to
incrementally increase the number of both parallel sentences and bilingual
lexical vocabulary.
2.4 Hybrid methods
The latest version of STRAND [20] is another well-known web parallel text mining
system. Its goal is to identify pairs of web pages that are mutual translations.
The authors used the AltaVista search engine to search for multilingual web sites
and generated candidate pairs based on manually created substitution rules. The
heart of STRAND is a structural filtering process that relies on analysis of the
pages’ underlying HTML to determine a set of pair-specific structural values, and
then uses those values to filter the candidate pairs. This system also proposes a
new method that combines content-based and structure matching by using a
cross-language similarity score as an additional parameter of the structure-based
method. A translation lexicon is used to link tokens between pairs of parallel
documents. A link is a pair (x, y) in which x is a word in language L1 and y is a
word in L2. An example of two texts with links is illustrated in Figure 2.7.
Using the results of the maximum cardinality bipartite matching (MCBM) problem,
they define a translational similarity measure tsim as
\[
\mathrm{tsim} = \frac{\text{Number of two-word links in best matching}}{\text{Number of links in best matching}} \tag{2.2}
\]
Figure 2.7: An example of the two links in the text.
In the experiment, approximately 400 pairs were evaluated by human annotators.
STRAND produced fewer than 3500 English-Chinese pairs with a precision of 98%
and a recall of 61%.
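The tsim measure of equation (2.2) can be sketched with a simple augmenting-path algorithm for maximum cardinality bipartite matching. The sketch rests on assumptions that the text above does not spell out: the lexicon is a set of (L1 word, L2 word) pairs, and every token left unmatched counts as a one-word link (a link to an empty partner), so that the denominator is the number of two-word links plus the number of unmatched tokens on both sides. The function names are hypothetical.

```python
def max_bipartite_matching(left, right, can_link):
    """Size of a maximum matching between two token lists (augmenting paths)."""
    match_right = {}  # index in `right` -> index in `left`

    def try_augment(i, seen):
        for j in range(len(right)):
            if j in seen or not can_link(left[i], right[j]):
                continue
            seen.add(j)
            # Take a free token, or re-route the token currently matched to j.
            if j not in match_right or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(left)))

def tsim(tokens_l1, tokens_l2, lexicon):
    """Two-word links divided by all links in the best matching."""
    two_word = max_bipartite_matching(
        tokens_l1, tokens_l2, lambda x, y: (x, y) in lexicon
    )
    # Assumption: each unmatched token forms a one-word link.
    one_word = (len(tokens_l1) - two_word) + (len(tokens_l2) - two_word)
    total = two_word + one_word
    return two_word / total if total else 0.0
```

With this convention, two texts whose tokens can all be paired through the lexicon obtain a score of 1, and texts with no lexicon links obtain 0.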
Among other systems, [19] proposed a method that combines length-based and
content-based methods to match parallel texts using only the title part of a
web page. They achieved 100% accuracy, but the recall is not high because in many
cases the title of the corresponding text is not well translated. In [21],
URL-based, length-based, content-based, and HTML structure features are
incorporated within a k-nearest-neighbours classifier to match parallel texts for
English-Chinese. To identify a bilingual web site, they use the anchor and ALT
text information within an HTML page. If some of the pages contain text matching
a list of predefined strings that indicate English and Chinese, the page is
considered a bilingual page. [28] proposed a similar approach. The author presents
a system that automatically collects bilingual texts from the Internet. The
criteria for parallel text detection are based on size, HTML structure, and a
word-by-word translation model.
2.5 Summary
In this chapter, we presented related work on mining parallel corpora from the
Web. The content-based approach usually uses a bilingual dictionary to match
word pairs in two languages. Meanwhile, the structure-based approach relies on
analysis of the HTML structure of pages. In real implementations, both approaches
are usually employed together to get good performance. Generally, the
structure-based methods are applied first to quickly filter out documents that
clearly do not match a given document; the content-based methods are then applied
to find the right translational document pairs.
Chapter 3
The proposed approach
In this chapter, we introduce our proposed model, including the general
architecture of the model and how structural features and content-based features
are designed and estimated. We also present the classification modeling in our
system.
3.1 The proposed model
Our proposed approach combines content-based features and structure-based
features of HTML pages to extract parallel texts from the Web using machine
learning [20]. The machine learning algorithm used here is the Support Vector
Machine (SVM). Figure 3.1 illustrates the general architecture of our proposed
model. As shown there, the model includes the following tasks:
• Firstly, we use a crawler on the specified domains to extract bilingual
English-Vietnamese pages, which are called raw data.
• Secondly, from the raw data, we create candidate pairs of parallel web pages
by applying thresholds on extracted features (content-based features and
the date feature).
• Thirdly, we manually label these candidates to obtain the training data.
That is, some candidate pairs are assigned label 1 (parallel) and the other
candidate pairs are assigned label 0 (not parallel); the details of this
task are presented in the experiment section.
• Fourthly, we extract structural features and content-based features so
that each web page pair can be represented as a vector of these features.
This representation is required to fit a classification model.
• Finally, we use an SVM tool to train a classification system on this training
data. Given a pair of English-Vietnamese web pages for testing, the trained
classifier then decides whether it is parallel or not.
Figure 3.1: Architecture of the Parallel Text Mining system.
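The representation of a labeled web page pair as a feature vector can be sketched as below. The sketch assumes that the SVM tool is LIBSVM (listed in Appendix A of the contents) and that the data are written in LIBSVM's sparse text format, one line per example in the form `<label> <index>:<value> ...` with 1-based feature indices. The function name and the example feature values are hypothetical.

```python
def to_libsvm_line(label, features):
    """Format one labeled example as a line in LIBSVM's sparse text format."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        # Zero-valued features may be omitted in the sparse format.
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

For example, a candidate pair labeled 1 with four hypothetical feature values `[0.82, 0.0, 3, 0.5]` becomes the line `1 1:0.82 3:3 4:0.5`.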
3.1.1 Host crawling
According to [29], Web crawling is the process of locating, fetching, and storing
the pages on the Web. The computer programs that perform this task are referred
to as Web crawlers or spiders. In general terms, a Web crawler works as shown in
Figure 3.2. A typical Web crawler, starting from a set of seed pages, locates
new pages by parsing the downloaded pages and extracting the hyperlinks (links
for short) within them. Extracted links are stored in a FIFO fetch queue for
further retrieval. Crawling continues until the fetch queue is empty or a
satisfactory number of pages has been downloaded.
Figure 3.2: Architecture of a standard Web crawler.
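The crawling loop described above can be sketched in a few lines. To keep the example self-contained, the function `fetch_links` is an assumed callback that returns the links found on a page. A real crawler would download and parse the page at this point.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages):
    """Breadth-first crawl from the seed pages using a FIFO fetch queue."""
    queue = deque(seeds)   # FIFO fetch queue
    seen = set(seeds)      # URLs already queued, to avoid fetching twice
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        downloaded.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded
```

The two stopping conditions in the `while` loop correspond to the fetch queue becoming empty and to the page limit being reached.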
In our work, bilingual English-Vietnamese web pages are collected by crawling
the Web using a Web spider as in [4]. To execute this process, our system uses
Teleport-Pro to retrieve web pages from remote web sites. Teleport-Pro is a tool
designed to download documents from the Web via the HTTP and FTP protocols
and store the extracted data on disk [3]. Note that we select the URLs on the
specified hosts from three news sites: BBC, VietnamPlus, and VOA News.
For example, the URL on the BBC site for English is ”” and
” for Vietnamese. Then, we use Teleport-Pro
to download the HTML pages for obtaining the candidate web pages.
3.1.2 Content-based filtering module
The HTML pages are converted to plain text after they are retrieved from remote
web sites. Note that the original web pages usually contain user interface
components such as JavaScript, Flash, etc. We therefore use a simple script to
remove them and extract only the text for content-based matching.
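A cleaning script of this kind can be sketched with Python's standard `html.parser` module. This is an illustrative assumption about how such a script could look, not the script used in this work. The set of skipped elements is also an assumption.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of non-text elements."""

    SKIP = {"script", "style", "noscript", "object"}  # assumed list

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse runs of whitespace inside each text fragment.
            self.parts.append(" ".join(data.split()))

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Each text fragment is placed on its own line, which keeps paragraph boundaries available for later paragraph-level matching.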
Figure 3.3: An example of a candidate pair.
With content-based features, we want to determine whether two pages are mutual
translations. However, as [3] pointed out, not all translators create translated
pages that look like the original page. Moreover, structure-based matching is
applicable only in corpora that include markup, and there are certainly
multilingual collections on the Web and elsewhere that contain parallel text
without structural tags [20]. Many studies have used the content-based approach
to build a parallel corpus from the Web, such as [4, 22]. They use a bilingual
dictionary to measure the similarity of the contents of two texts. However, this
method can cause much ambiguity because a word usually has many translations.
For English-Vietnamese, one word in English can correspond to multiple words in
Vietnamese. To overcome this limitation, we propose two new methods of designing
features: (1) based on cognation, (2) based on identifying translation segments.
Figure 3.4: Description of the process content-based filtering module.
3.1.2.1 The method based on cognation
This method uses cognate information, which provides a cheap and reasonable
resource. The proposal is based on the observation that a document usually
contains some cognates, and if two documents are mutual translations then the
cognates are usually kept the same in both of them. Cognates are words that are
spelled similarly in two languages, or words that simply are not translated
(e.g., abbreviations). For example, if the word “WTO” appears in an English text,
it probably also appears as “WTO” in a Vietnamese text. Note that [30] also uses
cognates, but for sentence alignment. We divide the tokens considered as cognates
into three types as follows:
1. The abbreviations (e.g., “EU”, “WTO”),
2. The proper nouns in English (e.g., “Vietnam”, “Paris”), and