Extraction of Vietnamese collocation from text corpora

Table of Contents
1 Introduction
1.1 Definitions
1.2 Related works and motivation
1.3 Contribution of the thesis
2 Collocation: concept, roles and applications
2.1 Collocations’ characteristics
2.1.1 Recurrent
2.1.2 Arbitrary
2.1.3 Domain-dependent
2.1.4 Non-substitutability (closely linked in terms of vocabulary)
2.2 Classification of collocations
2.2.1 Idiomatic Phrases
2.2.2 Support Verb Construction
2.2.3 Fixed Phrases
2.3 Applications
2.4 Vietnamese collocations
3 Basic methods in Collocation extraction
3.1 Frequency
3.2 Hypothesis testing
3.2.1 T-Test
3.2.2 Chi-Square
3.3 Point-wise Mutual Information (PMI)
4 Our proposal for extracting Vietnamese collocation
4.1 Patterns for Vietnamese collocation
4.2 The Linguistic Measure
4.3 Designed model
5 Experiments
5.1 Data preparation
5.1.1 Collecting corpora
5.1.2 Extracting bi-grams
5.1.3 Adding syntactic information to bi-grams
5.2 The test models
5.3 Experimental results with statistical methods
5.3.1 Bi-grams with syntactic information
5.4 The experiments of our proposal
6 Conclusion
Bibliography
List of Figures
2.2 The collocation has Support Verb Construction
2.3 Fixed noun phrase
2.4 Some types of Vietnamese fixed phrases
3.1 Sample label type for filter of Vietnamese
3.2 Some collocations extracted by frequency
3.3 Some collocations extracted by the method T-Test
3.4 Some collocations extracted by the method Chi-Square
3.5 Some collocations extracted by the method PMI
4.1 A number of bi-grams and information about the main noun/verb and frequency of appearance
5.1 The label used by vnTagger
5.2 Output from the four methods on data without syntactic information
5.3 Output from the four methods on data that has been labelled and parsed
5.4 Results of experimental runs on all models with two input data sets
5.5 Output from the four methods and our combined method
5.6 Some bi-grams extracted after phase 2 of combined method
List of Tables
3.1 Sample label type for filter of English
3.2 Example using Chi-square
5.1 Output from the four methods on data without syntactic information
5.2 Output from the four methods on data that has been labelled and parsed
5.3 Output from the four methods and our combined method
Chapter 1
Introduction
1.1 Definitions
Firth [12] defines collocation as an abstract syntactic notion that is not directly related to the meanings of the words that constitute it. Choueka [7] defines a collocation as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit and whose meaning cannot be inferred directly from the meanings of its component words. According to Benson [1], a collocation is a fixed and recurrent combination of words. Thus, Firth attends to the lexical side of collocation, whereas Choueka concentrates on the syntactic function of collocations in text. Benson’s definition is one of the most commonly used, but it ignores a number of features of collocations that matter in applications such as machine translation. For example, a collocation often cannot be translated word by word from English into Vietnamese.
A collocation is an expression of two or more words that corresponds to a conventional way of saying things. Collocations are also described as a class of word groups that lies between idioms and free word combinations [5]. It is nevertheless usual to draw a line between a collocation and an idiom or phrase. An idiom or phrase may be defined as an expression that is peculiar to its language either grammatically or, especially, in having a meaning that cannot be derived from the sum of the meanings of its elements. It is often impossible to guess the meaning of an idiom from the words it contains. Moreover, the meanings of idioms are often stronger than the meanings of non-idiomatic phrases.
Many studies of collocation in English have been conducted, but no standard definition of collocation has emerged; the definition adopted depends on the viewpoint and purpose of each study.
In this thesis, we adopt the following definition: a collocation is a combination of words that often appear together within a normal range in text and whose positions and grammatical relations are relatively fixed.
1.2 Related works and motivation
A good example of the problem is Halliday’s contrast between strong tea and powerful tea (Halliday 1966: p150). It is a convention in English to talk about strong tea, not powerful tea, although any speaker of English would also understand the latter, unconventional expression. A collocation can thus be interpreted as a combination of words that does not follow any general rule of grammar or semantics. From some points of view, collocations are fixed and inflexible. The meaning of a collocation usually cannot be inferred from the meanings of its component words, and replacing one component with a synonym can completely change the meaning of the collocation. Collocations are also understood as idiosyncratic pragmatic combinations of lexical items (Fontenelle, 1992, p222): heavy rain, light breeze, great difficulty, grow steadily, meet requirement, reach consensus, pay attention, ask a question. Unlike idioms (kick the bucket, lend a hand, pull someone’s leg), their meaning is fairly transparent and easy to decode. Unlike regular productions (big house, cultural activity, read a book), collocational expressions are highly idiosyncratic, since the lexical items a headword combines with in order to express a given meaning are contingent upon that word (Mel’cuk, 2003).
As many researchers have pointed out (Cruse, 1986; Benson, 1990;
McKeown and Radev, 2000), collocations cannot be described by means of general
syntactic and semantic rules. They are arbitrary and unpredictable, and therefore
need to be memorized. They constitute the so-called semi-finished products of lan-
guage (Hausmann, 1985) or the islands of reliability (Lewis, 2000) on which the
speakers build their utterances.
In addition, collocation poses a special problem for linguistics. Syntax imposes constraints on word order or on the occurrence of particular phrasal types such as PPs or NPs, and lexical semantics imposes constraints of its own. Joachim Wermter and Udo Hahn [37] introduced a linguistic measure for identifying PP-verb collocations in German, which is based on the property of non- or limited modifiability.
Because collocations are so widespread, there is a large body of collocation extraction work, much of it concerning the English language: (Choueka, 1988; Church et al., 1989; Church and Hanks, 1990; Smadja, 1993; Justeson and Katz, 1995; Kjellmer, 1994; Sinclair, 1995; Lin, 1998), among many others. Choueka (1988) provides methods to detect (consecutive) n-grams simply by calculating co-occurrence frequency. Justeson and Katz (1995) apply a POS filter to the pairs they extract. (Kjellmer, 1994 [16]). Smadja (1993) uses the z-score together with several heuristics (e.g., the systematic occurrence of two lexical items at the same distance in the text) and extracts predicative collocations, rigid noun phrases, and phrasal templates. He then uses a parser to validate the results. Parsing is shown to increase accuracy from 40% to 80%. Church et al. (1989) and Church and Hanks (1990) use POS information and a parser to extract verb-object pairs, which are then ranked according to the mutual information (MI) measure. Lin (1998) [18, 20, 19] also proposes a hybrid approach based on a dependency parser: candidates are extracted and then compared with the MI results.
Collocations are also important in text production tasks such as machine translation [1, 24, 34] and in natural language processing [5, 13, 32, 38, 23]. Furthermore, they are useful in a variety of other applications, such as word sense disambiguation (Brown et al., 1991) and parsing (Alshawi and Carter, 1994). Collocations are particularly important because of their prevalence in natural language, across all domains and genres. According to Jackendoff (1997, 156) and Mel’Cuk (1998, 24), a large number of collocations appear in the vocabulary of a language. The past decade has witnessed considerable development of collocation extraction techniques for both monolingual and (parallel) multilingual corpora. We can mention here only part of this work: (Berry-Rogghe, 1973; Church et al., 1989; Smadja, 1993; Lin, 1998; Krenn and Evert, 2001) for monolingual extraction, and (Kupiec, 1993; Wu, 1994; Smadja et al., 1996; Kitamura and Matsumoto, 1996; Melamed, 1997) for bilingual extraction via alignment.
In the first paper on fuzzy decision making in this area, Raj Kishor Bisht and H.S. Dhami [2] suggest a way to assess whether a word combination can be considered a collocation. Fuzzy logic allows a logic-based model to be formed by utilizing the reasoning behind the existing methods. The resulting model has the simplicity of a logic-based model and performs better than the existing statistical models.
In the study of collocation, German is the second most investigated language. The earliest work is that of Breidt (1993), followed more recently by Krenn and Evert (Krenn and Evert, 2001; Evert and Krenn, 2001; Evert, 2004). Breidt uses MI and t-score and compares accuracy as different parameters change, such as window size, the presence or absence of lemmatization, corpus size, and the presence or absence of POS and syntactic information. Krenn and Evert (2001) then used a German chunker to extract syntactic patterns such as P-N-V. Their work laid the basis for formal methods and evaluation schemes in collocation extraction. Zinsmeister and Heid (2003, 2004) focused on N-V and A-N-V combinations identified with a stochastic parser.
Thanks to the outstanding work of Gross on lexicon-grammar (1984), French is one of the most studied languages with respect to the distributional and transformational properties of words. That work was done before the computer era and the advent of corpus linguistics; automatic extraction was performed later, for example in (Lafon, 1984; Daille, 1994; Bourigault, 1992; Goldman et al., 2001).
A number of collocation extraction methods have also been studied for other languages [25, 3, 27, 30]. Over the past 20 years, the field of natural language processing has achieved many accomplishments (such as part-of-speech tagging, topic detection, and information retrieval) [26, 33, 29, 21]. However, most of these were developed for Western languages, and their value is lost when they are applied to other languages. Only recently have Vietnamese researchers turned their attention to linguistics and to standards for the Vietnamese language. The necessary corpora have not been built to any common standard, and so far almost no resources are publicly available. This makes it difficult for newcomers to study or do research in this field.
Nguyen Cam Tu [35], in work on a discovery scheme for classifying and clustering Vietnamese web documents, uses N-gram-based statistical tests to extract meaningful phrases (collocations) from n-grams. That work names several statistical methods for identifying collocations, such as mutual information and hypothesis testing, where the null hypothesis states that the words of an n-gram are independent. The author applies hypothesis testing to n-grams (n <= 2), using the Chi-Square test to find collocations. Chi-Square values are calculated from a large data set (Vnexpress (199MB) and Wikipedia (270MB), covering about 200 subjects), and a threshold value (which the author calls coloThreshold) determines which n-grams are collocations.
1.3 Contribution of the thesis
Collocation extraction for English has been studied extensively; the extraction of Vietnamese collocations, however, is still a relatively new field. Little research has been conducted, and the results remain very limited. This thesis focuses on applying several statistical methods to extract collocations in Vietnamese, studying the effect of text preprocessing on extraction, and comparing the accuracy of the test models; we also propose a combined method to improve the accuracy of the program. Our goals are as follows:
• To investigate Vietnamese collocations: their definitions, characteristics, and classification, and some of their applications in machine translation and other natural language processing problems.
• To present several statistical methods for extracting collocations. More specifically, within the limits of this thesis, we examine four methods: the frequency-based method, two hypothesis testing methods, and the method based on mutual information. For each method, we present the relevant theoretical basis, show how to apply it to Vietnamese collocation extraction, and report experimental models, results, and an evaluation of the four methods on Vietnamese.
• To propose a method that combines statistical information, syntactic information, and a linguistic measure for identifying PP-verb and NP collocations. From this theoretical basis, we develop empirical models and evaluate the results and the accuracy of the program based on this method.
The remainder of the thesis is organized as follows. Chapter 2 presents an overview of collocation, covering the characteristics, classification, and applications of collocations, and introduces the concept of collocation in Vietnamese. Chapter 3 explores classical statistical methods. Chapter 4 presents the collocation extraction method that we propose, including the construction of empirical models based on the statistical methods for Vietnamese and a combined method that uses syntactic information and a linguistic measure. Chapter 5 presents the experimental results for the proposed models. Chapter 6 presents our conclusions.
Chapter 2
Collocation: concept, roles and applications
Because research on Vietnamese collocations is limited in both quantity and quality, the concept of collocation is unfamiliar to many people, even those who do research in natural language processing. This chapter gives a general introduction to collocation and relates it to the Vietnamese language, to help readers understand collocations and the need for a system that extracts Vietnamese collocations. More specifically, we consider these key questions:
• What are collocations?
• What are the characteristics of a collocation, and how many types of collocation are there?
• What are extracted collocations used for?
• What is a Vietnamese collocation, and how can Vietnamese collocations be extracted?
2.1 Collocations’ characteristics
According to the definition outlined earlier, a collocation has four main features:
2.1.1 Recurrent
The co-occurrence of the words that form a collocation in a document is not an isolated case: they are used together repeatedly in a certain context. Phrases such as to make a decision, to hit a record, and to perform an operation are common collocations in English text, and HIV/AIDS, chuyển dịch cơ cấu, and học hỏi kinh nghiệm are common collocations in Vietnamese text, whereas phrases such as to buy short, to ease the jib, vaccine, and kiểm thử phần mềm are collocations specific to particular areas of expertise. Both types of collocation are used repeatedly in different contexts.
2.1.2 Arbitrary
In a sense, a collocation behaves like an idiom or a fixed phrase. The meaning of a collocation usually cannot be inferred directly from the meanings of the words constituting it. In most cases, a collocation cannot be translated word by word from one language into another. For example, we can easily translate the Vietnamese phrase for open door into English or German, but we cannot translate the phrase cạnh tranh gay gắt word by word from Vietnamese into English or German. A learner of Vietnamese could not easily use the phrase cạnh tranh gay gắt without already knowing its meaning. Translating a text from one language into another requires not only knowledge of the rules of grammar and semantics but also knowledge of collocations, which are rigid; a bilingual corpus of collocations is therefore essential for an effective machine translation application.
2.1.3 Domain-dependent

Professional writing contains many collocations. Its terminology is often unfamiliar to those who do not study or do research in that field. In addition, some words are familiar to the reader but have a completely different meaning in specialized text. For example, information technology terms such as kỹ nghệ phần mềm, xử lý bundle, and tài nguyên hệ thống are entirely new to those who work in the social sciences, economics, or other fields. Besides, many phrases contain no specialized terminology, yet their meaning is unfamiliar to people outside the field. For example, in English text, a dry suit is not simply a suit that is dry; it is a special type of clothing that keeps sailors from getting wet in extreme weather conditions. Native speakers are often unaware of the rigidity of collocations in ordinary text; in specialized text, however, the rigidity of collocations can cause major difficulties even for them.
2.1.4 Non-substitutability (closely linked in terms of vocabulary)
We usually cannot replace a component of a collocation with one of its synonyms, because the substitution can completely change the original meaning of the phrase. This property is often used by lexicographers when compiling dictionaries and collections of collocations (Cowie [10]; Benson [1]). Lexicographers rely on the linguistic intuitions of other speakers to decide which phrases are collocations and which are not. They collect information by means of questionnaires in which each item is a phrase with one word removed. Native speakers can easily supply the missing word, whereas for language learners this is not simple. Collocations therefore have their own probability distribution (Halliday [6]; Cruse [11]). For example, the probability that the phrase red herring appears as consecutive words in a text is greater than the product of the probability of occurrence of red and the probability of occurrence of herring; in other words, the two words cannot be regarded as independent random variables. Based on this idea, we develop a set of statistical methods to select and identify collocations extracted from a large corpus.
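The independence argument above can be checked numerically. The following is a minimal sketch, not the implementation used in this thesis: the function name `association_ratio` and the toy counts are assumptions introduced for illustration, and probabilities are estimated simply as count divided by corpus size.

```python
def association_ratio(pair_count: int, w1_count: int, w2_count: int, total: int) -> float:
    """Return P(w1 w2) / (P(w1) * P(w2)) estimated from raw corpus counts.

    With P(x) = count(x) / total, the ratio simplifies to
    pair_count * total / (w1_count * w2_count).
    A value well above 1 suggests the two words are not independent.
    """
    if min(pair_count, w1_count, w2_count, total) <= 0:
        raise ValueError("all counts must be positive")
    return pair_count * total / (w1_count * w2_count)


# Toy counts (hypothetical, for illustration only).
ratio = association_ratio(pair_count=10, w1_count=100, w2_count=100, total=10_000)
print(ratio)
```

A ratio close to 1 is what independence would predict; the statistical tests in Chapter 3 refine this comparison so that chance co-occurrences in small samples are not overrated.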
2.2 Classification of collocations
Linguists and lexicographers have conducted many studies to provide a classification system for collocations. One classification system is based on the relationship between the components. Accordingly, there are two types of collocation: collocations based on grammatical relations and collocations based on semantic relations. Grammatically based collocations often include prepositions, with structures such as verb + preposition (for example, come to, put on), adjective + preposition (such as afraid of, fond of), and noun + preposition (e.g., by accident, witness to). Semantically based collocations are pairs of words that are lexically restricted.
Another classification system is based on the structure of the collocation. Accordingly, there are two types of collocations: compounds and collocations with a more flexible structure. A compound collocation consists of words that appear consecutively in the text and has a fixed syntactic function; noun + noun phrases are examples of this type. A flexible collocation is a pair of words, such as a subject and a verb, that may appear at a distance from each other (that is, with other words intervening).
We favour an approach that draws the line between collocations and free word combinations at the semantic layer [37], namely in terms of the compositionality between the components of a linguistic expression. On this view, there are three classes of collocation, based on varying degrees of semantic compositionality of the basic lexical entities involved:
2.2.1 Idiomatic Phrases
In this case, none of the lexical components involved contributes to the overall meaning in a semantically transparent way. The meaning of the expression is metaphorical or figurative. Idiomatic phrases may contain one, several, or no open slots. If slots exist, the phrase pattern determines the labels of the words that may fill them.
2.2.2 Support Verb Construction
The second class contains expressions in which at least one component contributes to the overall meaning in a semantically transparent way and thus constitutes its semantic core. This type of collocation has the most flexible structure. Its components often appear together repeatedly in a certain number of grammatical structures. For example: hostile takeover, make a decision. Figure 2.2 illustrates some predicate-related collocations in Vietnamese.
2.2.3 Fixed Phrases
These collocations include noun phrases that are terms in specific fields. The meaning of such a noun phrase cannot be inferred from the meanings of its component words. For example: stock market, foreign exchange, New York Stock Exchange, the Dow Jones average of 30 industrials. Figure 2.3 illustrates some collocations that form fixed noun phrases in Vietnamese.
Figure 2.2: The collocation has Support Verb Construction
Figure 2.3: Fixed noun phrase
2.3 Applications
Collocations occur in a great deal of text. The concept of collocation covers not only adjacent phrases in text but also idiomatic phrases and terminology. Two main issues arise: the rigidity of collocations and the inseparability of meaning among their components. Some phrases contain no grammatical errors and violate no rules, yet they are not considered correct, or are not accepted, simply because native speakers do not speak that way. This problem causes difficulties for beginners learning a language. The extraction of collocations can therefore help language learners become accustomed to the words and word combinations used by native speakers. The second issue concerns the meaning of a collocation. As mentioned above, the meaning of a collocation usually cannot be derived directly from the meanings of its component words. This characteristic has an important influence on machine translation systems. Users expect a machine translation system to produce target text that is as accurate and as fluent as possible. Translating a collocation word by word from one language into another not only reduces the accuracy of the system but also affects the fluency of the target text. Therefore, a program that can identify collocations and update bilingual collocation dictionaries increases not only the accuracy of translation but also the naturalness of the text. In addition, a bilingual corpus of collocations benefits sociolinguistic programs and many other applications.
2.4 Vietnamese collocations
Like other languages, Vietnamese has many collocations. For instance, we must use rửa rau to describe the action wash vegetables, but we cannot use rửa gạo to describe the action wash rice before cooking; the right phrase is vo gạo. According to the English-Vietnamese dictionary, collocation means "an arrangement in place, the placement order." In linguistics, collocation can be understood as "(a) use of words, (a) combination of words". In Vietnamese, there is a concept very close in meaning to collocation: the fixed phrase [4]. A fixed phrase is a combination of several words that exists as a ready-made unit, like a word, and has the semantic composition and stability of a word. The meaning of a fixed phrase is built up and organized in the manner of a phrase and is generally figurative. Therefore, the whole phrase generally cannot be understood from its surface, that is, from the meaning of each constituent alone. For example: anh hùng rơm, đồng không mông quạnh, tiếng bấc tiếng chì. Furthermore, the meaning of a fixed phrase as a whole corresponds to the structure of its material form, which makes it highly expressive; in fixed phrases such as rán sành ra mỡ, méo miệng đòi ăn xôi vò, and say như điếu đổ, expressiveness reaches its fullest extent. Fixed phrases should be distinguished from neighbouring units, since they are easily confused with compound words and free phrases. If we accept the category names provisionally, without yet fixing their conceptual content, the classification of Vietnamese fixed phrases can be summarized as follows [8]:
Figure 2.4: Some types of Vietnamese fixed phrases
The classification of Vietnamese fixed phrases above does not draw absolute boundaries between the categories, and not all units in a category show the properties of a pure type. Some intermediate units are formed in the manner of free expressions and are less stable but still concise. Others achieve high expressiveness, but the stability and cohesion of their structure are low.
The concepts of collocation and Vietnamese fixed phrase are very close, but for the Vietnamese collocation extraction problem, collocation is understood more broadly than fixed phrase. Given the defining characteristic of collocations (phrases of two or more words that frequently appear together), the Vietnamese collocation extraction problem becomes the problem of extracting n-grams whose words frequently appear together. In this problem, collocations include compound words, phrases, fixed phrases, and even free phrases if they occur with high frequency in the corpus.
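Framing extraction as n-gram counting can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's program: the function name `count_ngrams` and the toy sentences are hypothetical, and the input is assumed to be already segmented into words (syllables of one Vietnamese word are joined with underscores here).

```python
from collections import Counter
from typing import Iterable, Sequence


def count_ngrams(sentences: Iterable[Sequence[str]], n: int = 2) -> Counter:
    """Count every run of n consecutive tokens within each sentence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for tokens in sentences:
        # A sentence shorter than n yields an empty range and contributes nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy, pre-segmented sentences (hypothetical data).
toy_corpus = [
    ["chúng_tôi", "học_hỏi", "kinh_nghiệm"],
    ["họ", "học_hỏi", "kinh_nghiệm", "mới"],
]
bigrams = count_ngrams(toy_corpus, n=2)
print(bigrams[("học_hỏi", "kinh_nghiệm")])
```

Counting within sentence boundaries avoids pairing the last word of one sentence with the first word of the next; whether that choice matches the experiments in Chapter 5 is not stated in this section.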
Chapter 3
Basic methods in Collocation extraction
The classical approach to the study of collocation is that of lexicographers compiling dictionaries. According to Benson and Morton [1], the components of a collocation cannot be separated and handled independently. Therefore, no general pattern exists for extracting collocations; they must be selected manually and added to the dictionary. In recent years, statistical approaches have been applied to collocation extraction, owing to the growing availability of large machine-readable corpora. Choueka [7] developed programs that automatically extract collocations from text using n-grams of 2 to 6 words. A simple method for identifying collocations in a corpus is based on frequency of occurrence: if two or more words often appear together, they may well form a collocation. However, the n-grams with the highest frequency are sometimes not collocations; consider bigrams such as of the, in the, and to the. To solve this problem, Justeson and Katz [14] propose a heuristic that improves the accuracy of the program by passing bigrams through a filter based on part-of-speech labels. This filter passes only n-grams with specified structures. Patterns such as AN, NN, AAN, and ANN are used, where A denotes an adjective and N a noun. Although these heuristic methods are rather simple, they significantly improve the accuracy of the program. Frequency-based extraction methods have been used quite effectively for fixed noun phrases. However, they do not work well for collocations with a more flexible structure or collocations whose components are separated. Methods based on hypothesis testing and on mutual information were introduced to improve this situation. However, each method has strengths and weaknesses, and its performance depends on the data used. In the rest of this chapter, we describe in detail the four classical statistical methods used in collocation extraction: the frequency-based method, the t-test, the Chi-square test, and the method based on mutual information.
3.1 Frequency
This method is based on the assumption that a collocation is a combination of words that often appear together in text. If two words appear together more than a certain threshold number of times, they can be considered related to each other and may be treated as a collocation. However, the precision of this method is limited. We can improve it by passing each candidate bi-gram through a filter. This filter is based on the part-of-speech labels of the words in the input phrase, and passes only word sequences that are likely to be phrases. Justeson and Katz [14] provide such templates for English phrases. Table 3.1 illustrates the labels for English proposed by Justeson and Katz [15]. In Vietnamese, however, adjectives usually follow the noun they modify, and the positions of verbs, adjectives, and prepositions in sentences differ from English. We propose a label model for Vietnamese, shown in Figure 3.1. In this model, A represents an adjective, P a preposition, and N a noun. In our comparative experiments, extracting bi-grams that follow these patterns significantly improved the accuracy of frequency-based extraction.
Table 3.1: Sample tag patterns for the English filter
Tag pattern    Example
A N            linear function
N N            regression coefficients
A A N          Gaussian random variable
A N N          cumulative distribution function
N A N          mean squared error
N N N          class probability function
N P N          degrees of freedom
Here A denotes an adjective, N a noun, and P a preposition.
This is the simplest method for extracting collocations from text. However, it requires a large data set, and the accuracy of the program depends on the size of the corpus. In addition, it only extracts collocations that form a fixed pair of words.
Figure 3.1: Sample tag patterns for the Vietnamese filter
Figure 3.2: Some collocations extracted by frequency
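The frequency method with a tag filter can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the thesis: the function name `frequent_bigrams`, the threshold `min_count`, the toy corpus, and the tags "D" and "V" are all assumptions made for the example, and only the two bigram patterns of Table 3.1 (A N and N N) are used.

```python
from collections import Counter

# Bigram tag patterns taken from Table 3.1 (A = adjective, N = noun).
BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}


def frequent_bigrams(tagged_tokens, min_count=2, patterns=BIGRAM_PATTERNS):
    """Count adjacent word pairs whose tags match a pattern and keep
    those that occur at least min_count times (hypothetical threshold)."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return {bigram: c for bigram, c in counts.items() if c >= min_count}


# Toy hand-tagged corpus, invented for illustration only.
# "D" (determiner) and "V" (verb) are hypothetical tags.
corpus = [
    ("the", "D"), ("linear", "A"), ("function", "N"), ("is", "V"),
    ("a", "D"), ("linear", "A"), ("function", "N"), ("of", "P"),
    ("the", "D"), ("regression", "N"), ("coefficients", "N"),
]

result = frequent_bigrams(corpus, min_count=2)
print(result)
```

For Vietnamese, the same sketch would apply with the pattern set replaced by the patterns of Figure 3.1.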
3.2 Hypothesis testing
In many cases, two words co-occur purely by chance and do not form a collocation. A frequency-based approach cannot detect such cases, so hypothesis testing methods are used instead. Hypothesis testing decides whether to reject or retain a null hypothesis. In collocation extraction, it helps us determine whether two words appear together by chance or form a collocation. The null hypothesis $H_0$ states that there is no association between the occurrences of the two words. We then compute the probability $P$ that the observed event would occur if $H_0$ were true, reject $H_0$ if $P$ is too low (typically $P < 0.05$, 0.01, 0.005, or 0.001), and retain $H_0$ otherwise.
3.2.1 T-Test
The t-test is a method commonly used in hypothesis testing. It assumes that the probability distribution of the words $w_i$ surrounding the root word $w$ is normal. The null hypothesis is that the sample is drawn from a distribution with mean $\mu$. The t-test looks at the difference between the sample mean and the mean of that distribution. If $t$ is greater than a critical value $t_0$, the null hypothesis $H_0$ is rejected; otherwise, $H_0$ is retained. The value of $t$ is computed using the formula:
$$t = \frac{\bar{x} - \mu}{\sqrt{\sigma^2 / N}}$$
where $\bar{x}$ is the sample mean (= count($w_1$, $w_2$)/$N$), $\mu$ is the mean of the distribution (in this problem, $\mu = P(w_1 w_2)$ as predicted by $H_0$, that is, $P(w_1)P(w_2)$), $\sigma^2$ is the sample variance (= $p(1 - p) \approx p$ for very small $p$), and $N$ is the sample size.
After computing $t$, we look it up in the table of the t distribution for the chosen significance level $\alpha$. If $t$ is larger than the critical value $t_0$ corresponding to $\alpha$, we can reject the hypothesis $H_0$ with confidence $(1 - \alpha)$.
As an example of applying the t-test, let the null hypothesis be that the average height of men is 158 cm. We are given a sample of the heights of 200 men, with $\bar{x} = 169$ and $\sigma^2 = 2600$, and we want to determine whether the sample was drawn from that population, in other words whether it is consistent with the null hypothesis. The value of $t$ is calculated as follows:
$$t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$$
In the table of the t distribution, the critical value for a significance level of $\alpha = 0.005$ is $t_0 = 2.576$. Since $t = 3.05 > 2.576 = t_0$, we can reject the null hypothesis with 99.5% confidence. Therefore, the sample was not drawn from the population above.
To illustrate the use of the t-test in collocation extraction, we compute $t$ for the bigram new companies. We treat the corpus as a sequence of $N$ bigrams, and the sample as a set of indicator random variables, one per bigram, that take the value 1 when the bigram is new companies and 0 otherwise.
In our corpus, new appears 15,828 times, companies appears 4,675 times, and there are 14,307,668 bigrams in total. The probabilities of new and companies are calculated as follows:
$$P(\text{new}) = \frac{15828}{14307668}, \qquad P(\text{companies}) = \frac{4675}{14307668}$$
The null hypothesis is that new and companies occur independently of each other:
$$H_0: P(\text{new companies}) = P(\text{new})\,P(\text{companies}) = \frac{15828}{14307668} \times \frac{4675}{14307668}$$
If the null hypothesis is true, then randomly generating bigrams and assigning the value 1 when the generated bigram is new companies and 0 otherwise is a Bernoulli trial with $p = 3.615 \times 10^{-7}$, the probability that a bigram is new companies. The mean of this distribution is $\mu = 3.615 \times 10^{-7}$ and its variance is $\sigma^2 = p(1 - p) \approx p$ (since $p$ is very small). In the corpus, new companies occurs 8 times among the 14,307,668 bigrams, so the sample mean is $\bar{x} = \frac{8}{14307668} \approx 5.591 \times 10^{-7}$, and the sample variance is accordingly $\sigma^2 \approx \bar{x}$. From these values we compute $t$ for the pair new companies:
$$t = \frac{\bar{x} - \mu}{\sqrt{\sigma^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14307668}} \approx 0.999932$$
Because $t = 0.999932 \le 2.576 = t_0$ at significance level $\alpha = 0.005$, we cannot reject the null hypothesis that new and companies occur independently of each other, so they do not form a collocation.
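The two t-test calculations in this section can be reproduced with a short Python sketch. The function names `t_statistic` and `bigram_t` are hypothetical, and the sketch assumes, as in the worked example, that the sample variance of the bigram indicator is estimated from the observed relative frequency.

```python
import math


def t_statistic(sample_mean, mu, variance, n):
    """t = (sample mean - expected mean) / sqrt(variance / n)."""
    return (sample_mean - mu) / math.sqrt(variance / n)


def bigram_t(c1, c2, c12, n):
    """t score for a bigram, given the counts of each word (c1, c2),
    the count of the bigram (c12) and the number of bigrams (n)."""
    mu = (c1 / n) * (c2 / n)        # expected probability under H0
    x_bar = c12 / n                 # observed relative frequency
    variance = x_bar * (1 - x_bar)  # Bernoulli sample variance, close to x_bar
    return t_statistic(x_bar, mu, variance, n)


T0 = 2.576  # critical value for alpha = 0.005, as used in the text

t_height = t_statistic(169, 158, 2600, 200)
t_new_companies = bigram_t(15828, 4675, 8, 14307668)

reject_height = t_height > T0
reject_new_companies = t_new_companies > T0

print(round(t_height, 2), reject_height)
print(round(t_new_companies, 6), reject_new_companies)
```

The height example exceeds the critical value, while new companies does not, which matches the conclusions drawn above.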
3.2.2 Chi-Square
The t-test has a limitation: it assumes that probabilities are approximately normally distributed, which is rarely true in practice. The chi-square test is therefore used as an alternative. In the simplest case, the test is applied to two words, arranged in a 2×2 table such as Table 3.2. The test compares the observed frequencies in the table with the frequencies expected under independence. If the difference between the observed and expected frequencies is large, we can reject the null hypothesis of independence.
Table 3.2 shows the frequencies of new and companies in the corpus: C(new) = 15,828, C(companies) = 4,675, C(new companies) = 8, and N = 14,307,668 bigrams.
Figure 3.3: Some collocations extracted by the method T-Test
Table 3.2: Example using Chi-square
                   w1 = new    w1 ≠ new
w2 = companies     8           4667
w2 ≠ companies     15820       14287181
The chi-square statistic sums, over all cells $(i, j)$, the squared difference between the observed value and the expected value, divided by the expected value. Specifically, it is determined by the formula:
$$\chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
where $i$ is the row index, $j$ is the column index, $O_{i,j}$ is the observed value in cell $(i, j)$, $E_{i,j}$ is the expected value in cell $(i, j)$, and $N$ is the sample size. For a 2×2 table, $E_{i,j} = (O_{i1} + O_{i2})(O_{1j} + O_{2j})/N$.
The chi-square test can be applied to tables of any size. For 2×2 tables, the chi-square value can be computed with a simpler formula:
$$\chi^2 = \frac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}$$
Applying this formula to Table 3.2 gives:
$$\chi^2 = \frac{14307668\,(8 \times 14287181 - 4667 \times 15820)^2}{(8 + 4667)(8 + 15820)(4667 + 14287181)(15820 + 14287181)} \approx 1.55$$
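As a check on the calculation, the following Python sketch computes the chi-square value for Table 3.2 in two ways: with the 2×2 shortcut and with the general sum over cells. The function names are hypothetical, and the sketch assumes that N is taken as the sum of the four cells of the table, which is what makes the two formulas agree exactly.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Shortcut chi-square formula for a 2x2 table of observed counts."""
    n = o11 + o12 + o21 + o22  # assumption: N is the sum of the cells
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


def chi_square_general(table):
    """General formula: sum over cells of (O - E)^2 / E, where
    E = (row total) * (column total) / N."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Observed counts from Table 3.2.
table_3_2 = [[8, 4667], [15820, 14287181]]

chi2_short = chi_square_2x2(8, 4667, 15820, 14287181)
chi2_full = chi_square_general(table_3_2)

CRITICAL = 3.841  # critical value for alpha = 0.05, as used in the text
reject = chi2_short > CRITICAL

print(round(chi2_short, 2), round(chi2_full, 2), reject)
```

Both routes give a value of about 1.55, below the critical value of 3.841.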
Looking up the chi-square table, we find that the critical value at $\alpha = 0.05$ is $\chi^2 = 3.841 > 1.55$, so we cannot reject the null hypothesis that new and companies occur independently of each other. Hence new companies is not a collocation. Overall, in collocation extraction the t-test and Pearson's chi-square test do not differ greatly in their results. In some cases, the chi-square test performs better, namely when probabilities are large and the normality assumption of the t-test does not hold. For this reason, the chi-square test is often preferred to the t-test in collocation extraction. Figure 3.4 shows some of the results obtained when applying the chi-square test to collocation extraction.
Figure 3.4: Some collocations extracted by the method Chi-Square
3.3 Point-wise Mutual Information (PMI)

Church and Hanks [9] define a collocation as a pair of words that appear together more often than would be expected by chance. Collocation extraction based on mutual information derives from this definition. For two words $x$ and $y$ with occurrence probabilities $P(x)$ and $P(y)$, the mutual information $I(x, y)$ of the two words is defined as:
$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
Mutual information measures how much information the two elements $x$ and $y$ carry about each other. In information theory, mutual information is usually defined between random variables, not between values of random variables as we define it here. Fano defined mutual information as the amount of information that the occurrence of the event represented by $y'$ provides about the occurrence of the event represented by $x'$. For example, the mutual information tells us that the amount of information we have about the occurrence of Ayatollah at position $i$ in the corpus increases by 18.38 bits if we know that Ruhollah occurs at position $i + 1$. Conversely, the amount of information about the occurrence of Ruhollah at position $i + 1$ increases by 18.38 bits if we know that Ayatollah occurs at position $i$. We can also say that our uncertainty is reduced by 18.38 bits. In other words, we can be much more certain that Ruhollah will occur next if we know that the current word is Ayatollah.
Mutual information reflects independence between two events quite well: a value close to 0 indicates that the two events are independent. A value greater than 0, however, does not reliably reflect the degree of dependence between the two variables, because the score depends heavily on the frequencies of the two events. In other words, two words with a high mutual information score are not necessarily a collocation. One remedy is to consider only pairs whose frequency exceeds a threshold. However, this mitigates the problem rather than solving it. A further limitation is that the method assumes that the two words forming a collocation are mutually dependent, so the results often include pairs that are not collocations but are merely semantically related (e.g. doctor-nurse, doctor-dentist). As noted above, mutual information does not really reflect how likely two words $(x, y)$ are to form a collocation, so extraction methods based on mutual information are mostly found in theoretical research and are rarely used in practical applications. Figure 3.5 shows some of the collocations extracted using mutual information.
Figure 3.5: Some collocations extracted by the method PMI
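The mutual information score, together with the frequency threshold mentioned as a partial remedy, can be sketched in Python as follows. The function names `pmi` and `pmi_with_threshold` and the default `min_count` are assumptions made for illustration; the counts are those of new and companies from Section 3.2, and the logarithm is taken in base 2 so that the score is in bits.

```python
import math


def pmi(c1, c2, c12, n):
    """Pointwise mutual information (in bits) of a word pair, given the
    counts of each word (c1, c2), of the pair (c12), and the corpus size n."""
    p_x = c1 / n
    p_y = c2 / n
    p_xy = c12 / n
    return math.log2(p_xy / (p_x * p_y))


def pmi_with_threshold(c1, c2, c12, n, min_count=5):
    """Score only pairs seen at least min_count times (hypothetical
    threshold); rarer pairs are skipped by returning None."""
    if c12 < min_count:
        return None
    return pmi(c1, c2, c12, n)


pmi_new_companies = pmi(15828, 4675, 8, 14307668)
skipped = pmi_with_threshold(15828, 4675, 8, 14307668, min_count=10)

print(round(pmi_new_companies, 2), skipped)
```

For new companies the score is well under one bit, far below the 18.38 bits quoted for Ayatollah and Ruhollah, which is consistent with the earlier tests not treating this pair as a collocation.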
