
VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Tien Dat
FINDING THE SEMANTIC SIMILARITY IN
VIETNAMESE
GRADUATION THESIS
Major Field: Computer Science
Ha Noi – 2010
VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Tien Dat
FINDING THE SEMANTIC SIMILARITY IN
VIETNAMESE
GRADUATION THESIS
Major Field: Computer Science
Supervisor: Dr. Phạm Bảo Sơn
Ha Noi – 2010
Abstract
This thesis investigates the quality of semantic vector representations built with random
projection and the Hyperspace Analogue to Language (HAL) model for Vietnamese. The main
goal is to find semantically similar words, in particular synonyms, in Vietnamese. We are also
interested in the stability of our approach, which uses Random Indexing and HAL to represent
the semantics of words and documents. We build a system for finding synonyms in Vietnamese,
called the Semantic Similarity Finding System, and we evaluate the synonyms returned by our
system.
Keywords: Semantic vector, Word space model, Random projection, Apache Lucene


Acknowledgments
First of all, I wish to express my respect and my deepest thanks to my advisor
Pham Bao Son, University of Engineering and Technology, Viet Nam National
University, Ha Noi for his enthusiastic guidance, warm encouragement and useful
research experiences.
I would like to gratefully thank all the teachers of the University of Engineering
and Technology, VNU, for the invaluable knowledge they have provided me during
the past four academic years.
I would also like to send my special thanks to my friends in K51CA class, HMI
Lab.
Last, but not least, my family is truly my biggest motivation. My parents and my
brother always encourage me whenever I face stress and difficulty. I would like to
send them my great love and gratitude.

Ha Noi, May 19, 2010
Nguyen Tien Dat
Contents
Chapter 1
Introduction
Finding semantic similarity is an interesting problem in Natural Language
Processing (NLP). Determining the semantic similarity of a pair of words is important
in many NLP applications such as web mining [18] (search and recommendation systems),
targeted advertisement and domains that need semantic content matching, word sense
disambiguation, and text categorization [28][30]. Not much research has been done on
semantic similarity for Vietnamese, even though semantic similarity plays a crucial role
in human categorization [11] and reasoning, and computational similarity measures have
been applied in many fields such as semantics-based information retrieval [4][29],
information filtering [9] and ontology engineering [19].
Nowadays, the word space model is widely used in semantic similarity research.
In particular, there are several well-known approaches for representing the
context vector of a word, such as Latent Semantic Indexing (LSI) [17], Hyperspace
Analogue to Language (HAL) [21] and Random Indexing (RI) [26]. These approaches
have proven useful in implementing the word space model [27].
In this thesis, we adopt the word space model and implement it for computing
semantic similarity. We studied each method and investigated its advantages and
disadvantages in order to select a suitable technique for Vietnamese text data.
We then built a complete system for finding synonyms in Vietnamese, called the
Semantic Similarity Finding System. Our system combines several processes and
approaches so that it can easily return the synonyms of a given word. Our
experimental results on the task of finding synonyms are promising.
Our thesis is organized as follows. First, in Chapter 2, we introduce the background
knowledge on the word space model and review some of the solutions that have been
proposed for implementing it. In Chapter 3, we describe our Semantic Similarity
Finding System for finding synonyms in Vietnamese. Chapter 4 describes the
experiments we carried out to evaluate the quality of our approach. Finally,
Chapter 5 presents our conclusions and future work.
Chapter 2
Background Knowledge
2.1 Lexical relations
In this first section, we describe lexical relations in order to clarify the concepts of
synonymy and hyponymy. Lexical relations are difficult to define in a general way.
A definition is given by Cruse (1986) [35]: a lexical relation is a culturally
recognized pattern of association that exists between lexical units in a language.
2.1.1 Synonym and Hyponymy
Synonymy is the sameness, or at least close similarity, of meaning between
different linguistic expressions. Two words are synonymous if they have the same meaning [15].
Words that are synonyms are said to be synonymous, and the state of being a synonym
is called synonymy. For example, in English, the words “car” and “automobile”
are synonyms. In a figurative sense, two words are often said to be synonyms if they
have the same extended meaning or connotation.
Synonyms can belong to any part of speech (e.g. noun, verb, adjective or pronoun), as
long as the two words of a pair belong to the same part of speech. Some examples of
Vietnamese synonyms:
độc giả - bạn đọc (noun)
chung quanh – xung quanh (pronoun)
bồi thường – đền bù (verb)
an toàn – đảm bảo (adjective)
In a linguistics dictionary, a synonym is defined in three senses:
1. A word having the same or nearly the same meaning as another word or
other words in a language.
2. A word or an expression that serves as a figurative or symbolic substitute
for another.

3. Biology: A scientific name of an organism or of a taxonomic group that has
been superseded by another name at the same rank.
In linguistics, a hyponym is a word whose meaning is included in that of another word
[14].
Some examples in English: “scarlet”, “vermilion”, and “crimson” are hyponyms of
“red”.
And in Vietnamese: “vàng cốm”, “vàng choé” and “vàng lụi” are hyponyms of “vàng”,
where “vàng” refers to the colour.
In our thesis, we do not distinguish strictly between synonyms and hyponyms; we
treat a hyponym as a kind of synonym.
2.1.2 Antonym and Opposites
In lexical semantics, opposites are words that stand in a relationship of binary
incompatibility, such as female - male, long - short and to love - to hate. The
notion of incompatibility refers to the fact that one word in an opposite pair entails
that it is not the other member of the pair. For example, “something that is long”
entails that “it is not short”. Since there are exactly two members in a set of
opposites, the relationship is a binary one, and the relationship between opposites
is called opposition.
Opposites are simultaneously similar and different in meaning [12]. Usually,
they differ in only one dimension of meaning but are similar in most other respects,
including their grammatical behaviour. Some words are non-opposable; for example,
animal or plant species have no binary opposites or antonyms. Opposites may be viewed
as a special type of incompatibility. For example, incompatibility is found in the
opposite pair “fast - slow”:
It’s fast. - It’s not slow.
Some features of opposites are given by Cruse (2004): binarity, inherentness and
patency. Antonyms are gradable opposites; they lie at opposite ends of a continuous
spectrum of meaning.
An antonym is defined as follows:
“A word which has a meaning opposite to that of another word is called an
antonym.” [12]
The concept is commonly used together with the concept of synonyms. More antonym
examples in Vietnamese:
đi – đứng
nam – nữ
yêu – ghét
………
A word can have several different antonyms, depending on its meaning or context.
We study antonyms because they are a fundamental part of a language and stand in
contrast to synonyms.
2.2 Word-space model
The word-space model is an algebraic model for representing text documents or other
textual objects (phrases, paraphrases, terms, …). It uses a mathematical structure, the
vector, to identify or index terms in text documents. The model is useful in information
retrieval, information filtering and indexing. Its invention can be traced back to
Salton's introduction of the vector space model for information retrieval [29]. The term
“word space” is due to Hinrich Schütze (1993):
“Vector similarity is the only information present in Word Space:
semantically related words are close, unrelated words are distant.”
(p. 896)
2.2.1 Definition
Word-space models comprise a family of related methods for representing concepts in a
high-dimensional vector space. In this thesis, we use the name semantic vector
model for this family. The models include well-known approaches such as
Latent Semantic Indexing [17] and Hyperspace Analogue to Language [21].

Documents and queries are represented as vectors:

$$d_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$$
$$q = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$$

Each dimension corresponds to a separate term. If a document does not contain a term,
the term's value in the vector is zero; in contrast, if a term occurs in the document,
its value is non-zero. There are many ways to compute these values, but we study one
well-established scheme, tf-idf weighting (see Section 2.2.4).
The core principle is that semantic similarity can be represented as proximity in an
n-dimensional vector space, where n can be 1 or a very large number. We consider a
1-dimensional and a 2-dimensional word space in the figure below:
Figure 2.1: Geometric representation of words
The geometric representation above shows a few simple Vietnamese words. For example,
in both semantic spaces, “ô_tô” is closer in meaning to “xe_hơi” than to “xe_đạp”
or “xe_máy”.
The definition of a term depends on the application. Typically, terms are single
words or longer phrases. If words are chosen as terms, the dimensionality of the
vector is the number of words in the vocabulary.
2.2.2 Semantic similarity
As we have seen in the definition, the word-space model is a model of semantic
similarity. In other words, the geometric metaphor of meaning is: meanings are
locations in a semantic space, and semantic similarity is proximity between those
locations. The term-document vector represents the context of a term at low
granularity. Alternatively, a term vector can be created from the words surrounding
the term to compute its semantic vector [21]; this is another kind of semantic vector
model. To compare semantic similarity in a semantic vector model, we use the cosine
measure (Figure 2.2).
Figure 2.2: Cosine distance
In practice, it is easier to calculate the cosine of the angle between the vectors
than the angle itself:

$$\cos(d_j, q) = \frac{d_j \cdot q}{\|d_j\|\,\|q\|}$$

A cosine value of zero means that the query vector and the document vector are
orthogonal and do not match. The higher the cosine value, the more semantically
similar the two terms are.
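To make the measure concrete, here is a minimal sketch in Python of computing the cosine similarity between term vectors; the vectors for “ô_tô”, “xe_hơi” and “xe_đạp” are purely hypothetical, not taken from our data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical context vectors for three words over a 5-term vocabulary.
o_to   = [2, 0, 1, 3, 0]   # "ô_tô"
xe_hoi = [1, 0, 1, 2, 0]   # "xe_hơi"
xe_dap = [0, 3, 0, 0, 2]   # "xe_đạp"

print(cosine_similarity(o_to, xe_hoi))  # high: similar contexts
print(cosine_similarity(o_to, xe_dap))  # low: dissimilar contexts
```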
2.2.3 Document-term matrix
A document-term matrix and a term-document matrix are mathematical matrices that
record the frequency of terms occurring in a collection of documents. In a
document-term matrix, rows correspond to documents in the collection and columns
correspond to terms. In a term-document matrix, rows correspond to words or terms
and columns correspond to documents. One scheme for determining the values of these
matrices is tf-idf.
A simple example of a document-term matrix:
D1 = “tôi thích chơi ghita.”
D2 = “tôi ghét ghét chơi ghita.”
Then the document-term matrix is:

      tôi   thích   ghét   chơi   ghita
D1     1      1      0      1      1
D2     1      0      2      1      1

Table 2.1: An example of a document-term matrix

The matrix shows how many times each term appears in each document. The tf-idf
weighting scheme is described in detail in the next section.
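Before moving on to tf-idf, the following is a small sketch of how such a count matrix could be built from the two example sentences; whitespace tokenization is a simplification (real Vietnamese text needs word segmentation, see Section 2.3.2) and the function name is our own.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix) of raw term counts, one row per document."""
    tokenized = [doc.lower().rstrip(".").split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["tôi thích chơi ghita.", "tôi ghét ghét chơi ghita."]
vocab, m = document_term_matrix(docs)
print(vocab)   # ['chơi', 'ghita', 'ghét', 'thích', 'tôi']
print(m)       # [[1, 1, 0, 1, 1], [1, 1, 2, 0, 1]]
```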
2.2.4 Example: tf-idf weights
In the classic semantic vector model [31], the term weights in the document vectors
are products of a local and a global parameter; this is called the term
frequency-inverse document frequency (tf-idf) model.
The weight of term $t$ in document $d$ is

$$w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$

where $\mathrm{tf}_{t,d}$ is the term frequency of term $t$ in document $d$, and

$$\mathrm{idf}_t = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

is the inverse document frequency. Here $|D|$ is the number of documents and
$|\{d' \in D : t \in d'\}|$ is the number of documents in which the term $t$ occurs.
The distance between a document $d_j$ and a query $q$ can then be calculated as

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\|d_j\|\,\|q\|}.$$
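As an illustration only (not the weighting code used in our system), the following sketch computes tf-idf vectors for already tokenized documents.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """Compute tf-idf weighted vectors over a shared vocabulary."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for doc in tokenized_docs if t in doc) for t in vocabulary}
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors

docs = [["tôi", "thích", "chơi", "ghita"],
        ["tôi", "ghét", "ghét", "chơi", "ghita"]]
vocab, vecs = tf_idf_vectors(docs)
# Terms occurring in every document (e.g. "tôi") get idf = log(1) = 0,
# so only the discriminating terms "thích" and "ghét" keep non-zero weight.
print(list(zip(vocab, vecs[0])))
print(list(zip(vocab, vecs[1])))
```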
2.2.5 Applications
Over the past 20 years, semantic vector models have been strongly developed and have
proven useful for many important natural language processing tasks. Applications in
which semantic vector models play a major role include:
Information retrieval [7]: This is the basic setting for building applications that are
fully automatic and widely applicable across different languages or cross-lingually. The
system has flexible input and output options: typically, the user queries with any
combination of words or documents, and the system returns documents or words, so it is
very easy to build a web interface for users. Regarding cross-language information
retrieval, semantic vector models are more convenient than other systems for querying in
one language and matching relevant documents or articles in the same or other languages,
because they rely on fully automatic corpus analysis, whereas machine translation
requires vast lexical resources. Machine translation systems are also very expensive to
develop and lack coverage of the full lexicon of a language.
Information filtering [9]: This is a very interesting application. Information retrieval
assumes a relatively stable database and depends on user queries, while information
filtering (IF) deals with relatively stable information needs over a rapidly changing
data stream. IF also uses further techniques such as information routing and text
categorization or classification.
Word sense discrimination and disambiguation [28]: The main idea is to cluster the
weighted sums of the vectors of the words found in a paragraph of text, called the
context vector of a word. A co-occurrence matrix is computed as well (see Section 2.3.2),
and each appearance of an ambiguous word can then be mapped to one of the resulting
word senses.
Document segmentation [3]: Computing the context vector of a region of text helps to
categorize which kind of document the text belongs to. Given a document, the system can
tell whether it belongs, for example, to a sports, politics or law topic.
Lexical and ontology acquisition [19]: Starting from a few given words, called seed
words, and their relationships, many other similar words whose semantic vectors are
nearby can be acquired.
2.3 Word space model algorithms
In this section, we discuss common word space model algorithms. There are two kinds of
approaches to implementing the word space model: probabilistic approaches and
context-vector approaches. We focus on the context-vector approach, the most common way
to compute semantic similarity. We first introduce word co-occurrence matrices, which
represent the context vectors of words, and then study similarity measures for
calculating the distance between two context vectors.
2.3.1 Context vector
Context plays an important role in NLP. The quality of contextual information
is heavily dependent on the size of the training corpus: with less data available,
extracting contextual information for any given phenomenon becomes less reliable
[24]. However, the extracted semantics depends not only on the training corpus but also
on the algorithms we use. There are many methods for extracting context from a data
set, and their results are often very different.
Formally, a context relation is a tuple (w, r, w′), where w is a headword
occurring in some relation of type r with another word w′ in one or more sentences.
Each occurrence extracted from raw text is an instance of a context relation; that is,
the context relation/instance pair corresponds to the type/token distinction. We refer
to the tuple (r, w′) as an attribute of w.
The context instances are extracted from the raw text, counted and stored in
attribute vectors. Comparing attribute vectors gives us a way to compare the contexts
of words and thus to infer their semantic similarity.
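As a rough sketch of this idea, the code below extracts window-based attributes (r, w′) for each headword w, where the relation type r is simply the relative position; this simplified relation inventory and the function name are our own assumptions, not the extractor used later in the thesis.

```python
from collections import defaultdict, Counter

def extract_attributes(tokens, window=2):
    """Map each headword w to a Counter of attributes (r, w'),
    where r encodes the relative position of w' within the window."""
    attributes = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            relation = "L%d" % -offset if offset < 0 else "R%d" % offset
            attributes[w][(relation, tokens[j])] += 1
    return attributes

tokens = ["sinh_viên", "bồi_thường", "cho", "nhân_viên"]
attrs = extract_attributes(tokens, window=1)
print(attrs["bồi_thường"])  # attributes of the headword "bồi_thường"
```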
2.3.2 Word co-occurrence matrices
The approach developed by Schütze and by Qiu & Frei has become standard practice for
word-space algorithms [16]. The context of a word is defined as a row or column of a
co-occurrence matrix: the data are collected in a matrix of co-occurrence counts.
Definition
Formally, the co-occurrence matrix can be a words-by-words matrix, i.e. a square
W × W matrix, where W is the number of unique words in the parsed free-text corpus
(for Vietnamese, a word segmentation step is needed before this). A cell $m_{i,j}$ is
the number of times word $w_i$ co-occurs with word $w_j$ within a specific context, a
natural unit such as a sliding window of m words. Note that upper and lower case are
normalised before this step.
Words           Co-occurrences
tôi             (lãng mạn 1), (kiểm tra 1), (nói 1)
sinh viên       (bồi thường 1), (lãng mạn 1), (kiểm tra 1), (nói 1)
bồi thường      (sinh viên 1), (nhân viên 1)
lãng mạn        (tôi 1), (sinh viên 1)
nhân viên       (bồi thường 1), (kiểm tra 1)
kiểm tra        (tôi 1), (sinh viên 1), (nhân viên 1)
nói             (tôi 1), (sinh viên 1)

Table 2.2: Word co-occurrence table
Another kind of co-occurrence matrix is the words-by-documents matrix of size W × D,
where D is the number of documents in the corpus. A cell $f_{i,j}$ of this matrix gives
the frequency of word $w_i$ in document $j$. A simple example of a words-by-words
co-occurrence matrix is the following:

Word          tôi  sinh viên  bồi thường  lãng mạn  nhân viên  kiểm tra  nói
tôi            0       0          0          1          0         1       1
sinh viên      0       0          1          1          0         1       1
bồi thường     0       1          0          0          1         0       0
lãng mạn       1       1          0          0          0         0       0
nhân viên      0       0          1          0          0         1       0
kiểm tra       1       1          0          0          1         0       0
nói            1       1          0          0          0         0       0

Table 2.3: Co-occurrence matrix
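A minimal sketch of building such a words-by-words co-occurrence matrix with a sliding window is shown below; it assumes the input sentences are already word-segmented, and the toy corpus is ours.

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """Count, for every pair of words, how often they co-occur within
    `window` positions of each other (symmetric: both directions)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

# Word-segmented toy corpus (underscores mark Vietnamese compound words).
corpus = [["tôi", "lãng_mạn"],
          ["sinh_viên", "kiểm_tra", "nói"],
          ["nhân_viên", "bồi_thường", "sinh_viên"]]
m = cooccurrence_matrix(corpus, window=2)
print(dict(m["sinh_viên"]))  # neighbours of "sinh_viên" with their counts
```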
Both documents and windows can be used as contexts to compute semantic vectors, but the
quality of window contexts (high granularity) can be expected to be greater than that of
document contexts (low granularity).
Instantiations of the vector-space model
The vector-space model was developed by Gerald Salton and colleagues in the 1960s for
the SMART information-retrieval system [13]. Nowadays, many information retrieval
systems apply both types of weighting. One is the traditional weighting scheme
introduced by Robertson & Sparck Jones (1997); the other is known as the tf-idf family
of weighting schemes. The same holds for semantic vector algorithms that use a
words-by-documents co-occurrence matrix.
A words-by-words co-occurrence matrix counts the number of times word i co-occurs with
another word j. The co-occurrence is typically counted within a context window spanning
some number of words. When we count in both directions around the target word (one word
to the left and one to the right, two words to the left and two to the right, and so on),
the matrix is called a symmetric words-by-words co-occurrence matrix, in which each row
equals the corresponding column. In this thesis, we use context windows extending to
both sides of the target word in our experiments, and we evaluate results for several
context sizes. However, if we count in only one direction (left or right), we obtain a
directional words-by-words co-occurrence matrix; we refer to the former as a
left-directional words-by-words matrix and to the latter as a right-directional
words-by-words matrix. For a right-directional words-by-words matrix built from the
example data, the row and column vectors of a word are different: the row vector
contains co-occurrence counts with words that have occurred to the right of the word,
while the column vector contains co-occurrence counts with words that have occurred to
its left.
2.3.3 Similarity Measure
This section describes how to define and evaluate word similarity. The similarity
measure is based on the similarity between two context vectors: measuring semantic
similarity involves devising a function that measures the similarity between context
vectors. The context extractor returns a set of context relations with their instance
frequencies, which can be represented in the nested form (w, (r, w′)).
For a high-level comparison, one can use measuring functions from several families:
geometric distances, information retrieval measures, set generalizations, information
theory and distributional measures [24]. We only investigate geometric distances
because they are very popular and easy to understand. We define the similarity between
words w1 and w2, with context vectors x and y, by geometric distances such as:
Euclidean distance:

$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Manhattan (city-block) distance:

$$d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Other measures, such as the cosine similarity of Section 2.2.2, can also be used.
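A short sketch of these two geometric distances applied to hypothetical context vectors:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two context vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance between two context vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical co-occurrence counts for two words over the same contexts.
w1 = [3, 0, 2, 1]
w2 = [2, 1, 2, 0]
print(euclidean(w1, w2))  # smaller distance -> more similar contexts
print(manhattan(w1, w2))
```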
2.4 Implementation of word space model
In this section, we discuss some problems with word space algorithms: high-dimensional
matrices and data sparseness. We then look at different implementations of the word
space model and discuss some of their advantages and disadvantages.
2.4.1 Problems
High dimensional matrix
This is a major problem when building the co-occurrence matrix. When writing an
algorithm for the word-space model, the choice of vector similarity measure is not the
only design decision to make; another important issue is how to handle high-dimensional
data in the context vectors.
If we do not have enough data, we will not have a basis for building models of word
distribution. At the same time, the co-occurrence matrix becomes prohibitively large
for any reasonably sized data set, affecting the scalability and efficiency of the
algorithm. This leads to a delicate dilemma: we need a lot of data to build the
co-occurrence matrix, but we are limited by the computational cost of handling a
high-dimensional matrix.
Data sparseness

Another problem in creating vectors in the word-space model is that many cells in the
co-occurrence matrix are equal to zero. This is called data sparseness: only a small
fraction of the co-occurrence events (of two words, or of a word in a document) that
are possible in the matrix actually occur, regardless of data size. The vast majority
of words occur in only a very limited number of contexts with other words. This
phenomenon is well known and is related to Zipf’s law [34]. To solve the data
sparseness problem, and the high dimensionality as well, the dimension reduction
techniques described below are applied to the matrix.
Dimension reductions
The solution to the high-dimensionality problem is usually called dimensionality
reduction. A matrix built for representing terms or documents has very high
dimensionality. There are many methods for restructuring high-dimensional data in a
low-dimensional space, so that both the dimensionality and the sparseness of the data
are decreased; this makes it much easier to compute and compare context vectors.
In this thesis, we introduce one approach to do this: Singular Value Decomposition
(SVD) [10]. The SVD method is used especially in numerical mathematics, for example
for solving many linear systems with reasonable computational accuracy. Some modern
image compression methods are based on applying an SVD to the image (a matrix of
colour values); this is another possible application of the reduction model. In
particle physics, the singular value decomposition is used to diagonalize the mass
matrices of Dirac particles: the singular values give the masses of the particles in
their mass eigenstates, and from the transformation matrices U and V one constructs
the CKM matrix, which expresses the mass eigenstates of the particles as mixtures of
the possible flavour eigenstates.
The singular value decomposition is computationally expensive; its cost grows quickly
with n, the number of documents plus the number of terms, and with k, the number of
dimensions.
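As a minimal sketch of dimension reduction by truncated SVD (using NumPy; the tiny matrix and the choice k = 2 are only illustrative):

```python
import numpy as np

# Toy term-document count matrix A (rows = terms, columns = documents).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A; terms and documents can now
# be compared in the k-dimensional reduced space instead of the full one.
print(np.round(A_k, 2))
```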
In addition, we introduce another technique for tackling the high dimensionality:
Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003). LDA provides a
probabilistic generative model that treats documents as probabilistic mixtures of
underlying topics [2]. The model then uses an EM algorithm to estimate the parameters
of the k topics and to compute the topic mixture of each document; the per-document
cost grows with k and with N, the number of words in the document.
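As a hedged illustration only (our system does not use this library), a per-document topic mixture can be estimated with scikit-learn's LatentDirichletAllocation; the toy documents and parameters are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["tôi thích chơi ghita",
        "tôi ghét chơi ghita",
        "sinh_viên kiểm_tra bồi_thường",
        "nhân_viên bồi_thường kiểm_tra"]

# Bag-of-words counts; token_pattern keeps underscore-joined Vietnamese words.
vectorizer = CountVectorizer(token_pattern=r"[^\s]+")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row: topic mixture of one document
print(doc_topics.round(2))
```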
2.4.2 Latent Semantic Indexing
Latent Semantic Indexing is a useful method in information retrieval [7]. Techniques
such as LSI are particularly relevant for search over large data sets, such as documents
on the Internet. The goal of LSI is to find the major components of documents. These
principal components can be thought of as general concepts; for example, in English,
“building” is such a concept, covering terms such as “house” and “tower”. The method is
therefore suitable, for example, for very large collections of documents and articles
(such as those on the Internet): it can retrieve articles that are about cars even if
the word “car” does not explicitly occur in them. In addition, LSI can help distinguish
articles that are really about cars from those in which the word “car” is only
mentioned incidentally, for example on sites where a car is offered as a prize.
The semantics determined by LSI is computed and stored in a matrix, which in this case
is called the semantic space; the matrix records the multi-dimensional semantic
relationships found by LSI. Newly added content must be incorporated, which constantly
requires new calculations. In the process of LSI, the dimensionality of the matrix can
be reduced, since semantically related contents are grouped and categorized. Thanks to
the reduced matrix, calculations are simplified. The question is how far the dimensions
should be reduced.
Mathematical background
LSI approximates the term-frequency matrix [20] by a truncated singular value
decomposition. It is a method that reduces the dimensionality of the matrix to the
semantic units of a document collection, which further simplifies the calculation.
LSI is an additional procedure built on top of vector space retrieval: the familiar
term-document (TD) matrix is further processed, and shrunk, by LSI. This is useful in
particular for larger document collections, since TD matrices are generally very large.
The TD matrix is factorized by singular value decomposition, which decreases the
computational complexity and thus saves computing time in the retrieval process (the
comparison of documents or queries).
At the end of the algorithm, we obtain a new, smaller TD matrix in which the terms of
the original TD matrix are generalized into concepts.
Algorithm
The main steps of latent semantic indexing are:
• The term-document matrix is calculated and, where appropriate, weighted.
• The term-document matrix A is then decomposed into three components by a singular
value decomposition:

$$A = U S V^{T}$$

The two orthogonal matrices U and V contain the eigenvectors of $A A^{T}$ and
$A^{T} A$, respectively; S is a diagonal matrix containing the square roots of the
eigenvalues of $A^{T} A$, also called the singular values.
• The eigenvalues in the resulting matrix S can now be used to control the dimension
reduction. This is done by successively omitting the smallest singular values, down to
a chosen limit k.
• To process a search query q, it is projected into the semantic space; a query is
simply treated as a special, small document. The (possibly weighted) query vector q is
mapped with the following formula:

$$\hat{q} = S_k^{-1} U_k^{T}\, q$$

where $S_k$ consists of the first k diagonal elements of S and $U_k$ of the first k
columns of U.
• Each document is represented in the semantic space in the same way as q. The
projected query $\hat{q}$ is then compared with the documents using, for example, the
cosine similarity or the scalar product.
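The steps above can be sketched with NumPy as follows; the toy term-document matrix, the choice k = 2 and the variable names are our own illustration, not the thesis implementation.

```python
import numpy as np

# Term-document matrix A (rows = terms, columns = documents), possibly weighted.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the k-dimensional semantic space (one column per document).
docs_k = S_k @ Vt_k

# Fold a query (a "pseudo-document" over the same term vocabulary) into the space.
q = np.array([1.0, 0.0, 1.0, 0.0])
q_k = np.linalg.inv(S_k) @ U_k.T @ q

# Compare the projected query with each document by cosine similarity.
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(sims.round(2))
```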