


VIET NAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

NGUYEN CAM TU

HIDDEN TOPIC DISCOVERY TOWARD
CLASSIFICATION AND CLUSTERING IN
VIETNAMESE WEB DOCUMENTS

Major: Information Technology
Specialization: Information Systems
Code: 60 48 05

MASTER THESIS

SUPERVISOR: Prof. Dr. Ha Quang Thuy

HANOI - 2008
Acknowledgements
My deepest thanks must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who has been an endless source of inspiration in scientific research and who led me to this research area. I particularly appreciate his unconditional support and advice, in both the academic environment and daily life, during the last four years.
Many thanks go to Dr. Phan Xuan Hieu, who has given me much advice and many comments. This work would not have been possible without his support. I would also like to thank him for being my friend and older brother, who has taught me many lessons in both scientific research and daily life.
My thanks also go to all members of the "data mining" seminar group. In particular, I would like to thank BSc. Nguyen Thu Trang for helping me a great deal in collecting data and running experiments.
I gratefully acknowledge the invaluable support and advice, both technical and personal, of my teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi.
I also want to thank the support from Project QC.06.07 "Vietnamese Named Entity Resolution and Tracking crossover Web Documents", Vietnam National University, Hanoi; Project 203906 "Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages" of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ - ĐTCT-KC.01/06-10 "Developing content filter systems to support management and implementation public security – ensure policy".
Finally, from the bottom of my heart, I would especially like to thank all the members of my family and all my friends. They are truly an endless source of encouragement in my life.

Nguyen Cam Tu
Assurance
I certify that the achievements in this thesis are my own and are not copied from anyone else's results. Throughout the dissertation, everything presented is either my own proposal or is summarized from many sources. All the references have clear origins and are properly quoted. I am responsible for this statement.
Hanoi, November 15, 2007
Nguyen Cam Tu
iii
Table of Contents
Introduction 1

Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis 3
1.1. Introduction 3
1.2. The Early Methods 5
1.2.1. Latent Semantic Analysis 5
1.2.2. Probabilistic Latent Semantic Analysis 8
1.3. Latent Dirichlet Allocation 11
1.3.1. Generative Model in LDA 12
1.3.2. Likelihood 13
1.3.3. Parameter Estimation and Inference via Gibbs Sampling 14
1.3.4. Applications 17
1.4. Summary 17
Chapter 2. Frameworks of Learning with Hidden Topics 19
2.1. Learning with External Resources: Related Works 19
2.2. General Learning Frameworks 20
2.2.1. Frameworks for Learning with Hidden Topics 20
2.2.2. Large-Scale Web Collections as Universal Dataset 22
2.3. Advantages of the Frameworks 23
2.4. Summary 23
Chapter 3. Topics Analysis of Large-Scale Web Dataset 24
3.1. Some Characteristics of Vietnamese 24
3.1.1. Sound 24
3.1.2. Syllable Structure 26
3.1.3. Vietnamese Word 26
3.2. Preprocessing and Transformation 27
3.2.1. Sentence Segmentation 27
3.2.2. Sentence Tokenization 28

3.2.3. Word Segmentation 28
3.2.4. Filters 28

3.2.5. Remove Non Topic-Oriented Words 28
3.3. Topic Analysis for VnExpress Dataset 29
3.4. Topic Analysis for Vietnamese Wikipedia Dataset 30
3.5. Discussion 31
3.6. Summary 32
Chapter 4. Deployments of General Frameworks 33
4.1. Classification with Hidden Topics 33
4.1.1. Classification Method 33
4.1.2. Experiments 36
4.2. Clustering with Hidden Topics 40
4.2.1. Clustering Method 40
4.2.2. Experiments 45
4.3. Summary 49
Conclusion 50
Achievements throughout the thesis 50
Future Works 50
References 52
Vietnamese References 52
English References 52
Appendix: Some Clustering Results 56

List of Figures
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and
symmetric (b) parameterization. ( [55]) 9

Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [55]) 10
Figure 1.3. Graphical model representation of LDA - The boxes are "plates" representing
replicates. The outer plate represents documents, while the inner plate represents the
repeated choice of topics and words within a document [20] 12


Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult
stand for Dirichlet, Poisson, Multinomial distributions respectively 13

Figure 1.5. Quantities in the model of latent Dirichlet allocation 13
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation 16
Figure 2.1. Classification with Hidden Topics 20
Figure 2.2. Clustering with Hidden Topics 21
Figure 3.1. Pipeline of Data Preprocessing and Transformation 27
Figure 4.1. Classification with VnExpress topics 33
Figure 4.2 Combination of one snippet with its topics: an example 35
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline
(without topics) 37

Figure 4.4. Test-out-of-train with increasing numbers of training examples. Here, the
number of topics is set at 60 topics 37

Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60
topics 39

Figure 4.6. Clustering with Hidden Topics 40
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering 42
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query 47
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query 47
List of Tables
Table 3.1. Vowels in Vietnamese 24
Table 3.2. Tones in Vietnamese 25
Table 3.3. Consonants of the Hanoi variety 26
Table 3.4. Structure of Vietnamese syllables 26

Table 3.5. Functional words in Vietnamese 29
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset 29
Table 3.7. Statistics of VnExpress dataset 30
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100
topics 30

Table 3.9. Statistics of the Vietnamese Wikipedia Dataset 31
Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200
topics 31

Table 4.1 Google search results as training and testing dataset. The search phrases for
training and test data are designed to be exclusive 34

Table 4.2. Experimental results of baseline (learning without topics) 38
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset 38
Table 4.4. Some collocations with highest values of chi-square statistic 44
Table 4.5. Queries submitted to Google 45
Table 4.6. Parameters for clustering web search results 46
Notations & Abbreviations

Word or phrase Abbreviation
Information Retrieval IR
Latent Semantic Analysis LSA
Probabilistic Latent Semantic Analysis PLSA
Latent Dirichlet Allocation LDA
Dynamic Topic Models DTM
Correlated Topic Models CTM
Singular Value Decomposition SVD

Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of Web data is not organized in systematic and well-structured forms, a situation which poses great challenges to those seeking information on the Web. Consequently, many tasks enabling users to search, navigate, and organize web pages in a more effective way have been posed in the last decade, such as searching, page ranking, web clustering, text classification, and so on. To this end, there have been many success stories such as Google, Yahoo, the Open Directory Project (Dmoz), and Clusty, to name but a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which are able to overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack the data needed to search or learn effectively. Take web search clustering as an example. In order to meet the real-time requirement, that is, a sufficiently short response time, most online clustering systems work only with the small pieces of text returned by search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs when searching images based only on captions. Because image captions are very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources, such as learning with knowledge-base support or semi-supervised learning, in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another, even in the same domain.
In this thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and one for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis, based on recently successful topic analysis models such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a given domain, we collect a very large external data collection called a "universal dataset", and then build the learner on both the original data (such as snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The frameworks are flexible and general enough to apply to a wide range of domains and languages. Once we analyze a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is particularly useful for sparse data mining: sparse data such as snippets returned from a search engine can be expanded and enriched with hidden topics, and thus better performance can be achieved. Moreover, because the method learns from smaller data (the meaningful hidden topics rather than all the unlabeled data), it requires fewer computational resources than semi-supervised learning.
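As a rough sketch of this idea (hypothetical code: topic_model.infer, the pseudo-token scheme and all names are placeholders rather than the thesis's actual design), a sparse snippet can be enriched with the most probable hidden topics inferred from a model estimated on the universal dataset before it is passed to a learner:

# Hypothetical sketch of the enrichment step; `topic_model.infer` stands for any
# inference routine over a topic model estimated on the universal dataset.
def enrich(snippet_tokens, topic_model, top_k=5):
    """Append the most probable hidden topics to a sparse text snippet."""
    topic_dist = topic_model.infer(snippet_tokens)          # p(topic | snippet)
    top_topics = sorted(range(len(topic_dist)),
                        key=lambda k: -topic_dist[k])[:top_k]
    # topics are added as pseudo-tokens, so any bag-of-words learner can use them
    return snippet_tokens + ["topic_{}".format(k) for k in top_topics]

The concrete combination scheme used in the experiments is described in Chapter 4.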
Roadmap: The organization of this thesis is as follows.
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models, such as hierarchical models, author-role models, and entity models. They are also key components of our proposals in subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many application domains. The key phase common to the two frameworks is topic analysis of large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of the subsequent stages.
Chapter 3 summarizes the major issues in analyzing collections of Vietnamese documents and Web pages. We first review some characteristics of Vietnamese that are significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of preprocessing and transforming the data in more detail, highlighting important notes including specific characteristics of Vietnamese. We also demonstrate the results of topic analysis using LDA on the clean, preprocessed dataset.
Chapter 4 describes the deployments of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. The two implementations are based on the topic model estimated from a universal dataset as shown in Chapter 3.
The Conclusion sums up the achievements throughout the previous four chapters. Some
future research topics are also mentioned in this section.
Chapter 1. The Problem of Modeling Text Corpora and Hidden
Topic Analysis
1.1. Introduction
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model [48] (Salton and McGill, 1983), a methodology successfully deployed in modern search technologies, is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (or word). The i-th coordinate of a vector represents some function of the number of times the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates of each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-length lists of numbers.
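As a small illustration (the toy corpus is invented, not data used in this thesis), such a term-by-document count matrix can be built directly from tokenized documents:

import numpy as np

# Toy corpus used only for illustration
docs = [["web", "search", "clustering"],
        ["search", "engine", "web", "web"],
        ["topic", "model", "clustering"]]
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# X[i, j] = number of times term i occurs in document j
X = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w in d:
        X[index[w], j] += 1
print(vocab)
print(X)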
While the vector space model has some appealing features, notably its basic identification of sets of words that are discriminative for documents in the collection, the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To overcome these shortcomings, IR researchers have proposed other modeling methods such as the generalized vector space model and the topic-based vector space model, among which latent semantic analysis (LSA, Deerwester et al., 1990) [13][26] is the most notable. LSA uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of term weight features that captures most of the variance in the collection. This approach can achieve considerable reduction in large collections. Furthermore, Deerwester et al. argue that this method can reveal some aspects of basic linguistic notions such as synonymy or polysemy.
In 1998, Papadimitriou et al. [40] developed a generative probabilistic model of text corpora to study the ability of the LSA approach to recover aspects of the generative model from data. However, once we have a generative model in hand, it is not clear why we should follow the LSI approach; we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
Probabilistic LSI (pLSI, Hofmann, 1999) [21][22] is a significant step in this regard. The pLSI model treats each word in a document as a sample from a mixture model, where each mixture component is a multinomial random variable that can be viewed as a representation of a "topic". Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered a "reduced description" associated with the document.
While Hofmann's work is a useful step toward probabilistic text modeling, it suffers from severe overfitting problems: the number of parameters grows linearly with the number of documents. Additionally, although pLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20], proposed by Blei et al. (2003), is one solution to these problems. Like all of the above methods, LDA is based on the "bag of words" assumption, namely that the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of the LDA model, which we will consider carefully in Section 1.3.
More recently, Blei et al. have developed two extensions to LDA: Dynamic Topic Models (DTM, 2006) [7] and Correlated Topic Models (CTM, 2007) [8]. DTM is suitable for time-series data analysis because it drops the exchangeability assumption when modeling documents over time. CTM, on the other hand, is capable of revealing topic correlations; for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. Although CTM gives a better fit of the data than LDA, it is more complicated because it loses the conjugate relationship between the prior distribution and the likelihood.
In the following sections, we discuss the issues behind these modeling methods in more detail, with particular attention to LDA, a well-known model that has shown its efficiency and success in many applications.
1.2. The Early Methods
1.2.1. Latent Semantic Analysis
The main challenge for machine learning systems is to capture the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what is intended" or "what was referred to" in a text or utterance. This problem is twofold: (i) polysemy, i.e., a word may have multiple meanings and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have similar senses and, at least in certain contexts, specify the same concept or the same topic in a weaker sense.
Latent semantic analysis (LSA, Deerwester et al., 1990) [13][24][26] is a well-known technique that partially addresses this problem. The key idea is to map the document vectors from word space to a lower-dimensional representation in the so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra.

a. Latent Semantic Analysis by SVD
In the first step, we represent the text corpus as a term-by-document matrix whose element (i, j) describes the occurrences of term i in document j. Let X be such a matrix; X will look like this:

$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}$$

where row i corresponds to term $t_i$ and column j corresponds to document $d_j$.
Now a row in this matrix is a vector corresponding to a term, giving its relation to each
document:
$$\mathbf{t}_i^{T} = [x_{i,1}, \ldots, x_{i,n}]$$
Likewise, a column in this matrix will be a vector corresponding to a document, giving
its relation to each term:
$$\mathbf{d}_j = [x_{1,j}, \ldots, x_{m,j}]^{T}$$
Now, the dot product $\mathbf{t}_i^{T}\mathbf{t}_p$ between two term vectors gives us the correlation between the terms over the documents. The matrix product $XX^{T}$ contains all these dot products.
Element (i, p) (which equals element (p, i) due to symmetry) contains the dot product $\mathbf{t}_i^{T}\mathbf{t}_p$ ($= \mathbf{t}_p^{T}\mathbf{t}_i$). Similarly, the matrix $X^{T}X$ contains the dot products between all the document vectors, giving their correlation over the terms: $\mathbf{d}_j^{T}\mathbf{d}_q = \mathbf{d}_q^{T}\mathbf{d}_j$.
In the next step, we perform the standard SVD of the matrix X and obtain $X = U\Sigma V^{T}$, where U and V are orthogonal matrices ($U^{T}U = V^{T}V = I$) and the diagonal matrix $\Sigma$ contains the singular values of X. The matrix products giving the term and document correlations then become

$$XX^{T} = U\Sigma\Sigma^{T}U^{T} \quad \text{and} \quad X^{T}X = V\Sigma^{T}\Sigma V^{T}$$

respectively. Since $\Sigma\Sigma^{T}$ and $\Sigma^{T}\Sigma$ are diagonal, U must contain the eigenvectors of $XX^{T}$, while V must contain the eigenvectors of $X^{T}X$. Both products have the same non-zero eigenvalues, given by the non-zero entries of $\Sigma\Sigma^{T}$ or, equivalently, the non-zero entries of $\Sigma^{T}\Sigma$.
Now the decomposition looks like this:

$$X = U\Sigma V^{T} = [\mathbf{u}_1 \cdots \mathbf{u}_l]\, \mathrm{diag}(\sigma_1, \ldots, \sigma_l)\, [\mathbf{v}_1 \cdots \mathbf{v}_l]^{T}$$

The values $\sigma_1, \ldots, \sigma_l$ are called the singular values, and $\mathbf{u}_1, \ldots, \mathbf{u}_l$ and $\mathbf{v}_1, \ldots, \mathbf{v}_l$ the left and right singular vectors. Note that the only part of U that contributes to $\mathbf{t}_i$ is the i-th row. Let this row vector be called $\hat{\mathbf{t}}_i$. Likewise, the only part of $V^{T}$ that contributes to $\mathbf{d}_j$ is the j-th column, $\hat{\mathbf{d}}_j$. These are not the eigenvectors, but depend on all the eigenvectors.
The LSA approximation of X is computed by selecting the k largest singular values and their corresponding singular vectors from U and V. This yields the rank-k approximation to X with the smallest error. The appealing thing about this approximation is that not only does it have minimal error, but it also translates the term and document vectors into a concept space. The vector $\hat{\mathbf{t}}_i$ then has k entries, each giving the occurrence of term i in one of the k concepts. Similarly, the vector $\hat{\mathbf{d}}_j$ gives the relation between document j and each concept. We write this approximation as $X_k = U_k \Sigma_k V_k^{T}$. Based on this approximation, we can now do the following:
- See how related documents j and q are in the concept space by comparing the vectors $\hat{\mathbf{d}}_j$ and $\hat{\mathbf{d}}_q$ (usually by cosine similarity). This gives us a clustering of the documents.

- Compare terms i and p by comparing the vectors $\hat{\mathbf{t}}_i$ and $\hat{\mathbf{t}}_p$, giving us a clustering of the terms in the concept space.
- Given a query, view it as a mini document and compare it to the documents in the concept space.
To do the latter, we must first translate the query into the concept space with the same transformation used on the documents; since $\mathbf{d}_j = U_k \Sigma_k \hat{\mathbf{d}}_j$, we have $\hat{\mathbf{d}}_j = \Sigma_k^{-1} U_k^{T} \mathbf{d}_j$. This means that for a query vector $\mathbf{q}$ we must apply the translation $\hat{\mathbf{q}} = \Sigma_k^{-1} U_k^{T} \mathbf{q}$ before comparing it to the document vectors in the concept space.
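The steps above can be condensed into a short numpy sketch (a toy illustration under the same notation, not code from this thesis): truncate the SVD to k dimensions, then fold a query into the concept space via $\hat{\mathbf{q}} = \Sigma_k^{-1} U_k^{T} \mathbf{q}$ and rank documents by cosine similarity.

import numpy as np

# X: term-by-document count matrix (terms in rows, documents in columns)
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)
k = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]      # rank-k factors

doc_hat = Vt_k.T                  # each row: a document d-hat in the concept space
q = np.array([1, 0, 0, 1.0])      # a query over the same term vocabulary
q_hat = np.linalg.inv(S_k) @ U_k.T @ q        # fold the query into the concept space

# rank documents by cosine similarity to the query in the concept space
cos = doc_hat @ q_hat / (np.linalg.norm(doc_hat, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-cos))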
b. Applications
The new concept space can typically be used to:
- Compare the documents in the latent semantic space. This is useful to some typical
learning tasks such as data clustering or document classification.
- Find similar documents across languages, after analyzing a base set of translated
documents.
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy
are fundamental problems in natural language processing:
o Synonymy is the phenomenon where different words describe the same
idea. Thus, a query in a search engine may fail to retrieve a relevant
document that does not contain the words which appeared in the query.
o Polysemy is the phenomenon where the same word has multiple meanings.
So a search may retrieve irrelevant documents containing the desired words
in the wrong meaning. For example, a botanist and a computer scientist
looking for the word "tree" probably desire different sets of documents.
- Given a query of terms, we could translate it into the concept space, and find
matching documents (information retrieval).
c. Limitations
LSA has two drawbacks:
- The resulting dimensions might be difficult to interpret. For instance, in
  {(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}
  the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to
  {(car), (bottle), (flower)} → {(1.3452 * car + 0.2828 * bottle), (flower)}
  will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
- The probabilistic model of LSA does not match observed data: LSA assumes that
words and documents form a joint Gaussian model (ergodic hypothesis), while a
Poisson distribution has been observed. Thus, a newer alternative is probabilistic
latent semantic analysis, based on a multinomial model, which is reported to give
better results than standard LSA.
1.2.2. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis [21][22] (PLSA) is a statistical technique for
analysis of two-mode and co-occurrence data which has applications in information
retrieval and filtering, natural language processing, machine learning from text and in
related areas. Compared to standard LSA, PLSA is based on a mixture decomposition
derived from a latent class model. This results in a more principled approach which has a
solid foundation in statistics.
a. The Aspect Model
Suppose we are given a collection of text documents $D = \{d_1, \ldots, d_N\}$ with terms from a vocabulary $W = \{w_1, \ldots, w_M\}$. The starting point for PLSA is a statistical model called the aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable $z \in Z = \{z_1, \ldots, z_K\}$ is introduced to capture the hidden topics implied in the documents. Here, N, M and K are the numbers of documents, words, and topics respectively. Hence, we model the joint probability over $D \times W$ by the mixture as follows:

$$P(d,w) = P(d)\,P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z)\,P(z \mid d) \qquad (1.1)$$
Like virtually all statistical latent variable models, the aspect model relies on a conditional independence assumption, i.e., d and w are independent conditioned on the state of the associated latent variable (the graphical model representing this is shown in Figure 1.1(a)).

Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b)
parameterization. ( [53])
It is worth noting that the aspect model can be equivalently parameterized by (cf. Figure 1.1(b))

$$P(d,w) = \sum_{z \in Z} P(z)\,P(d \mid z)\,P(w \mid z) \qquad (1.2)$$
This is perfectly symmetric with respect to both documents and words.
b. Model Fitting with the Expectation Maximization Algorithm
The aspect model is estimated by the traditional procedure for maximum likelihood
estimation, i.e. Expectation Maximization. EM iterates two coupled steps: (i) an
expectation (E) step in which posterior probabilities are computed for the latent variables;
and (ii) a maximization (M) step where parameters are updated. Standard calculations
give us the E-step formula

$$P(z \mid d,w) = \frac{P(z)\,P(d \mid z)\,P(w \mid z)}{\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')} \qquad (1.3)$$

as well as the following M-step equations:

$$P(w \mid z) \propto \sum_{d \in D} n(d,w)\,P(z \mid d,w) \qquad (1.4)$$

$$P(d \mid z) \propto \sum_{w \in W} n(d,w)\,P(z \mid d,w) \qquad (1.5)$$

$$P(z) \propto \sum_{d \in D}\sum_{w \in W} n(d,w)\,P(z \mid d,w) \qquad (1.6)$$

where $n(d,w)$ denotes the number of times term w occurs in document d.
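A minimal numpy sketch of this EM loop is given below. It is illustrative only (the toy count matrix, random initialization, and number of iterations are assumptions, not the thesis's implementation) and follows updates (1.3)-(1.6) directly.

import numpy as np

rng = np.random.default_rng(1)
n_dw = np.array([[4, 2, 0, 0],     # n(d, w): term counts per document (toy data)
                 [3, 1, 1, 0],
                 [0, 0, 3, 4],
                 [0, 1, 2, 3]], dtype=float)
N, M, K = n_dw.shape[0], n_dw.shape[1], 2

def normalize(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

p_z = normalize(rng.random(K), 0)            # P(z)
p_d_z = normalize(rng.random((N, K)), 0)     # P(d|z), columns sum to 1
p_w_z = normalize(rng.random((M, K)), 0)     # P(w|z), columns sum to 1

for _ in range(100):
    # E-step (1.3): posterior P(z|d,w) for every (d, w) pair
    joint = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]   # (N, M, K)
    p_z_dw = joint / joint.sum(axis=2, keepdims=True)
    # M-step (1.4)-(1.6): re-estimate parameters from expected counts
    expected = n_dw[:, :, None] * p_z_dw                  # n(d,w) * P(z|d,w)
    p_w_z = normalize(expected.sum(axis=0), axis=0)       # (M, K)
    p_d_z = normalize(expected.sum(axis=1), axis=0)       # (N, K)
    p_z = normalize(expected.sum(axis=(0, 1)), axis=0)    # (K,)

print(np.round(p_w_z, 2))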
c. Probabilistic Latent Semantic Space
Let us consider the topic-conditional multinomial distributions $P(\cdot \mid z)$ over the vocabulary as points on the $(M-1)$-dimensional simplex of all possible multinomials. Via their convex hull, the K points define an $L \le K-1$ dimensional sub-simplex. The modeling assumption expressed by (1.1) is that the conditional distributions $P(w \mid d)$ for all documents are approximated by a multinomial representable as a convex combination of the $P(w \mid z)$, in which the mixture components $P(z \mid d)$ uniquely define a point on the spanned sub-simplex, which can be identified with a concept space. A simple illustration of this idea is shown in Figure 1.2.

Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [53])
In order to clarify the relation to LSA, it is useful to reformulate the aspect model as parameterized by (1.2) in matrix notation. By defining the matrices $\hat{U} = \big(P(d_i \mid z_k)\big)_{i,k}$, $\hat{V} = \big(P(w_j \mid z_k)\big)_{j,k}$ and $\hat{\Sigma} = \mathrm{diag}\big(P(z_k)\big)_k$, we can write the joint probability model P as the matrix product $P = \hat{U}\hat{\Sigma}\hat{V}^{T}$. Comparing this with SVD, we can draw the following observations: (i) outer products between rows of $\hat{U}$ and $\hat{V}$ reflect the conditional independence in PLSA, and (ii) the mixture proportions in PLSA substitute for the singular values. Nevertheless, the main difference between PLSA and LSA lies in the objective function used to specify the optimal approximation. While LSA uses the $L_2$ or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on counts, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any kind of squared deviation. On the modeling side, this offers crucial advantages: for example, the mixture approximation P of the term-by-document matrix is a well-defined probability distribution. In contrast, LSA does not define a properly normalized probability distribution, and the approximation of the term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
d. Limitations
In the aspect model, notice that d is a dummy index into the list of documents in the training set. Consequently, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures $p(z \mid d)$ only for those documents on which it is trained. For this reason, pLSI is not a well-defined generative model of documents; there is no natural way to assign probability to a previously unseen document.
A further difficulty with pLSA, which also originates from the use of a distribution indexed by training documents, is that the number of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model for acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al., 2001 [41]).
Latent Dirichlet Allocation (LDA), which is described in Section 1.3, overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.
1.3. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we discuss the generative model, parameter estimation, and inference in LDA in more detail.
1.3.1. Generative Model in LDA
Given a corpus of M documents denoted by $D = \{d_1, d_2, \ldots, d_M\}$, in which each document m consists of $N_m$ words $w_i$ drawn from a vocabulary of terms $\{t_1, \ldots, t_V\}$, the goal of LDA is to find the latent structure of "topics" or "concepts" that captures the meaning of the text, which is imagined to be obscured by "word choice" noise. Although the terminology of "hidden topics" or "latent concepts" has already been encountered in LSA and pLSA, LDA provides a complete generative model that has shown better results than the earlier approaches.
Consider the graphical model representation of LDA shown in Figure 1.3. The generative process can be interpreted as follows: LDA generates a stream of observable words $w_{m,n}$, partitioned into documents $\vec{d}_m$. For each of these documents, a topic proportion $\vec{\vartheta}_m$ is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator $z_{m,n}$ is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution $\vec{\varphi}_{z_{m,n}}$ is used to draw a word. The topics $\vec{\varphi}_k$ are sampled only once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4. Figure 1.5 gives a list of all the quantities involved.

Figure 1.3. Graphical model representation of LDA - The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]


Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult stand for Dirichlet,
Poisson, Multinomial distributions respectively.

Figure 1.5. Quantities in the model of latent Dirichlet allocation
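The generative process described above can be simulated directly. The following short Python sketch is illustrative only (the toy dimensions and symmetric hyperparameters are assumptions, not values used in this thesis): it draws the topics $\vec{\varphi}_k$ once for the corpus, then a topic proportion $\vec{\vartheta}_m$, a topic indicator, and a word for each token, mirroring Figure 1.4.

import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 20, 5                     # topics, vocabulary size, documents
alpha, beta = 0.5, 0.1                 # symmetric Dirichlet hyperparameters (assumed)

phi = rng.dirichlet(beta * np.ones(V), size=K)    # topic-term distributions phi_k
docs = []
for m in range(M):
    theta_m = rng.dirichlet(alpha * np.ones(K))   # topic proportion theta_m
    N_m = rng.poisson(15)                         # document length ~ Poiss
    words = []
    for _ in range(N_m):
        z = rng.choice(K, p=theta_m)              # topic indicator z_{m,n} ~ Mult(theta_m)
        w = rng.choice(V, p=phi[z])               # word w_{m,n} ~ Mult(phi_z)
        words.append(w)
    docs.append(words)
print(docs[0])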
1.3.2. Likelihood
According to the model, the probability that a word $w_{m,n}$ instantiates a particular term t, given the LDA parameters, is:

$$p(w_{m,n} = t \mid \vec{\vartheta}_m, \Phi) = \sum_{k=1}^{K} p(w_{m,n} = t \mid \vec{\varphi}_k)\, p(z_{m,n} = k \mid \vec{\vartheta}_m) \qquad (1.7)$$
which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a document, i.e., the joint distribution of all known and hidden variables given the hyperparameters:
$$p(\vec{d}_m, \vec{z}_m, \vec{\vartheta}_m, \Phi \mid \vec{\alpha}, \vec{\beta}) = \underbrace{\prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)}_{\text{word plate}} \cdot \underbrace{p(\vec{\vartheta}_m \mid \vec{\alpha})}_{\text{document plate (1 document)}} \cdot \underbrace{p(\Phi \mid \vec{\beta})}_{\text{topic plate}} \qquad (1.8)$$
Specifying this distribution is often simple and useful as a basis for other derivations. We can then obtain the likelihood of a document $\vec{d}_m$, i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions $\vec{\vartheta}_m$ and $\Phi$ and summing over $z_{m,n}$:
$$p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\Phi \mid \vec{\beta}) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)\; d\Phi\, d\vec{\vartheta}_m \qquad (1.9)$$

$$= \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\Phi \mid \vec{\beta}) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\vartheta}_m, \Phi)\; d\Phi\, d\vec{\vartheta}_m \qquad (1.10)$$
Finally, the likelihood of the complete corpus $W = \{\vec{d}_m\}_{m=1}^{M}$ is determined by the product of the likelihoods of the independent documents:

$$p(W \mid \vec{\alpha}, \vec{\beta}) = \prod_{m=1}^{M} p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) \qquad (1.11)$$
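For given $\vec{\vartheta}_m$ and $\Phi$, Eq. 1.7, and hence the conditional (log-)likelihood of a document, reduces to simple matrix operations. The small sketch below is only an illustration with made-up arrays (the true document likelihood in Eqs. 1.9-1.11 additionally integrates over $\vec{\vartheta}_m$ and $\Phi$).

import numpy as np

# Theta: M x K document-topic proportions, Phi: K x V topic-term distributions
Theta = np.array([[0.7, 0.3], [0.2, 0.8]])
Phi = np.array([[0.5, 0.4, 0.1],
                [0.1, 0.2, 0.7]])

# Eq. 1.7: p(w = t | theta_m, Phi) = sum_k phi_{k,t} * theta_{m,k}
p_w_given_doc = Theta @ Phi            # shape (M, V)

# conditional log-likelihood of an observed document (list of term ids)
doc_m, terms = 0, [0, 1, 2, 0]
log_lik = np.sum(np.log(p_w_given_doc[doc_m, terms]))
print(p_w_given_doc, log_lik)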
1.3.3. Parameter Estimation and Inference via Gibbs Sampling
Exact estimation for LDA is generally intractable. The common solution to this is to use
approximate inference algorithms such as mean-field variational expectation
maximization, expectation propagation, and Gibbs sampling [20].
a. Gibbs Sampling
Gibbs sampling is a special case of Markov chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models such as LDA. Through the stationary behavior of a Markov chain, MCMC methods can emulate high-dimensional probability distributions $p(\vec{x})$. This means that one sample is generated for each transition in the chain after a stationary state of the chain has been reached, which happens after a so-called "burn-in period" that eliminates the influence of the initialization parameters. In Gibbs sampling, the dimensions $x_i$ of the distribution are sampled alternately one at a time, conditioned on the values of all other dimensions, which we denote $\vec{x}_{\neg i}$. The algorithm works as follows:
1. Choose dimension i (randomly or by permutation)
2. Sample $x_i$ from $p(x_i \mid \vec{x}_{\neg i})$
To build a Gibbs sampler, the full conditionals $p(x_i \mid \vec{x}_{\neg i})$ must be calculated, which is possible using

$$p(x_i \mid \vec{x}_{\neg i}) = \frac{p(\vec{x})}{p(\vec{x}_{\neg i})} = \frac{p(\vec{x})}{\int p(\vec{x})\, dx_i}, \qquad \text{with } \vec{x} = \{x_i, \vec{x}_{\neg i}\} \qquad (1.12)$$
For models that contain hidden variables $\vec{z}$, their posterior given the evidence, $p(\vec{z} \mid \vec{x})$, is a distribution that is commonly required. With Eq. 1.12, the general formulation of a Gibbs sampler for such latent-variable models becomes

$$p(z_i \mid \vec{z}_{\neg i}, \vec{x}) = \frac{p(\vec{z}, \vec{x})}{p(\vec{z}_{\neg i}, \vec{x})} = \frac{p(\vec{z}, \vec{x})}{\int_Z p(\vec{z}, \vec{x})\, dz_i} \qquad (1.13)$$
where the integral changes to a sum for discrete variables. With a sufficient number of samples $\tilde{\vec{z}}_r,\ r \in [1, R]$, the latent-variable posterior can be approximated using

$$p(\vec{z} \mid \vec{x}) \approx \frac{1}{R} \sum_{r=1}^{R} \delta(\vec{z} - \tilde{\vec{z}}_r) \qquad (1.14)$$

with the Kronecker delta $\delta(\vec{u}) = \{1 \text{ if } \vec{u} = 0;\ 0 \text{ otherwise}\}$.
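As a toy illustration of the generic scheme in Eqs. 1.12-1.14 (not part of the thesis; the bivariate normal target and all names are assumptions), the following sketch alternately samples each dimension from its full conditional and discards a burn-in period before collecting samples.

import numpy as np

# Toy target: a bivariate normal with correlation rho, whose full conditionals
# are known in closed form: x_0 | x_1 ~ N(rho * x_1, 1 - rho^2), and vice versa.
rng = np.random.default_rng(0)
rho = 0.8
n_burn_in, n_samples = 1000, 5000

x = np.zeros(2)                   # arbitrary initialization
samples = []
for it in range(n_burn_in + n_samples):
    # sample each dimension conditioned on the current value of the other one
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho**2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho**2))
    if it >= n_burn_in:           # discard the burn-in period
        samples.append(x.copy())

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])   # close to rho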

b. Parameter Estimation
Heinrich [20] applied the above hidden-variable method to develop a Gibbs sampler for LDA, as shown in Figure 1.6. The hidden variables here are $z_{m,n}$, i.e., the topics that appear with the words $w_{m,n}$ of the corpus. We do not need to include the parameter sets $\Theta$ and $\Phi$, because they are just the statistics of the associations between the observed $w_{m,n}$ and the corresponding $z_{m,n}$, the state variables of the Markov chain. Heinrich [20] showed a sequence of calculations that leads to the following formulation of the full conditional for LDA:
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V} n_{k,\neg i}^{(v)} + \beta_v} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_m^{(z)} + \alpha_z\right] - 1} \qquad (1.15)$$

where $n_k^{(t)}$ is the number of times term t is assigned to topic k, $n_m^{(k)}$ is the number of words in document m assigned to topic k, $n_k$ and $n_m$ are the corresponding sums, and the subscript $\neg i$ indicates that the current word i is excluded from the counts (see Figure 1.5).
The other hidden variables of LDA can be calculated as follows:

$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \beta_v} \qquad (1.16)$$

$$\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_m^{(z)} + \alpha_z} \qquad (1.17)$$
Initialization:
    zero all count variables n_m^(z), n_m, n_z^(t), n_z
    for all documents m in [1, M] do
        for all words n in [1, N_m] in document m do
            sample topic index z_{m,n} ~ Mult(1/K)
            increment document-topic count: n_m^(z) + 1
            increment document-topic sum: n_m + 1
            increment topic-term count: n_z^(t) + 1
            increment topic-term sum: n_z + 1
        end for
    end for
Gibbs sampling over burn-in period and sampling period:
    while not finished do
        for all documents m in [1, M] do
            for all words n in [1, N_m] in document m do
                for the current assignment of z to a term t for word w_{m,n}:
                    decrement counts and sums: n_m^(z) - 1; n_m - 1; n_z^(t) - 1; n_z - 1
                multinomial sampling according to Eq. 1.15 (using the decremented counts from the previous step):
                    sample topic index z~ ~ p(z_i | z_{not i}, w)
                use the new assignment of z~ to the term t for word w_{m,n} to:
                    increment counts and sums: n_m^(z~) + 1; n_m + 1; n_{z~}^(t) + 1; n_{z~} + 1
            end for
        end for
        check convergence and read out parameters:
        if converged and L sampling iterations since last read out then
            the different parameter read-outs are averaged:
            read out parameter set Φ according to Eq. 1.16
            read out parameter set Θ according to Eq. 1.17
        end if
    end while

Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation.
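To make the procedure in Figure 1.6 concrete, the following is a minimal, self-contained Python sketch of a collapsed Gibbs sampler for LDA. It is an illustration only, not the implementation used in this thesis: the function name, the symmetric scalar hyperparameters, and the toy corpus are all assumptions. It maintains the count arrays n_m^(k) and n_k^(t), resamples each topic assignment from the full conditional of Eq. 1.15, and reads out Φ and Θ with Eqs. 1.16 and 1.17.

import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iters=500, rng=None):
    """Collapsed Gibbs sampling for LDA on `docs`, a list of term-id lists."""
    rng = rng or np.random.default_rng(0)
    M = len(docs)
    n_mk = np.zeros((M, K))          # document-topic counts n_m^(k)
    n_kt = np.zeros((K, V))          # topic-term counts     n_k^(t)
    n_k = np.zeros(K)                # topic sums            n_k
    z = [np.array([rng.integers(K) for _ in doc]) for doc in docs]

    # initialization: random topic assignments, accumulate counts
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            k = z[m][n]
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    for _ in range(n_iters):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                # remove the current assignment from the counts
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                # full conditional of Eq. 1.15 (the document-length term is constant in k)
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())
                # register the new assignment
                z[m][n] = k
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    # read out parameters according to Eqs. 1.16 and 1.17
    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# toy usage: 4 tiny documents over a vocabulary of 6 term ids, 2 topics
docs = [[0, 1, 2, 0], [0, 2, 1], [3, 4, 5, 4], [4, 5, 3]]
phi, theta = lda_gibbs(docs, V=6, K=2)
print(np.round(theta, 2))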
