Recognition and Resolution of Vietnamese Entity Ambiguity on the Web


VIETNAM NATIONAL UNIVERSITY, HANOI

RECOGNITION AND RESOLUTION OF VIETNAMESE ENTITY AMBIGUITY ON THE WEB

(Summary report of a VNU-level scientific research project)

Project code: QC.07.06
Principal investigator: MSc. Nguyen Cam Tu

Hanoi - 2008
TABLE OF CONTENTS

1. LIST OF ABBREVIATIONS 2
2. LIST OF PARTICIPANTS 3
3. LIST OF TABLES 4
4. LIST OF FIGURES 4
5. SUMMARY OF THE MAIN RESEARCH RESULTS OF THE PROJECT 5
5.1. Scientific results 5
5.2. Practical results 5
5.3. Training results 5
6. FULL REPORT 7
6.1. Problem statement 7
6.2. Overview of the research issues 7
6.3. Objectives and research content of the project 8
6.3.1. Research objectives 8
6.3.2. Research content 9
6.4. Location, time and research methods 18
6.5. Research results 18
6.5.1. Publications related to the project 18
6.5.2. Training results of the project 19
6.5.3. Application results of the project 19
6.6. Discussion 19
6.7. Conclusion 19
6.8. References 21
7. APPENDIX 23
1. LIST OF ABBREVIATIONS

Abbreviation  Meaning
LDA           Latent Dirichlet Allocation
HAC           Hierarchical Agglomerative Clustering
3. LIST OF TABLES

Table 1: List of entities in the test corpus 14

4. LIST OF FIGURES

Figure 1: Entity disambiguation using the vector space method 9
Figure 2: Model for recognizing and resolving Vietnamese entity ambiguity 12
Figure 3: Term part and topic part corresponding to an input document. Here the number of occurrences of a topic is proportional to its weight in the returned topic distribution 13
Figure 4: Experimental results on the first entity set without hidden topic analysis (Lambda=0.2) 16
Figure 5: Experimental results on the first entity set with hidden topic analysis 16
Figure 6: Disambiguation results without hidden topic analysis on the set of two source entities, the footballers Hong Son and Van Quyen 17
Figure 7: Disambiguation results with hidden topic analysis on the set of two source entities, the footballers Hong Son and Van Quyen (Lambda=0.3) 17
6. FULL REPORT

6.1. Problem statement

Entity recognition and disambiguation (cross-document coreference resolution, entity disambiguation) is the process of determining whether mentions of the same name actually refer to the same real-world entity (Kibble and Deemter, 2000). For example, consider the following passage:

John Smith was considered and appointed to the position of chairman of the board. In the past, Mr. Smith was regarded as a perfect choice. However, John, his good friend, was not considered for this position.

Entity recognition and disambiguation aims to determine whether John Smith and "Mr. Smith" are the same person, and whether John (in the third sentence) refers to that same person. The problem is often extended to resolve references such as "he" or even "his best friend", but we do not study those cases here.

Solving this problem well will contribute significantly to improving the quality of search and extraction systems, for example those that process references to "favorite people" in news reports (BNN 2001), or systems for Topic Detection and Tracking (Allan 2002).

Cross-document entity recognition and disambiguation checks whether mentions of the same name in different documents refer to the same entity. This problem is even more complex than coreference resolution within a single document, because the documents are often taken from different sources and written by many authors with different conventions and writing styles (Bagga and Baldwin, 1998), or even in different languages.
6.2. Overview of the research issues

Bagga and Baldwin (1998) presented an algorithm for cross-document entity recognition and disambiguation using the vector space model. Many current research results build on this work. Some systems, such as NetOwl by ISOQuest and Textract by IBM, can identify many entity names referring to the same entity, but cannot distinguish different entities that share the same name.

TIPSTER Phase III was the first program to identify cross-document coreference analysis as a research area, as it is a central component of multi-document summarization and information fusion systems (Bagga and Baldwin, 1998). The 6th Message Understanding Conference (MUC-6) also recognized cross-document coreference as a promising problem, but left it outside the scope of the conference because it was considered too hard.

Vietnamese documents on the Web are an extremely rich and useful resource, but exploitation of this resource is unfortunately still very limited. In this study, we aim to build a module for recognizing and resolving Vietnamese entity ambiguity in documents returned by the GOOGLE search engine.

Our research content focuses on vector space models, statistical methods, semi-supervised learning, and clustering. The research results will form a basis for further research on this problem and for more effective exploitation of Vietnamese documents on the Web.
6.3. Objectives and research content of the project

6.3.1. Research objectives

The project aims to strengthen the research and deployment capacity of the Data Mining and Applications research group at the College of Technology, according to the following criteria:
- Research and propose a model for entity disambiguation on the Web
- Build the essential utility tools needed to deploy typical applications of entity-oriented search and information extraction on the Internet
- Train high-quality research staff within the framework of the project
6.3.2. Research content

6.3.2.1. Methods for entity recognition and disambiguation on the Internet

One of the first studies on cross-document entity disambiguation is the work of [Bagga, Baldwin, 1998]. Their method can be summarized as in the figure below:
[Figure 1: Entity disambiguation using the vector space method. The diagram shows per-document coreference chains for each input document (produced by the University of Pennsylvania's coreference system), a Sentence Extractor module, and a VSM-Disambiguate module that outputs cross-document coreference chains.]
The input of the system is a set of documents containing ambiguous entities. First, the document set is passed through a within-document coreference resolution system. The output of this step is a set of coreference chain lists, one per document. The Sentence Extractor module finds, in each document, the sentences containing these coreference chains and generates corresponding summaries. The summaries are represented as vectors in which each element is the weight of the corresponding keyword (term weights can be computed by methods such as TF or TF-IDF). Next, the authors use the Cosine measure to compute the similarity between pairs of vectors (corresponding to pairs of summaries). If the similarity between two summaries exceeds a given threshold, they are considered to mention the same entity. The output is a set of document chains, each chain considered to refer to the same entity.
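The pipeline step just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summaries and the 0.3 threshold below are invented examples.

```python
from collections import Counter
from math import sqrt

def tf_vector(summary):
    """Term-frequency vector of a whitespace-tokenized summary."""
    return Counter(summary.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = sqrt(sum(w * w for w in v1.values()))
    norm2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

s1 = tf_vector("John Smith was appointed chairman of the board")
s2 = tf_vector("Mr Smith was seen as a perfect choice for chairman")
s3 = tf_vector("the football club signed a new striker")

# Summaries whose similarity exceeds the threshold are linked into one chain.
THRESHOLD = 0.3
print(cosine(s1, s2) > THRESHOLD)  # True (similarity is about 0.34)
print(cosine(s1, s3) > THRESHOLD)  # False (similarity is about 0.13)
```

In a real system the vectors would use TF-IDF weights over extracted summary sentences rather than raw TF over whole strings.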
The authors collected test data by searching with regular-expression-style queries such as /John * Smith/. In this way, they gathered a set of about 173 documents mentioning 11 different John Smiths. Using the B-CUBED evaluation method, they report disambiguation results with an F1 of about 84.6%.
[Mann and Yarowsky, 2003] approach entity disambiguation as a clustering problem, with features being biographical information about the entities extracted from the Internet. To extract biographical information, the authors propose a set of patterns of the form {<name> was born in <birth year>}. They experimented with two datasets: (1) real data, obtained by issuing queries to Google and retrieving documents that refer to different entities sharing the same name; (2) a pseudo dataset (generated similarly to how test data is generated for word sense disambiguation), built by collecting documents about different entities with distinct names and then replacing those entity names with the pseudo-name [PERSON-X]. In this way, one obtains a document set about different entities that all share the surface name PERSON-X. Experimenting with different feature selections and weighting methods (TF-IDF, MI), the group achieved a precision of about 88% and a recall of 73%.
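Pattern-based biography extraction of this kind can be sketched with a single template. The regex below is an invented stand-in for the authors' larger pattern set, shown only to illustrate the idea.

```python
import re

# Template: <name> was born in <birth year>
BORN_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was born in (?P<year>1[89]\d{2})"
)

def extract_birth_facts(text):
    """Return (name, birth year) pairs matched by the template."""
    return [(m.group("name"), m.group("year"))
            for m in BORN_PATTERN.finditer(text)]

doc = ("John Smith was born in 1965 and grew up in Boston. "
       "A second John Smith was born in 1972.")
print(extract_birth_facts(doc))
# [('John Smith', '1965'), ('John Smith', '1972')]
```

Mentions whose extracted biographical facts conflict (different birth years, here) can then be placed in different clusters.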
[Gooi and Allan, 2004] also follow the clustering approach, but unlike [Mann and Yarowsky, 2003], they take only the words in a window of length 55 around each mention of the ambiguous entity and merge them into a single snippet per document. The authors experimented on the PERSON-X pseudo dataset with different methods, using a distance measure based on the Kullback-Leibler divergence, and hierarchical clustering. The best result the group achieved was about 88.2% F1.
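The Kullback-Leibler distance between snippet word distributions can be sketched as follows. The smoothing constant and the toy snippets are illustrative assumptions, not values from the paper.

```python
from collections import Counter
from math import log

def word_dist(snippet, vocab, eps=1e-6):
    """Smoothed unigram distribution of a snippet over a shared vocabulary."""
    counts = Counter(snippet.split())
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl(p, q):
    """KL(p || q); smaller means the distributions are more alike."""
    return sum(p[w] * log(p[w] / q[w]) for w in p)

a = "goal match striker club league goal"
b = "club match league striker goal win"
c = "piano concert symphony orchestra recital"
vocab = set((a + " " + b + " " + c).split())

pa, pb, pc = (word_dist(s, vocab) for s in (a, b, c))
print(kl(pa, pb) < kl(pa, pc))  # True: the two football snippets are closer
```

Since KL divergence is asymmetric, practical systems often use a symmetrized variant when comparing snippet pairs.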
[Fleischman et al., 2004] also use clustering for disambiguation, but take a different approach. First, the authors extract concept-instance pairs from the Web, e.g. Paul_Simon/Pop_Star, and then insert these pairs into a pre-built ontology. The insertion process must ensure that different mentions of the same entity are merged (e.g. Paul Simon/Pop Star and Paul Simon/Singer), while mentions with the same name but belonging to two different entities must be kept apart (e.g. Paul Simon/Pop Star differs from Paul Simon/politician). Then, the authors perform disambiguation in a two-step process: (1) train a Maximum Entropy model to produce probabilities constraining whether two concept/instance pairs refer to the same entity; (2) use a hierarchical clustering method to generate clusters, each corresponding to one entity.
[Xin Li et al., 2005] propose two approaches to this problem. (1) The authors perform local (pairwise) classification to decide whether two mentions represent the same real-world entity; they then perform clustering, with the similarity of each pair being the output of the classifier. (2) The second approach is to build a generative model, in which the authors model the process of generating documents and the way names (of different entities) are embedded in them. Basically, the authors make the following assumptions: (a) there exists a joint distribution over a set of entities (for example, a document mentioning President Kennedy is more likely to also mention "Oswald" or "White House" than "Roger Clemens"); (b) an "author model", which assumes that at least one mention of an entity in a document is easily recognizable, with the other mentions generated through (c) an "appearance model" that constrains how an explicit mention is transformed into a variant mention. With both approaches, the authors obtained results of around 90-95%.
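The first (pairwise-then-cluster) approach can be sketched as below. A real system would use a trained local classifier; the hand-written scoring stub and the 0.4 threshold here stand in for it, and both are assumptions made for illustration.

```python
def pair_score(m1, m2):
    """Stub pairwise classifier: Jaccard overlap of context words."""
    c1, c2 = set(m1["context"]), set(m2["context"])
    return len(c1 & c2) / max(len(c1 | c2), 1)

def cluster(mentions, threshold=0.4):
    """Group mentions whose pairwise score exceeds the threshold (union-find)."""
    parent = list(range(len(mentions)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if pair_score(mentions[i], mentions[j]) > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(mentions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

mentions = [
    {"name": "John Smith", "context": ["board", "chairman", "company"]},
    {"name": "John Smith", "context": ["chairman", "board", "meeting"]},
    {"name": "John Smith", "context": ["guitar", "album", "tour"]},
]
print(cluster(mentions))  # [[0, 1], [2]]
```

Note that simple transitive grouping like this can over-merge; the paper's clustering step instead uses the classifier outputs as a similarity matrix.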
6.3.2.2. A model for Vietnamese entity disambiguation

Some of the approaches above, although they yield good results, are relatively complicated to apply to Vietnamese due to the lack of resources and supporting tools. Through observing the data and surveying the methods, we make the following remarks:
- An entity has certain domains of activity. For example, documents about Van Quyen relate to Sports and Law, while documents about Hong Son relate only to Sports.
- An entity is constrained by other entities. For example, when mentioning Van Quyen, people often mention the SLNA football club, while Hong Son is usually mentioned together with The Cong.

The domain of activity of an entity must be determined from the keywords in the document, whereas the entity constraints must be determined through the named entities in that document. Based on these remarks, we propose a two-step model for Vietnamese entity disambiguation:
- Cluster the documents, with the support of domain information, into different clusters.
- For clusters in which the smallest pairwise similarity between documents falls below some threshold, build a generative model similar to that of [Xin Li et al., 2005] to resolve entity ambiguity within the cluster.

In practice, two entities that share both their name and their domain of activity are rare, so in most cases step (1) suffices for disambiguation. Within the scope of this project, we focus only on step (1), studying how clustering with lexical and domain-of-activity information affects entity disambiguation. Step (2) of the model will be completed gradually, along with the construction of a Vietnamese entity extraction and within-document coreference resolution system.
To obtain test data, we also collected from the Web articles about different people, such as Trinh Cong Son, Dam Thanh Son, and Hong Son, and then replaced these people's names with PERSON-X. From here on, we refer to this corpus as PERSON-X.
To identify domains in Vietnamese, we perform hidden topic analysis on a very large background dataset (note that this dataset differs from the PERSON-X corpus) using Latent Dirichlet Allocation (LDA) [20][18]. This analysis is unsupervised, so it does not require many data resources or processing tools, which makes it particularly suitable for Vietnamese. The output of this step is a topic-based generative model of documents, expressed through the probability distributions of words over the hidden topics. Based on this model, we perform topic analysis of the PERSON-X corpus. Next, the topic information is combined with the lexical information of the documents to perform clustering using Hierarchical Agglomerative Clustering. The method is summarized in the following figure.
[Figure 2: Model for recognizing and resolving Vietnamese entity ambiguity. The entity document collection is enriched via Topic Analysis (documents represented by both terms and hidden topics); the enriched data is then fed into HAC to disambiguate the collection into entity clusters.]
Based on the estimated topic model, we perform topic analysis of the document set whose entities need disambiguation. After this step, for each document di we have two vectors: (1) a term vector in which the weight of each term is computed by TF; (2) a topic vector in which the weight of each topic expresses the probability that the document belongs to that topic.
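The enriched representation can be sketched as a topic part concatenated with a TF term part. The topic distribution below is a made-up stand-in for real LDA inference output, used only to show the shape of the vector.

```python
from collections import Counter

def combined_vector(topic_dist, tokens, vocab):
    """d_i = {t_1..t_K, w_1..w_|V|}: topic weights followed by term frequencies."""
    tf = Counter(tokens)
    return topic_dist + [tf[w] for w in vocab]

vocab = ["google", "search", "football", "goal"]
topic_dist = [0.7, 0.2, 0.1]            # document's mixture over K = 3 topics
tokens = ["google", "search", "google"]
print(combined_vector(topic_dist, tokens, vocab))
# [0.7, 0.2, 0.1, 2, 1, 0, 0]
```

In the model below, the two parts are compared separately (one cosine per part) rather than as one flat vector, so that each part can be weighted.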
[Figure 3: Term part and topic part corresponding to an input document. Here the number of occurrences of a topic is proportional to its weight in the returned topic distribution. The example document is the sentence "Analyst Mark Mahaney of Citigroup argues that as Google continues to gain market share, a Yahoo-Microsoft deal is needed to limit Google's growth." Its term part consists of the segmented keywords (nha phan_tich, Mark Mahaney, CitiGroup, Google, tiep_tuc, han_che, phat_trien, yahoo, Microsoft, google) and its topic part of the inferred topics (topic_30, topic_91, topic_91, topic_91, topic_98), where Topic 91 is characterized by words such as dich_vu (service), mang (network), thong_tin (information), phan_mem (software), web, internet, dien_thoai (telephone), khach_hang (customer), tim_kiem (search), ung_dung (application), google.]
Combining these two parts, we obtain the vector corresponding to document d_i as follows:

d_i = {t_1, t_2, ..., t_K, w_1, ..., w_|V|}

where t_k is the weight of the k-th topic and w_i is the weight of the corresponding keyword.
We compute the topic-part similarity and the term-part similarity between two documents d_i and d_j as follows:

sim_{d_i,d_j}(topic-parts) = ( \sum_{k=1}^{K} t_{i,k} \cdot t_{j,k} ) / ( \sqrt{\sum_{k=1}^{K} t_{i,k}^2} \cdot \sqrt{\sum_{k=1}^{K} t_{j,k}^2} )

sim_{d_i,d_j}(term-parts) = ( \sum_{t=1}^{|V|} w_{i,t} \cdot w_{j,t} ) / ( \sqrt{\sum_{t=1}^{|V|} w_{i,t}^2} \cdot \sqrt{\sum_{t=1}^{|V|} w_{j,t}^2} )

Combining these two values gives the final similarity between the two documents d_i and d_j:

sim(d_i, d_j) = \lambda \cdot sim(topic-parts) + (1 - \lambda) \cdot sim(term-parts)
Here \lambda is a control parameter; in practice, values of \lambda of about 20-30% work best. With this similarity measure, we perform entity disambiguation using the HAC (Hierarchical Agglomerative Clustering) method.
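The combined similarity and a minimal agglomerative clustering loop over it can be sketched as follows. Lambda = 0.3 follows the 20-30% range stated above; the four toy documents (two "tech", two "football") and the 0.5 stopping threshold are invented for illustration.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_sim(d1, d2, lam=0.3):
    """sim = lambda * sim(topic parts) + (1 - lambda) * sim(term parts)."""
    return lam * cosine(d1[0], d2[0]) + (1 - lam) * cosine(d1[1], d2[1])

def hac(docs, stop_threshold):
    """Average-linkage agglomerative clustering: keep merging the most
    similar pair of clusters while their similarity exceeds the threshold."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = sum(combined_sim(docs[i], docs[j])
                          for i in clusters[a] for j in clusters[b])
                sim /= len(clusters[a]) * len(clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        if best < stop_threshold:
            break
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters

# (topic part, term part) for four toy documents
docs = [
    ([0.9, 0.1], [2, 1, 0, 0]),
    ([0.8, 0.2], [1, 2, 0, 0]),
    ([0.1, 0.9], [0, 0, 2, 1]),
    ([0.2, 0.8], [0, 0, 1, 2]),
]
print(hac(docs, stop_threshold=0.5))  # [[0, 1], [2, 3]]
```

Each resulting cluster is then taken to correspond to one real-world entity.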
(b) Experiments

Data description

The PERSON-X corpus contains the following source entities (before their replacement by PERSON-X):
Source entity      Count
Bui Tien Dung      8
Bui Xuan Phai      25
Dang Thai Son      17
Huynh Hong Son     15
Le Hong Son        7
Nguyen Hong Son    25
Phan Van Khai      19
Trinh Cong Son     41
Pham Van Quyen     27
Total              184

Table 1: List of entities in the test corpus
We experimented with two sets: (1) the first set consists of the source entities {Bui Tien Dung, Dang Thai Son, Huynh Hong Son, Le Hong Son, Nguyen Hong Son, Phan Van Khai, Trinh Cong Son}; (2) the second set consists of two source entities, {Pham Van Quyen, Nguyen Hong Son}.

A segmented passage mentioning Van Quyen, where PERSON-X replaces Pham Van Quyen, Quyen, or Van Quyen in the original text:

[Uoc mong] [da bong] [o] [PERSON-X] [sau] [4] [thang] [trong] [tu] [van] [chay bong] , [nhung] [lanh dao] [SLNA] [van] [chua] [thong nhat] [chu truong] [cho] [anh] [tra lai] [tap luyen] [cung] [ca] [doi] . [PERSON-X] [van] [phai] [cho] . [Hom nay] , [lanh dao] [tinh] [va] [So] [TDTT] [moi] [ngoi] [lai] [voi] [lanh dao] [CLB] [SLNA] [de] [quyet dinh] [viec] [co] [cho] [PERSON-X] [quay lai] [tap luyen] [hay] [khong] .

A segmented passage mentioning Hong Son:

[PERSON-X] [khi] [giai nghe] [cung] [chang] [bot] [ban ron] [di] [chut nao] . [Khong con] [gan bo] [voi] [The Cong] [trong] [tu cach] [cau thu] , [anh] [van] [la] [khuon mat] [quen thuoc] [khi] [tro thanh] [HLV] [doi tre] . [Vua] [sang] [sang] , [chieu] [chieu] [lam] [thay] , [chang] [tien ve] [tai hoa] [ngay nao] [con] [quyet tam] [hoc] [xong] [dai hoc] [TDTT] [Tu Son] , [va] [thang] [11] [sap] [toi] [se] [tot nghiep] [not] [dai hoc] [Luat] . [Rieng] [chuyen] [truong Luat] , [cung] [co] [nhieu] [y kien] [thac mac] [khi]
We evaluate the experimental results using the B-CUBED method [1] with three measures: Precision, Recall, and F1-Measure (the combination of the two). Larger values of these measures indicate better disambiguation quality.
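The B-CUBED scoring scheme [1] can be sketched as follows: for each mention, precision is the fraction of its system cluster that is truly coreferent with it, and recall is the fraction of its gold cluster that the system recovered; both are averaged over all mentions. The toy labels below are invented for illustration.

```python
def b_cubed(system, gold):
    """system, gold: lists of cluster labels, one per mention."""
    n = len(system)
    prec = rec = 0.0
    for i in range(n):
        sys_cluster = {j for j in range(n) if system[j] == system[i]}
        gold_cluster = {j for j in range(n) if gold[j] == gold[i]}
        correct = len(sys_cluster & gold_cluster)
        prec += correct / len(sys_cluster)
        rec += correct / len(gold_cluster)
    p, r = prec / n, rec / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: mention 2 is wrongly merged into entity A's cluster.
p, r, f1 = b_cubed(system=["A", "A", "A", "B"], gold=["A", "A", "B", "B"])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.75 0.706
```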
[Figure 4: Experimental results on the first entity set without hidden topic analysis (Lambda=0.2). Precision, Recall, and F1-Measure are plotted against the clustering threshold (0.01-0.14); the best F1 is about 0.76.]
[Figure 5: Experimental results on the first entity set with hidden topic analysis. Precision, Recall, and F1-Measure are plotted against the clustering threshold (0.01-0.14).]
[Figure 6: Disambiguation results without hidden topic analysis on the set of two source entities, the footballers Hong Son and Van Quyen. Precision, Recall, and F1 are plotted against the clustering threshold (0.01-0.1).]
[Figure 7: Disambiguation results with hidden topic analysis on the set of two source entities, the footballers Hong Son and Van Quyen (Lambda=0.3). Precision, Recall, and F1 are plotted against the clustering threshold (0.01-0.1).]
The results above show that even with purely unsupervised methods, the quality of cross-document entity disambiguation is already quite good. The advantage of this direction is that it does not require large training-data resources or language processing tools, which makes it particularly well suited to the current state of Vietnamese language processing. When combined with hidden topic analysis, the disambiguation results improve considerably: on entity set (1), the best F1 increases by about 4% (from 76% to 80%); on entity set (2), the best F1 increases by about 7% (from 80% to 87%), even though the two entities share a domain of activity (in this case, both Van Quyen and Hong Son are footballers).
6.4. Location, time, and research methods

The project was carried out over one year, from May 2007 to May 2008, at the Department of Information Systems, Faculty of Information Technology, and at the targeted laboratory "Knowledge Technology and Human-Machine Interaction".
- Collect and survey related materials from the Internet and from partner organizations in the fields of linguistics and natural language processing
- Combine technological and theoretical research
- Organize seminars and participate in conferences and workshops related to natural language processing
6.5. Research results

6.5.1. Publications related to the project

- One scientific paper being prepared for submission to the journal ACM TALIP:
  • Cam-Tu Nguyen, Xuan-Hieu Phan, Thu-Trang Nguyen, Susumu Horiguchi, Quang-Thuy Ha (2008). Web Search Clustering and Labeling with Hidden Topics. ACM Transactions on Asian Language Information Processing (to be submitted)
- One scientific paper submitted to an international scientific conference:
  • Dieu-Thu Le, Cam-Tu Nguyen, Xuan-Hieu Phan, Quang-Thuy Ha, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising. The 2008 IEEE/WIC/ACM International Conference on Web Intelligence, Sydney, Australia, December 2008 (accepted)
- One scientific paper presented at a national scientific workshop, with full text submitted:
  • Le Dieu Thu, Tran Thi Ngan, Nguyen Cam Tu, Nguyen Thu Trang. Building an Ontology to support semantic search in the health domain. The 11th National Workshop "Some selected issues of ICT and Communications", Hue, 12-14/6/2008
- Several reports on entity extraction and entity disambiguation presented at the targeted laboratory "Knowledge Technology and Human-Machine Interaction"
6.5.2. Training results of the project

(Research content in dissertations, theses, graduation theses, and student research works closely tied to the research content of the project)
- One Master's thesis by PhD student Nguyen Cam Tu, "Hidden Topic Discovery Toward Classification and Clustering", successfully defended in May 2008.
- One undergraduate thesis by student Le Dieu Thu, "On the Analysis of Large Scale Dataset toward Contextual Advertisement", successfully defended in June 2008.
6.5.3. Application results of the project

- A corpus for Vietnamese entity extraction and entity disambiguation.
- A Vietnamese entity extraction and disambiguation module, written in Java and tested on the above corpus.
6.6. Discussion

Entity disambiguation is one of the important problems in entity-oriented search, a very topical research direction in recent years. The research content of this project is in line with research worldwide, and we have therefore submitted a paper related to parts of the project to an international scientific conference.
6.7. Conclusion

Comparing the stated objectives with the results achieved above, the project has met its objectives:
- A module and an experimental corpus for the entity disambiguation problem were built.
- During the project, collaborators published research results in the form of theses, graduation theses, and papers at national and international scientific conferences.
6.8. References

[1]. A. Bagga and B. Baldwin, "Entity-Based Cross-Document Coreferencing using the Vector Space Model", Proc. 36th Annual Meeting of the Association for Computational Linguistics (ACL) and 17th Conf. on Computational Linguistics (COLING), San Francisco, California, pp. 79-85, Aug. 1998.
[2]. C.H. Gooi and J. Allan, "Cross-Document Coreference on a Large Scale Corpus", Proc. Human Language Technology/North American chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL), Boston, USA, May 2004.
[3]. G.S. Mann and D. Yarowsky, "Unsupervised Personal Name Disambiguation", Proc. 7th Conf. on Natural Language Learning (CoNLL), Edmonton, Canada, pp. 33-40, May 2003.
[4]. M.B. Fleischman and E. Hovy, "Multi-Document Person Name Resolution", Proc. 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, Barcelona, Spain, Jul. 2004.
[5]. X. Li, P. Morie, and D. Roth, "Semantic Integration over Text: From Ambiguous Names to Identifiable Entities", AI Magazine, Vol. 26, No. 1, pp. 45-58, 2005.
[6]. J.R. Hobbs, "Resolving pronoun references", Lingua, vol. 44, pp. 311-338, 1978.
[7]. R. Mitkov, "Robust pronoun resolution with limited knowledge", Proc. 36th Annual Meeting of the Association for Computational Linguistics (ACL) and 17th International Conf. on Computational Linguistics (COLING), Montreal, Quebec, Canada, pp. 869-875, Aug. 1998.
[8]. J.F. McCarthy and W.G. Lehnert, "Using Decision Trees for Coreference Resolution", Proc. 14th International Joint Conf. on Artificial Intelligence (IJCAI), Quebec, Canada, pp. 1050-1055, Aug. 1995.
[9]. W.M. Soon, H.T. Ng, and D.C.Y. Lim, "A Machine Learning Approach to Coreference Resolution of Noun Phrases", Computational Linguistics, vol. 27, no. 4, pp. 521-544, 2001.
[10]. V. Ng and C. Cardie, "Improving Machine Learning Approaches to Coreference Resolution", Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, USA, pp. 104-111, Jul. 2002.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[11]. WordNet.
[12]. A. Borthwick, "A Maximum Entropy Approach to Named Entity Recognition", Ph.D. Thesis (1999), Dept. of Computer Science, New York University.
[13]. T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity - Measuring the Relatedness of Concepts", Proc. Human Language Technology/North American chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL), Boston, USA, pp. 38-41, May 2004.
[14]. Google Web APIs.
[15]. A. Bagga and B. Baldwin, "Algorithms for Scoring Coreference Chains", Proc. the Linguistic Coreference Workshop at the first Conf. on Language Resources and Evaluation (LREC), Granada, Spain, pp. 563-566, May 1998.
[16]. Nguyen Cam Tu, "JVnTextPro: A Java-based Vietnamese Text Processing Toolkit".
[17]. Nguyen Cam Tu, "JGibbsLDA: A Java and Gibbs Sampling based Implementation of Latent Dirichlet Allocation (LDA)".
[18]. Nguyen Cam Tu, Master Thesis, College of Technology, Vietnam National University.
[19]. Phan Xuan Hieu, "GibbsLDA++: A C/C++ and Gibbs Sampling based Implementation of Latent Dirichlet Allocation (LDA)", 2007.
[20]. Blei, D.M., Ng, A.Y., and Jordan, M.I. (2003), "Latent Dirichlet Allocation", Journal of Machine Learning Research 3, pp. 993-1022.
7. APPENDIX

Technical appendices related to the content of the project:
o Scientific papers:
  • Cam-Tu Nguyen, Xuan-Hieu Phan, Thu-Trang Nguyen, Susumu Horiguchi, Quang-Thuy Ha (2008). Web Search Clustering and Labeling with Hidden Topics. ACM Transactions on Asian Language Information Processing (to be submitted)
  • Dieu-Thu Le, Cam-Tu Nguyen, Xuan-Hieu Phan, Quang-Thuy Ha, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising. The 2008 IEEE/WIC/ACM International Conference on Web Intelligence, Sydney, Australia, December 2008 (accepted)
  • Le Dieu Thu, Tran Thi Ngan, Nguyen Cam Tu, Nguyen Thu Trang. Building an Ontology to support semantic search in the health domain. The 11th National Workshop "Some selected issues of ICT and Communications", Hue, 12-14/6/2008
o Copies of the Master's thesis of PhD student Nguyen Cam Tu and the undergraduate thesis of Le Dieu Thu
o Copy of the approved research proposal and the project contract
o Summary of the project results in English
o Form 16/KHCN/DHQGHN
Web Search Clustering and Labeling with Hidden
Topics
CAM-TU NGUYEN, THU-TRANG NGUYEN, QUANG-THUY HA
College of Technology, Vietnam National University
and
XUAN-HIEU PHAN, SUSUMU HORIGUCHI

Graduate School of Information Sciences, Tohoku University.
Web search clustering is a solution to reorganize search results (also called "snippets") in a more convenient way for browsing. There are three key requirements for such post-retrieval clustering systems: (1) The clustering algorithm should group similar documents together; (2) Clusters should be labeled with descriptive phrases; and (3) The clustering system should provide high quality clustering without downloading the whole Web pages.
This paper introduces a novel framework for clustering web search results in Vietnamese which
targets at three above issues. The main motivation is that by enriching short snippets with hidden
topics from huge resources of documents on the Internet, it is able to cluster and label such snippets
effectively in a topic-oriented manner without concerning the whole Web pages. Our approach is
based on recent successful topic analysis models, such as Probabilistic-Latent Semantic Analysis,
or Latent Dirichlet Allocation. The underlying idea of the framework is that we collect a very large
external data collection called "Universal Dataset", and then build a clustering system on both
the original snippets and a rich set of hidden topics discovered from the universal data collection.
This can be seen as a richer representation of snippets to be clustered. The general framework is
flexible and general enough to be applied to a wide range of domains and languages. We carried
out careful evaluation of our method on real search results from Google and show that our method
can yield impressive clustering quality.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; I.2.7 [Artificial Intelligence]: Natural Language Processing—language models; text analysis
General Terms: Text/Web Mining, Languages, Learning
Additional Key Words and Phrases: Latent Dirichlet Allocation, Hidden Topics Analysis, Viet-
namese, Web Search Clustering, Cluster Labeling, Collocation, HAC
Author's address: Cam-Tu Nguyen, PhD candidate at Graduate School of Information Sciences, Tohoku University, Japan; College of Technology, Vietnam National University, Vietnam; email: ; Xuan-Hieu Phan, ; Susumu Horiguchi, ; Thu-Trang Nguyen, ; Quang-Thuy Ha, .

Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for profit or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to
post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2008 ACM 0164-0925/99/0100-0111 $00.75
ACM Transactions on Asian Language Processing, Vol. , No. , July 2008, Pages 1-??.
Web Search Clustering and Labeling with Hidden Topics • 3
A disadvantage is that repeatedly querying search engines is quite time-consuming and not suitable for real-time applications. Another solution is to exploit online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources (Banerjee et al. 2007; Schonhofen 2006; Gabrilovich and Markovitch 2007).
Inspired by the idea of using external data sources mentioned above, we present
a general framework for clustering and labeling with hidden topics discovered from
a large-scale data collection. This framework is able to deal with the shortness of
snippets as well as provide better topic-oriented clustering results. The underlying
idea is that we collect a large collection, which we call the "universal dataset", and
then do topic estimation for it based on recent successful topic models such as pLSA
[Hofmann 1999] or LDA [Blei et al. 2003]. Based on the estimated model, we perform
topic inference for search results to obtain their intended topics. The topics are
then combined with the original snippets to create an expanded, richer representation.
Exploiting one of the similarity measures (such as the widely used cosine coefficient),
we can then apply any similarity-based clustering method, such as
HAC or K-means [Kotsiantis and Pintelas 2004], to cluster the enriched snippets. The
main advantages of the framework include the following points:
—Reducing data sparseness: different word choices make snippets that are on
the same topic less similar; hidden topics make them more related than the
originals. Including hidden topics in measuring similarity helps both reduce
sparseness and make the data more topic-focused.
—Reducing data mismatching: some snippets sharing unimportant words, which
could not be removed completely in the stop-word removal phase, are likely to be close in
similarity. By taking hidden topics into account, the similarities of such snippets
are decreased in comparison with other pairs of snippets. As a result, this goes
beyond the limitation of shallow word/lexicon-based matching.
—Providing informative and meaningful labels: traditional labeling methods as-
sume that terms/phrases repeated in a cluster are highly likely to be clus-
ter labels. This is true but not enough. In this work, we use the similarity
between the topics of terms/phrases and those of the cluster as an important
feature to determine the most suitable label, providing more descriptive labels.
—Easy to implement: the framework is simple to implement. All we need to pre-
pare is a large-scale data collection to serve as the universal dataset; the topics
discovered from that dataset are then exploited as additional knowledge in order
to measure the similarity between snippets.
—Easy to reuse: the remarkable point of this framework is the hidden topic analysis
of a large collection. This is a totally unsupervised process, but it still takes time for
estimation. However, once estimated, the topic model can be applied to more than
one task: not only clustering but also classification, contextual matching,
etc.
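As a concrete illustration of the enrichment-and-clustering idea above, the sketch below combines a snippet's term counts with its inferred hidden-topic weights (scaled by a factor lam, mirroring the term/topic combination described here) and clusters the enriched vectors with a minimal single-link HAC. The function names, the toy topic distributions, and the threshold are hypothetical stand-ins; the actual pipeline uses LDA inference over the universal dataset and its own HAC implementation.

```python
from collections import Counter
from math import sqrt

def enrich(snippet_terms, topic_dist, lam=0.2):
    """Combine a snippet's term counts with its inferred hidden-topic
    weights (scaled by lam) into a single sparse feature vector."""
    vec = Counter(snippet_terms)
    for topic_id, weight in topic_dist.items():
        vec[("topic", topic_id)] = lam * weight
    return vec

def cosine(u, v):
    """Cosine coefficient between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hac(vectors, threshold=0.3):
    """Minimal single-link agglomerative clustering: start from singleton
    clusters, repeatedly merge the closest pair while its similarity
    stays at or above the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):  # single link: best similarity between members
        return max(cosine(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        sim, a, b = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        if sim < threshold:
            break
        clusters[a].extend(clusters.pop(b))
    return clusters
```

For example, two car-related "jaguar" snippets enriched with a shared hidden topic end up in one cluster, while a snippet with no shared words or topics stays apart.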
Also,
the framework is general enough to be applied to many different languages
and clustering methods. In this paper, we performed a careful evaluation for clus-
tering search results in Vietnamese with the universal dataset containing several
¹Open Directory Project:
techniques to enrich data that need to be clustered. [Osinski 2003] used LSI to dis-
cover concepts in the collection of search snippets. [Ngo 2003] provided an enriched
representation by exploiting the Tolerance Rough Set Model (TRSM). With TRSM, a
document is associated with a set of tolerance classes. In this context, a tolerance
class represents a concept that is characterized by the terms it contains. For example,
{jaguar, OS, X} and {jaguar, cars} are two tolerance classes discovered from the col-
lection of search results returned by Google for the query "jaguar". [Banerjee et al.
2007] extracted titles of Wikipedia articles and used them as features for clustering
short texts. Toward measuring the similarity between short texts, [Bollegala et al.
2007] used search engines to measure the semantic relatedness between words. [Sa-
hami and Heilman 2006] also measured the relatedness between text snippets by
using search engines and a similarity kernel function. [Metzler et al. 2007] evaluated
a wide range of similarity measures for short queries from Web search logs. [Yih and
Meek 2007] considered this problem by improving Web-relevance similarity and the
method in [Sahami and Heilman 2006]. [Gabrilovich and Markovitch 2007] computed
semantic relatedness for texts using Wikipedia concepts.
In order to meet the snippet-tolerance condition and take advantage of external
data sources, we focus on enriching snippets with hidden topics discovered from a
large external collection using topic analysis models, such as Probabilistic Latent
Semantic Analysis (pLSA) [Hofmann 1999], Latent Dirichlet Allocation (LDA) [Blei
et al. 2003], the Dynamic Topic Model (DTM) [Blei and Lafferty 2006], or the Correlated
Topic Model (CTM) [Blei and Lafferty 2007]. The idea of such enrichment using
latent topic models originates from the recent work of [Phan et al. 2008], but
instead of building a classifier we apply it to the problem of search result clustering
and labeling in Vietnamese. In comparison with previous enriching techniques
[Osinski 2003; Ngo 2003; Cai and Hofmann 2003], the difference of our proposal is
that those methods only discover semantic relationships within the data to be processed
rather than in an external collection. In contrast to the studies of Banerjee [2007], Bollegala
[2007], etc., we approach this issue from the point of view of text/web data analysis
techniques, which have shown a lot of success recently [Hofmann 2004; Bhattacharya
and Getoor 2006; Griffiths and Steyvers 2004; Wei and Croft 2006].
3. THE GENERAL FRAMEWORK
In this section, we present the proposed framework that aims at building a cluster-
ing system with hidden topics from large-scale data collections. The framework is
depicted in Figure 1 and consists of six major steps.
Among the six steps, choosing the right universal dataset (a) is probably the most
important one. The universal dataset, as its name suggests, must be large and
rich enough to cover a lot of words, concepts, and topics that are relevant to the
domain of application. Moreover, the vocabulary of the dataset should be consistent
with the future unseen data that we will deal with. This implies the flexibility of the
external data collection as well as of our framework. The dataset should also be
pre-processed to exclude noise and non-relevant words so that phase (b) can achieve
good results. More details of steps (a) and (b) for a specific collection in Vietnamese
will be discussed in Section 5. Along with performing topic analysis, we also
exploit the dataset to find collocations (c) (see Section 6.3.1). The collocations are
then used for labeling clusters in (f).
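Collocation discovery over the universal dataset, step (c), can be approximated with a standard pointwise mutual information (PMI) scorer over adjacent word pairs: pairs that co-occur far more often than chance are likely collocations. This is an illustrative stand-in under that common formulation, not necessarily the exact method of Section 6.3.1.

```python
from collections import Counter
from math import log

def collocations(docs, min_count=2):
    """Rank adjacent word pairs in a document collection by PMI:
    log of the ratio between the pair's observed probability and the
    product of its words' individual probabilities."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs with unreliable statistics
        p_pair = count / total_bi
        p1 = unigrams[w1] / total_uni
        p2 = unigrams[w2] / total_uni
        scores[(w1, w2)] = log(p_pair / (p1 * p2))
    return sorted(scores, key=scores.get, reverse=True)
```

On a toy collection where "new york" recurs, that pair is ranked above incidental pairs such as "is big", which is the behavior the labeling step in (f) relies on.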
Fig. 2. The generative process of LDA
Probabilistic Latent Semantic Analysis (pLSA) [Hofmann 1999] was the successive
attempt to capture semantic relationships within text. It relies on the idea that
each word in a document is sampled from a mixture model, where the mixture com-
ponents are multinomial random variables that can be viewed as representations of
"topics". Consequently, each word is generated from a single topic, and different
words in a document may be generated from different topics.
While Hofmann's work is a useful step toward probabilistic text modeling, it
suffers from severe overfitting problems [Heinrich 2005]. The number of parameters
grows linearly with the number of documents. Additionally, although pLSA is a
generative model of the documents in the estimated collection, it is not a generative
model of new documents. Latent Dirichlet Allocation (LDA), first introduced by
Blei et al. [2003], is one solution to these problems. In general, it is a generative
model that can be used to estimate multinomial observations by unsupervised
learning. Although some other topic modeling methods have been proposed
recently, such as the Dynamic Topic Model [Blei and Lafferty 2006] and the Correlated Topic
Model [Blei and Lafferty 2007], these models are more complex than LDA. This is
why we choose LDA for the topic analysis step in our proposal. More details about
LDA are given in the subsequent sections.
4.1 Latent Dirichlet Allocation (LDA)
LDA is a generative graphical model, as shown in Figure 2. It can be used to model
and discover the underlying topic structures of any kind of discrete data, of which text
is a typical example. LDA was developed based on an assumed document
generation process, depicted in both Figure 2 and Table I. This process can be
interpreted as follows.
In LDA, a document w_m = {w_{m,n}} (n = 1..N_m) is generated by first picking a distri-
bution over topics ϑ_m from a Dirichlet distribution Dir(α), which determines the
topic assignment for words in that document. Then the topic assignment for each
word placeholder (m, n) is performed by sampling a particular topic z_{m,n} from the
multinomial distribution Mult(ϑ_m). And finally, a particular word w_{m,n} is gen-
erated for the word placeholder (m, n) by sampling from the multinomial distribution
Mult(φ_{z_{m,n}}).
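This three-stage generative story can be simulated directly. In the sketch below, the Dirichlet draw is realized through normalized Gamma samples (a standard construction), and the topic-word distributions phi are assumed to be given rather than estimated; the vocabulary and parameters are illustrative only.

```python
import random

def sample_dirichlet(alpha):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs):
    """Sample an index from a discrete (multinomial, n=1) distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def generate_document(alpha, phi, vocab, length):
    """LDA's generative process for one document:
    1) draw a topic distribution theta ~ Dir(alpha);
    2) for each word placeholder, draw a topic z ~ Mult(theta);
    3) draw the word w ~ Mult(phi[z])."""
    theta = sample_dirichlet(alpha)
    words = []
    for _ in range(length):
        z = sample_discrete(theta)
        w = sample_discrete(phi[z])
        words.append(vocab[w])
    return words
```

With a small, symmetric alpha, theta concentrates on few topics, so a generated document tends to draw most of its words from one or two topic-word distributions.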
From the generative graphical model depicted in Figure 2, we can write the joint
distribution of all known and hidden variables given the Dirichlet parameters as
follows.
