Tạp chí Tin học và Điều khiển học (Journal of Computer Science and Cybernetics), Vol. 20, No. 4 (2004), 293-304
A SOLUTION FOR FINDING SIMILAR WEB PAGES IN THE VIETSEEK SEARCH ENGINE
PHẠM THỊ THANH NAM, BÙI QUANG MINH, HÀ QUANG THỤY
Faculty of Technology, Vietnam National University, Hanoi
Abstract. This article describes several proposals for upgrading the search function of Vietseek by adding a vector representation of web pages. It proposes a vector representation for web pages, a formula for calculating the components of the vector, a "text-based similarity" measure between two web pages, and algorithms for finding the pages that are text-based similar to a given web page. The implementation of these proposals in Vietseek is also described.
Tóm tắt (Summary). This paper presents several proposals for upgrading the search function of the Vietnamese search engine Vietseek by adding a vector representation of web pages. A vector representation method for web pages, a formula for computing the components of the representation vector, a "content similarity" measure between two web pages, and an algorithm for finding the web pages similar to a given web page are proposed. The way these proposals are implemented in the Vietseek search engine is also presented.
1. INTRODUCTION
Text mining, and web mining in particular, currently attracts many organizations and researchers, both in research and in deployment, and the results of many studies have been published (see the reference list). Typical problems in web mining include web page representation, processing (search, classification, rule discovery, and so on), and web-site mining. The vector model is the typical and most widely used model for representing text. There are many ways to determine the values of the components of the representation vector. Text processing solutions are usually closely tied to the chosen representation. Even so, for a given text representation many different processing solutions can be used; for example, with the same vector representation one can use many classification algorithms based on the Bayes approach, k nearest neighbors (k-NN), decision trees, and others.
Search engines, typified by Yahoo, Google, and Altavista, are very useful search tools when working on the Internet. Because they are oriented toward solving the search problem, the way search engines represent web pages has some distinctive features. On the other hand, current search engines pay little attention to web mining solutions other than search.
In this paper, we focus on upgrading the search function by adding a vector representation of web pages to Vietseek, the experimental Vietnamese search engine that we are researching and building.
Section 2 of the paper introduces some research related to the content of this paper. Section 3 introduces the basic structure and operation of the Vietseek search engine. The solutions proposed in this paper (the vector representation of web pages, the "content closeness" measure between two web pages, the formula for computing the components of the representation vector, and the algorithm for finding similar web pages) are presented in Section 4. Section 5 presents some implementation results in the Vietseek search engine and a discussion.
2. RELATED WORK
In [6], the authors presented some research results on text mining using the vector model. Solutions for synonyms and multiple languages, and an experiment with a decision-tree solution, were also presented in that paper. In [7], Sen Slattery gives a survey of methods for representing and processing hypertext, in particular classification algorithms (Bayes, k-NN, FOIL, etc.). Holger Billhardt, Daniel Borrajo and Victor Maojo [3], and Son Doan and Horiguchi [8], propose new representations that enrich the semantics of the text representation vector by taking into account the semantic dependence between keywords. Thorsten Joachims [9], and Hwanjo Yu, Jiawei Han and Kevin Chen-Chuan [4], present solutions that improve the quality of text processing in a user-oriented way. Martin Ester, Hans-Peter Kriegel and Matthias Schubert [5] introduce a solution for classifying the web sites of small companies, based on building a representation tree that uses the vector model. The other papers [1,2,7] supplement the ones above and give a more comprehensive view of current web mining.
3. THE VIETSEEK SEARCH ENGINE
Vietseek is a Vietnamese search engine that we developed from the open-source software ASPseek within project QG-02-02, and that was deployed in a pilot project of the TTVN Online network in cooperation with VDC1. In its initial version, Vietseek has the structure of an ordinary search engine. The operating model of Vietseek is shown in Figure 1.
Figure 1. Operating model of Vietseek (the diagram includes the Search Daemon component)
The database of web pages and the index are stored on the database server. The search module (Search Daemon) is a background process that works in client/server mode. It builds the list of URLs that satisfy the user's request, computes the display rank of all those pages according to four factors, groups them by site, and sorts them from top to bottom. The interface module (Web Server) takes the results returned by the search module, merges them, and displays them to the user as a web page.
When computing the rank of web pages, the damping factor d is set to 0.85 and the number of iterations is about 20 (for a few million pages).
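The paper gives only the damping factor (0.85) and the iteration count (about 20), not the ranking formula itself. The sketch below assumes the usual PageRank-style power iteration; the function name `page_rank`, the dictionary input format, and the handling of pages without outgoing links are illustrative assumptions, not Vietseek's actual code.

```python
def page_rank(links, d=0.85, iterations=20):
    """Iteratively compute link-based ranks.

    links: dict mapping a page to the list of pages it links to
           (hypothetical input format, chosen for illustration).
    d: damping factor (0.85 in the paper).
    iterations: number of passes (about 20 in the paper).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each pass with the "random jump" share.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = d * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads
                # its rank evenly over all pages.
                share = d * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    example = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for page, value in sorted(page_rank(example).items()):
        print(page, round(value, 4))
```

With this scheme the ranks always sum to 1, so a fixed number of passes can be used instead of a convergence test, which matches the fixed iteration count quoted above.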
Currently, Vietseek computes the display rank of a web page from the following four factors:
1. The position at which the keyword appears in the document.
2. The relative position of the keywords within the page.
3. The attribute of the keyword (whether the search word is placed in the tags H1, H2, ..., H5).
4. The rank value of the page.
The Vietseek database
The Vietseek database is divided into two parts. Part 1: data on page content, sites, and keywords, stored in the tables of a MySQL database. Part 2: the index data, which is stored separately and has its own structure. To achieve high processing speed, it does not use the MySQL database but is stored in separate binary files. The search process accesses only Part 2; Part 1 is accessed only when results are displayed. The details of how data is represented in the two parts are given below.
Part 1: Data stored in the tables of the MySQL database
* Information about sites is stored in table sites

Field name    Description
site_id       Identifier of the site
site          The site name itself (for example www.yahoo.com)
* Information about URLs (that is, about web pages) is stored in table urlword (this table holds information about all URLs, both those already indexed and those not yet indexed)

Field name         Description
url_id             Identifier of the URL (of the web page)
site_id            Identifier of the site that contains the page
deleted            Set to 1 if the server returned error 404, or if the rules configured for the program do not allow this page to be indexed; otherwise 0
url                The URL of the page
next_index_time    Time of the next indexing, in seconds
status             HTTP status returned by the server, or 0 if the page has not been indexed yet
crc                Checksum of the page (MD5 checksum)
last_modified      Check value from the page's "HTTP header", returned by the HTTP server
etag               Value of the "Etag header" returned by the HTTP server
last_index_time    Time of the previous indexing, in seconds
referrer           Identifier (url_id) of the first page that refers to this page
tag                A representative tag
hops               Depth of the page in the link tree
redir              Identifier (url_id) of the target if the current URL is redirected, or 0 if it is not redirected
origin             Identifier of the original page of which the current page is a copy; 0 if the page is not a copy
* Table wordurl holds information about each word in the database; each record corresponds to one word

Field name    Description
word          The keyword
word_id       Identifier of the keyword
urls          Information about the sites and URLs in which the word appears. If this information is larger than 1000 bytes, the field is empty and the information is stored in a separate file named wordurl.urls
urlcount      Total number of web pages (URLs) that contain the keyword
totalcount    Total number of occurrences of the keyword in all web pages (URLs)
* Table citation (holds the inverted index of hyperlinks)

Field name    Description
url_id        Identifier of the URL
referrers     An array of the url_id values of the pages that link to this page
Part 2: Index data stored in binary files
Structure of file wordurl.urls (this file stores information about the sites and URLs in which a keyword appears; if this information fits within 1000 bytes, it is stored in field urls of table wordurl instead):
Information about the sites, sorted by site_id

Offset         Length    Description
0              4         Offset at which the information for the first site containing the word begins
4              4         Identifier of the first site containing the word
8              4         Offset at which the information for the second site containing the word begins
12             4         Identifier of the second site containing the word
...
(N-1)*8        4         Offset at which the information for the N-th site begins, where N is the total number of sites containing the word
(N-1)*8 + 4    4         Identifier of the N-th site containing the word
Information about the URLs, stored immediately after the site information. Offsets are counted from 0.

Offset           Length    Description
0                4         url_id of the first page of the first site listed in the site information
4                2         Total number of occurrences of the word in this URL
6                2         First position
8                2         Second position
...
6 + (N-1)*2      2         N-th position, where N is the total number of occurrences of the word in the URL

The same layout is repeated for the other URLs of the same site, in increasing order of url_id.
It is then repeated for the URLs of the next site listed in the site information.
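The per-URL record layout above (4-byte url_id, 2-byte count, then that many 2-byte positions) can be read back with a short loop. The sketch below is only an illustration: the paper does not state the byte order, so little-endian is an assumption, and the function name `parse_url_records` is hypothetical. It parses the URL block of a single site.

```python
import struct


def parse_url_records(buf):
    """Parse consecutive URL records for one site.

    Each record is assumed to be:
      4 bytes  url_id
      2 bytes  number of positions N
      N * 2 bytes  word positions
    Byte order is an assumption (little-endian, "<").
    Returns a list of (url_id, [positions]) tuples.
    """
    records = []
    pos = 0
    while pos < len(buf):
        url_id, count = struct.unpack_from("<IH", buf, pos)
        pos += 6  # 4-byte url_id + 2-byte count
        positions = list(struct.unpack_from("<%dH" % count, buf, pos))
        pos += 2 * count
        records.append((url_id, positions))
    return records


if __name__ == "__main__":
    # Two records: URL 7 with positions 5 and 19, URL 9 with position 3.
    data = struct.pack("<IH2H", 7, 2, 5, 19) + struct.pack("<IH1H", 9, 1, 3)
    print(parse_url_records(data))
```

Because positions are stored in 2 bytes, a record of this shape can only address word positions below 65536.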
4. CONTENT-BASED SEARCH ALGORITHM IN THE VIETSEEK SEARCH ENGINE
Because ASPseek is oriented toward keyword search, the main object of its representation is the keyword: information about the occurrences of keywords in pages is sorted by word_id and stored in binary files. This storage organization makes search fast and efficient.
Figure 2. Part of the Google search results for the phrase "Bui Quang Minh"
Current search engines usually let users enter queries in a very simple form consisting of one or a few keywords. As a result, a search engine usually returns a very large set of result pages that contain the query keywords. For that reason, a search engine needs a way of displaying the result pages so that pages with a higher rank are shown earlier. To compute the rank of a page, search engines usually use a formula that captures the relationship between the rank values of pages that link to one another. However, the display ranking problem still has open issues. For example, when a user asks Google to find the web pages that contain the phrase "Bui Quang Minh", the system shows a page that does not contain the phrase before a page that does contain it (Figure 2). How to let search engines accept more complex queries, represent more fully what the user is interested in, and return more accurate answers is therefore still an active research topic [3,5,6,8]. Google provides a "Similar pages" query, but in many cases the "similar" pages it displays differ considerably in content from the page under consideration (Figure 3). Below are our proposals for extending the query form and the search solution applied to Vietseek, by adding a function that finds the web pages "similar in content" to the page currently displayed to the user.
The notion of web pages being "similar in content" is defined through a "closeness" measure between web pages under a chosen web page representation. The search engine therefore needs to be extended with a new representation of web pages and with a closeness measure between web pages under that representation.
Figure 3. The "Similar pages" search results page of Google
4.1. Web page representation
With the aim of minimizing storage space and increasing search speed, we choose a new vector representation for web pages that also takes into account the content of linked neighboring pages.
In [7], Sen Slattery presents four methods for representing web pages in the vector model, of which the last three use the content of neighboring web pages. Through experiments, the author shows that the third method gives better results than the first (the representation that uses no information about links to other web pages). However, with such a representation the length of the vector that represents a web page doubles (because the vector is organized into two parts). This not only doubles the storage space required but also increases the computation time for classification and search by the same factor.
The second representation gives the occurrences of keywords in neighboring pages the same weight as the occurrences of keywords in the page under consideration. The last two representations distinguish the occurrence of a keyword in the current page from the occurrences of the same keyword in neighboring pages. However, the length of the representation vector grows quickly (it doubles in the third representation and grows many times over in the fourth). The improvement proposed in this paper reconciles the second representation with the last two.
The main points of our representation are:
- The size of the representation vector does not increase: it equals the number of keywords in the system.
- Weights are introduced to distinguish the occurrences of keywords in the page under consideration from those in its neighboring pages. In more detail, the weight differs for three kinds of neighboring pages: those with both outgoing and incoming links, those with only an outgoing link, and those with only an incoming link. For example, the page under consideration has coefficient 4, a neighboring page with links in both directions has coefficient 2, and a neighboring page of either of the last two kinds has coefficient 1.
- The representation vector is "normalized" in the sense that its components are integers and the sum of the components is a constant. Thus, for any representation vector X = (X_1, X_2, ..., X_N) we have X_1 + X_2 + ... + X_N = C (C is a constant; we choose C = 100, in the sense of a "percentage"). Besides being convenient for computation, this also means that the system does not treat web pages differently according to their length.
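The normalization rule above (integer components whose sum is the constant C = 100) leaves open how rounding is done so that the total is exactly C. The sketch below is one possible reading, using the largest-remainder method; that rounding choice and the function name `normalize_to_constant` are assumptions for illustration, not something the paper specifies.

```python
def normalize_to_constant(raw, total=100):
    """Scale non-negative integer weights so that they are integers
    summing exactly to `total` (C = 100 in the paper).

    Rounding rule (an assumption): take the integer part of each
    scaled value, then hand the missing units to the components
    with the largest remainders.
    """
    s = sum(raw)
    if s == 0:
        return [0] * len(raw)
    quotas = [x * total for x in raw]
    result = [q // s for q in quotas]      # integer parts
    leftovers = [q % s for q in quotas]    # exact remainders
    missing = total - sum(result)
    # Stable sort: ties keep their original order.
    order = sorted(range(len(raw)), key=lambda i: leftovers[i], reverse=True)
    for i in order[:missing]:
        result[i] += 1
    return result


if __name__ == "__main__":
    print(normalize_to_constant([8, 4, 2, 0]))
    print(normalize_to_constant([1, 1, 1]))
```

Working with integer quotients and remainders avoids floating-point rounding, so the components always add up to exactly the chosen constant.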
4.2. Determining the content closeness of web pages
As described above, the vector representation was chosen to capture as much as possible of the meaning of a web page's content. Here we define a measure of the "content similarity" of two web pages through a closeness measure between their two representation vectors. Given two vectors, we propose using the cosine of the angle between them as their closeness Sm [6]. Suppose the representation vectors are X = (X_1, X_2, ..., X_N) and Y = (Y_1, Y_2, ..., Y_N); then the closeness Sm(X, Y) of these two vectors is the cosine cos(X, Y) of the angle formed by X and Y, computed by formula (1):
$$\mathrm{Sm}(X, Y) = \cos(X, Y) = \frac{\sum_{i=1}^{N} X_i Y_i}{\sqrt{\sum_{i=1}^{N} X_i^2 \; \sum_{i=1}^{N} Y_i^2}} \qquad (1)$$
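Formula (1) translates directly into code. The sketch below is a minimal illustration using plain Python lists as dense vectors; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine_similarity(x, y):
    """Closeness Sm(X, Y) from formula (1): the cosine of the angle
    between two equally long vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # Assumption: an all-zero vector is treated as close to nothing.
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    page_a = [60, 40, 0]   # components sum to 100, as in Section 4.1
    page_b = [0, 40, 60]
    print(cosine_similarity(page_a, page_b))
```

Because all components are non-negative, the value always lies between 0 (no shared keywords) and 1 (same keyword proportions).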
In the Vietseek implementation, we compute the display rank of the close web pages as a combination of the closeness given by formula (1) and the rank value of the web page to be displayed (formula (3), after Algorithm 2 in Section 4.5).
4.3. Building representation vectors in the search engine
In a search engine, the index tables (content index, link index, inverted index, and so on) provide enough information to build the system of representation vectors. A brief description follows (the detailed algorithms for building the representation vectors are given in Section 4.5).
Building the unnormalized vector: the number of components equals the number of keywords in the system, and each component of the vector corresponds to a keyword through its WordID index. Suppose we are considering web page P and keyword W. Let n1 be the occurrence score of keyword W in P, n2 the total occurrence score of keyword W in all neighbors that have two-way links with P, and n3 the total occurrence score of keyword W in all the remaining neighboring web pages. The "occurrence score" of keyword W in a web page is the sum over the occurrences of W in that page, each weighted by a position coefficient (in the title, in an attribute tag, in a hyperlink, in the body of the page, and so on). This notion is similar to the "occurrence weight" (weight values for all of appearances) of keyword W in document D [6]. We compute the value n_W that corresponds to component W of the vector representing web page P as follows:
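$$n_W = \left[\frac{4 n_1 + 2 n_2 + n_3}{7}\right], \qquad N_W = \frac{100\, n_W}{\sum_{W} n_W} \quad \Big(\text{note that } \sum_{W} N_W = 100\Big) \qquad (2)$$

(The expression for n_W is the one used in step (2.3.4) of Algorithm 1.)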
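Step (2.3.4) of Algorithm 1 gives the combination rule n_W = [(4*n1 + 2*n2 + n3)/7]. The sketch below applies it to a few keywords. Reading the square brackets as the integer part is an assumption, and the names `combined_weight`, `raw_vector`, and the sample scores are hypothetical.

```python
def combined_weight(n1, n2, n3):
    """Combine occurrence scores with coefficients 4, 2 and 1.

    n1: score in the page itself
    n2: total score in neighbors linked in both directions
    n3: total score in the remaining neighbors
    The brackets in the paper are read as the integer part (assumption).
    """
    return (4 * n1 + 2 * n2 + n3) // 7


def raw_vector(scores):
    """scores: dict word_id -> (n1, n2, n3). Returns word_id -> n_W,
    keeping only keywords with a non-zero combined weight."""
    vector = {}
    for word_id, (n1, n2, n3) in scores.items():
        weight = combined_weight(n1, n2, n3)
        if weight:
            vector[word_id] = weight
    return vector


if __name__ == "__main__":
    sample = {101: (7, 0, 0), 102: (3, 2, 1), 103: (0, 1, 2)}
    print(raw_vector(sample))
```

Dropping zero weights keeps the stored vector sparse, which matters when the vector length equals the number of keywords in the system.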
Note that, when Vietseek is deployed for a specific organization, we intend to let the users of the system define a set of domain-specific keywords, so the representation vector is not long.
4.4. Implementation in Vietseek
To compute the total occurrence score (occurrence weight) of a keyword in a web page, the additional representation needs to treat the URL as a primary object. Starting from table urlword, which stores the information about URLs, we build the representation vector of each web page. The method is as follows: a new field named content_vector is added to table urlword; this field has the same type as field urls in table wordurl. It stores the representation vector of the web page whose identifier is stored in field url_id of the same table. The fields of table urlword are described in the following table (fields that are not relevant here are omitted):
Field name        Description
url_id            Identifier of the URL (of the web page)
site_id           Identifier of the site that contains the page
url               The URL of the page
content_vector    Information about the representation vector of the URL (empty if the information is larger than 1000 bytes, in which case it is stored in a binary file named urlword.content_vector)
The structure of file urlword.content_vector is as follows:

Information about the words that appear in the URL, sorted by word_id

Position    Length    Description
0           4         word_id (identifier of the first word that appears in the URL)
4           2         Weight of the first word that appears in the URL
6           4         word_id (identifier of the second word that appears in the URL)
10          2         Weight of the second word that appears in the URL
...
The same layout is repeated for the following words that appear in the URL.
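The record layout of urlword.content_vector (a 4-byte word_id followed by a 2-byte weight, sorted by word_id) can be packed and unpacked as sketched below. The byte order is not given in the paper, so little-endian is an assumption, and the helper names `pack_vector` and `unpack_vector` are hypothetical.

```python
import struct

# One record: 4-byte unsigned word_id + 2-byte unsigned weight.
# "<" (little-endian, no padding) is an assumption.
RECORD = struct.Struct("<IH")


def pack_vector(weights):
    """weights: dict word_id -> weight. Records are written in
    increasing word_id order, as the file layout requires."""
    return b"".join(RECORD.pack(word_id, weight)
                    for word_id, weight in sorted(weights.items()))


def unpack_vector(buf):
    """Inverse of pack_vector: returns dict word_id -> weight."""
    return {word_id: weight for word_id, weight in RECORD.iter_unpack(buf)}


if __name__ == "__main__":
    data = pack_vector({12: 30, 3: 70})
    print(len(data), unpack_vector(data))
```

Each keyword costs 6 bytes, so a page with up to 166 distinct keywords stays within the 1000-byte limit of the content_vector field.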
... information is obtained about how often each word occurs in each page and about the links between the page under consideration and its neighboring pages, and from this the weight of each word is computed. When the database is re-indexed (after a certain interval), the value of this field is recomputed as part of the indexing process.
Adding the field content_vector to the database does not affect the operation of the Vietseek system as a whole, nor that of the search and indexing modules, because every command that operates on the database explicitly names the fields it works with. Adding a new field therefore has no effect at all on the existing operations of the system.
Because the number of web pages is very large, computing and comparing the closeness between the representation vector of one page and those of all the other pages in the database is certain to be time-consuming. Our remedy is to precompute, for each URL, a list of the URLs similar to it, that is, closest to it. These URLs are stored in the same way as the hyperlinks between pages, namely in a structure similar to table citation. The number of these URLs is bounded by a threshold (about the 100 URLs with the highest similarity), since users usually look at no more than the first 20 pages.
4.5. Algorithms
Algorithm 1. (Create content_vector)
(1) word ← first keyword in table wordurl (word not yet considered)
(2) while (table wordurl still has keywords not yet considered) do
    {Consider word}
    (2.1) Get the list of URLs corresponding to word
    (2.2) url ← first URL in the list (url not yet considered)
    (2.3) while (the list still has URLs not yet considered) do
        {Consider url: compute the weight of word in url}
        (2.3.1) Get n1 = total number of occurrences of word in url (available in wordurl.urls)
        (2.3.2) Look up table citation by url_id to obtain the URLs that link to url
        (2.3.3) Compute n2 and n3
        (2.3.4) Compute nw by the formula nw = [(4 * n1 + 2 * n2 + n3)/7]
        (2.3.5) Append the information about the current word (word_id and weight nw) to the end of file urlword.content_vector
        (2.3.6) url ← next URL in the list
    {end while (2.3)}
    (2.4) word ← next keyword in table wordurl
{end while (2)}
{End of Algorithm 1}
Algorithm 2. (Create the list of "content-close" URLs for each URL)
{The URLs are ordered by increasing index: 1, 2, ..., N, where N is the number of web pages in the system}
1. I ← 1
2. J ← I + 1
3. Compute dIJ = closeness of URL_I and URL_J
4. If dIJ can be inserted into URL_I then insert dIJ into URL_I (both the value dIJ and the index J). To make the algorithm fast, the list of dIJ values in URL_I is kept sorted in decreasing order of value
5. If dIJ can be inserted into URL_J then insert dIJ into URL_J (both the value dIJ and the index I)
6. J ← J + 1
7. If J ≤ N then go to 3
8. I ← I + 1
9. If I < N then go to 2
10. End
{End of Algorithm 2}
This algorithm contains two subproblems that need to be solved:
- Checking whether dIJ can be inserted into URL_I (or URL_J). Since each URL only needs to keep its 100 nearest neighbors, while the algorithm runs each URL holds at most 100 "currently nearest" neighbors. For convenience, the dIJ values of a URL are kept in decreasing order and binary insertion is used to place dIJ into the sorted list. If the position of dIJ is beyond 100, dIJ is not inserted into the list.
- Inserting dIJ into URL_I (or URL_J): two quantities are inserted, the closeness value dIJ and the index J when URL_I is considered (or the index I when URL_J is considered).
Using the result of Algorithm 2, we can build the algorithm that finds the web pages close in content to the current web page simply by displaying the list of 100 web pages stored for the current web page.
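Algorithm 2 and its two subproblems can be sketched as follows: every pair (I, J) is compared once, and each URL keeps a bounded list of its closest neighbors, maintained by binary insertion. This is an illustrative sketch, not the Vietseek code; the names `insert_neighbor` and `build_neighbor_lists`, the in-memory lists, and the use of Python's `bisect` module for the binary insertion are assumptions.

```python
import bisect
import math


def cosine(x, y):
    """Closeness of two vectors, as in formula (1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0


def insert_neighbor(neighbors, closeness, index, limit=100):
    """Binary insertion into a list kept in decreasing closeness order.

    Entries are stored as (-closeness, index) so that the ascending
    order used by bisect corresponds to decreasing closeness.
    Returns False when the entry would fall beyond `limit`.
    """
    entry = (-closeness, index)
    pos = bisect.bisect_right(neighbors, entry)
    if pos >= limit:
        return False
    neighbors.insert(pos, entry)
    if len(neighbors) > limit:
        neighbors.pop()  # drop the entry pushed past the limit
    return True


def build_neighbor_lists(vectors, limit=100):
    """For each page, return up to `limit` (index, closeness) pairs,
    closest first. Indices here are 0-based positions in `vectors`."""
    n = len(vectors)
    lists = [[] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = cosine(vectors[i], vectors[j])
            insert_neighbor(lists[i], d, j, limit)
            insert_neighbor(lists[j], d, i, limit)
    return [[(idx, -neg) for neg, idx in lst] for lst in lists]


if __name__ == "__main__":
    pages = [[100, 0], [90, 10], [0, 100]]
    for page, neighbors in enumerate(build_neighbor_lists(pages, limit=2)):
        print(page, neighbors)
```

The double loop still performs N(N-1)/2 closeness computations, so the saving comes from doing this once at indexing time rather than at query time, as Section 4.4 describes.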
5. EXPERIMENTAL RESULTS AND DISCUSSION
In the pilot deployment, Vietseek built an index for about 3000 Vietnamese sites with about 3 million web pages. About 2.5 million keywords were stored. At present Vietseek has the text search function of an ordinary search engine (Figure 4). Search results are returned very quickly and accurately, because link-based page ranking is carried out at the time the pages are indexed, and the display ranking of result pages is computed from the four criteria described above. Vietseek converts all the different Vietnamese encodings (TCVN, VNI, VIQR) to Unicode, and results are returned in Unicode.
The functions for image search and for finding web pages similar in content to the current web page, using the algorithms proposed above, are still being integrated into Vietseek.
We are continuing research aimed at proposing more refined new representations of web pages, for example improving the web page representation on the basis of fuzzy set theory [7], adding automatic rule discovery [2], or providing views of Vietseek for each field of user activity (natural sciences, social sciences, information technology, business, and so on).
Figure 4. A Vietseek search results page
Acknowledgment. We sincerely thank the TTVN Online network and VDC1 for supporting and helping us in the pilot deployment of the Vietseek search engine.
REFERENCES
[1] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan, Searching the Web, Technical Report, Computer Science Department, Stanford University, 2000.
[2] Bettina Berendt, Web Usage Mining, Site Semantics, and the Support of Navigation, Humboldt University Berlin, Institute of Pedagogy and Informatics, Berlin, Germany, 2000.
[3] Holger Billhardt, Daniel Borrajo, and Victor Maojo, Context vector model for information retrieval, Journal of American Society for Information Science and Technology (JASIS) 53 (2002) 236-249.
[4] Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan, PEBL: Positive example based learning for web page classification using SVM, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Alberta, Canada, July 23-26, 2002, 239-248.
[5] Martin Ester, Hans-Peter Kriegel, and Matthias Schubert, Web site mining: A new way to spot competitors, customers and suppliers in the world wide web, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Alberta, Canada, July 23-26, 2002, 249-258.
[6] Nguyen Ngoc Minh, Nguyen Tri Thanh, Ha Quang Thuy, Luong Song Van, and Nguyen Thi Van, A knowledge discovery model in fulltext databases, Proceedings of the First Workshop of International Joint Research: "Parallel Computing, Data Mining and Optical Networks", Japan Advanced Institute of Science and Technology (JAIST), Tatsunokuchi, Japan, March 7, 2001, 59-68.
[7] Sen Slattery, Hypertext classification, Doctoral dissertation (CMU-CS-02-142), School of Computer Science, Carnegie Mellon University, 2002.
[8] Son Doan and Susumu Horiguchi, A new Text Representation Method using Fuzzy Concepts in Text Categorization, JAIST Science Reports, 2002.
[9] Thorsten Joachims, Optimizing search engines using clickthrough data, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Alberta, Canada, July 23-26, 2002, 133-142.
Received 25 August 2003
Revised 21 June 2004