AN GIANG UNIVERSITY
FACULTY OF ENGINEERING - TECHNOLOGY - ENVIRONMENT
DUONG THANH TRUC - DTH082062
GRADUATION THESIS FOR THE BACHELOR'S DEGREE IN INFORMATICS
A STUDY OF CLASSIFICATION TECHNIQUES
FOR VIETNAMESE TEXT
Supervisor
Dr. Nguyen Van Hoa
An Giang, 05/2012
ACKNOWLEDGMENTS
First of all, I would like to express my deepest gratitude to my supervisor, Dr. Nguyen Van Hoa, who guided me wholeheartedly throughout the work on this graduation thesis.
I am also deeply grateful to Mr. Ho Nha Phong, MSc, Ms. Nguyen Thi My Truyen, MSc, and the other lecturers who taught me with dedication over the past four years. The knowledge I gained at university will help me move forward with confidence.
Finally, I sincerely thank all my friends, and especially my parents and my younger sister, who always encouraged me and helped me overcome the difficulties in my life.
Student
Duong Thanh Truc
ABSTRACT
Text classification means recognizing which topic the content of a document belongs to. This is very easy for people but very hard for a machine. Training a machine to understand and classify documents with high accuracy is still a difficult problem, especially for Vietnamese text. A fair amount of research has addressed this problem with promising results, but many obstacles remain. In this thesis I briefly introduce the text classification methods in use and concentrate on two of them: the support vector machine (SVM) classifier and a statistical classification method. Both are highly regarded and widely used in this field (details are given in later sections).
The text classification process is divided into 3 main phases:
- Data preparation phase: collect the sample data set, segment words, compute term weights, remove stop words that carry no classification meaning, and select features.
- Training phase: build the classification models. How a classifier is built depends on the chosen method.
- Classification and evaluation phase: test the trained models on new documents and compute the classification accuracy, and from that find ways to improve the models.
I ran experiments on 5 topics: education, law, health, sports and computing. For each topic I collected 200 sample documents as training and test data (1000 documents in total). The classifiers were evaluated with the hold-out method: 2/3 of the data set is drawn at random for training, the remaining 1/3 is used for testing, and the process is repeated 3 times and averaged. The results for the SVM and statistical methods are:
Method        Accuracy
SVM           96.67%
Statistical   96.26%
TABLE OF CONTENTS
CHAPTER 1: OVERVIEW
1.1. Problem statement
1.2. History of the problem
1.3. Scope of the thesis
1.4. Research method / approach to the problem
CHAPTER 2: THEORETICAL BACKGROUND
2.1. Introduction to the Vietnamese text classification problem
2.2. Text classification model
2.1.1. Data preparation phase
2.1.2. Training phase
2.1.3. Classification and evaluation phase
2.3. Main tasks in the classification process
2.3.1. Text normalization
2.3.2. Word segmentation
2.3.3. Text representation
2.3.4. Feature selection
2.4. Text classification methods
2.4.1. k-nearest neighbors (kNN)
2.4.2. Naive Bayes
2.4.3. Decision trees
2.4.4. Support vector machines (SVM)
CHAPTER 3: RESEARCH CONTENT AND RESULTS
3.1. Building the classifier
3.1.1. Model of the classification steps
3.1.2. Building the data set
3.1.3. Text preprocessing
3.1.4. Feature selection
3.1.5. Vector space modeling
3.1.6. Building the classifier
3.1.7. Testing and evaluation
3.2. Building the text classification system
3.2.1. Requirements
3.2.2. Analysis
3.2.3. Design
3.3. Experimental results
3.3.1. Evaluating the algorithms
3.3.2. Comparing the algorithms
CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX A: USE CASE SPECIFICATIONS
APPENDIX B: LIST OF STOP WORDS
LIST OF FIGURES
Figure 1: Labeling text documents
Figure 2: Model of the data preparation phase
Figure 3: Model of the training phase
Figure 4: Model of the classification phase
Figure 5: Text representation
Figure 6: Hyperplane separating the positive samples from the negative samples
Figure 7: Model of the text classification steps
Figure 8: Overview use case
Figure 9: Use case diagram for the classification function
Figure 10: Use case diagram for the data management function
Figure 11: Use case diagram for the document feature management function
Figure 12: Use case diagram for the topic feature management function
Figure 13: Use case diagram for the stop word management function
Figure 14: Use case diagram for the representation term management function
Figure 15: Use case diagram for the training data set management function
Figure 16: Use case diagram for the classifier management function
Figure 17: Use case diagram for the system login function group
Figure 18: System architecture
Figure 19: System function diagram
Figure 20: System interface diagram
Figure 21: Main program interface
Figure 22: Text classification interface
Figure 23: Activity diagram for the text classification function
Figure 24: Text classification interface
Figure 25: Activity diagram for the text classification function
Figure 26: Add new topic interface
Figure 27: Activity diagram for the add topic function
Figure 28: Topic management interface
Figure 29: Activity diagram for the topic management function
Figure 30: Add new document interface
Figure 31: Activity diagram for the add document interface
Figure 32: Document management interface
Figure 33: Activity diagram for the document management interface
Figure 34: Document feature management interface
Figure 35: Activity diagram for the document feature management interface
Figure 36: Topic feature management interface
Figure 37: Activity diagram for the topic feature management interface
Figure 38: Stop word management interface
Figure 39: Activity diagram for the stop word management interface
Figure 40: Document representation term management interface
Figure 41: Activity diagram for the representation term management interface
Figure 42: Training data set management interface
Figure 43: Activity diagram for the training data set management interface
Figure 44: Data set export interface
Figure 45: Activity diagram for the training data set export interface
Figure 46: Automatic classifier construction interface
Figure 47: Activity diagram for the automatic classifier construction interface
Figure 48: Classifier management interface
Figure 49: Activity diagram for the automatic classifier construction interface
Figure 50: Relationship diagram (database)
LIST OF TABLES
Table 1: Diacritic placement normalization
Table 2: List of actors
Table 3: System configuration
Table 4: Data usage of the text classification interface
Table 5: Data usage of the folder classification interface
Table 6: Data usage of the add new topic interface
Table 7: Data usage of the topic management interface
Table 8: Data usage of the add new document interface
Table 9: Data usage of the document management interface
Table 10: Data usage of the document feature interface
Table 11: Data usage of the topic feature interface
Table 12: Data usage of the stop word management interface
Table 13: Data usage of the document representation term interface
Table 14: Data usage of the training data set management interface
Table 15: Data usage of the training data set export interface
Table 16: Data usage of the automatic classifier construction interface
Table 17: Data usage of the classifier management interface
Table 18: Structure of the topic table
Table 19: Structure of the document table
Table 20: Structure of the topic feature table
Table 21: Structure of the document feature table
Table 22: Structure of the representation term table
Table 23: Structure of the stop word table
Table 24: Structure of the classifier table
Table 25: Structure of the account table
Table 26: Confusion matrix of the SVM classification results, run 1
Table 27: Confusion matrix of the SVM classification results, run 2
Table 28: Confusion matrix of the SVM classification results, run 3
Table 29: Confusion matrix of the statistical algorithm's classification results, run 1
Table 30: Confusion matrix of the statistical algorithm's classification results, run 2
Table 31: Confusion matrix of the statistical algorithm's classification results, run 3
Table 32: Use case: classify a document
Table 33: Use case: classify a folder
Table 34: Use case: add a topic
Table 35: Use case: delete a topic
Table 36: Use case: add a document
Table 37: Use case: delete a document
Table 38: Use case: segment words
Table 39: Use case: normalize a document
Table 40: Use case: remove stop words from a document
Table 41: Use case: find document features
Table 42: Use case: delete document features
Table 43: Use case: find topic features
Table 44: Use case: delete topic features
Table 45: Use case: add a stop word
Table 46: Use case: delete a stop word
Table 47: Use case: delete a representation term
Table 48: Use case: view the list of training data sets
Table 49: Use case: create a training data set
Table 50: Use case: back up a training data set
Table 51: Use case: restore a training data set
Table 52: Use case: export a training data set
Table 53: Use case: build a classifier automatically
Table 54: Use case: create a classifier
Table 55: Use case: evaluate a classifier
Table 56: Use case: test a classifier
Table 57: List of stop words
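LIST OF ABBREVIATIONS
Abbreviation   Full form
SVM            Support Vector Machine
VB             Văn bản (document)
CD             Chủ đề (topic)
DTVB           Đặc trưng văn bản (document feature)
TDLH           Tập dữ liệu học (training data set)
DTCD           Đặc trưng chủ đề (topic feature)
BD             Biểu diễn (representation)
BPL            Bộ phân loại (classifier)
CSDL           Cơ sở dữ liệu (database)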
CHAPTER 1: OVERVIEW
1.1. Problem statement
Information technology has changed the whole world: a hard disk the size of a hand can hold as much data as a large room full of books. Today, many sources of textual information are gradually being converted into data stored on computers or transmitted over networks. Because such data is compact, lasts a long time, and is convenient to use and exchange, it now forms an enormous volume of digital libraries, e-mail, the World Wide Web, and data stored on personal computers. As the number of documents grows, so does the need to search them, and automatic text classification becomes an urgent requirement. Classification helps us find information faster than searching through documents one by one. With the number of documents growing rapidly, searching each document in turn costs a great deal of time and effort, is tedious, and is ultimately infeasible. Automatic text classification is therefore truly necessary.
Text classification plays a very important role in processing text data and is widely applied in many areas, such as search, information extraction, e-mail spam filtering, e-mail classification, and automatic news classification. It is also a foundation and a driving force for the development of other research areas.
Automatic classification is one of the classic problems in text data processing, and it matters most when large amounts of data have to be handled. Many studies around the world have achieved promising results in this direction. Research and applications for Vietnamese text, however, are still limited, largely because of the particular characteristics of Vietnamese vocabulary and sentence structure. Many text classification methods have been used, such as Bayesian decision, decision trees, k-nearest neighbors, and neural networks. These methods give acceptable results and are used in practice. In recent years, classification with the support vector machine (SVM) classifier has attracted attention and is widely used in recognition and classification. Compared with other classification methods, the classification ability of SVM is equivalent or considerably better [5].
1.2. History of the problem
Text classification has attracted the attention of many researchers in recent years. Many studies on English and other languages have achieved promising results, for example the statistical approaches of Yang & Xin (1999) [13] and Support Vector Machine [8].
For Vietnamese there have also been many studies, such as Vietnamese text classification with the SVM classifier [5], a study applying frequent itemsets and association rules to Vietnamese text classification with semantic considerations [2], and text classification with decision trees [6]. In general, these approaches all give acceptable results. Some limitations remain, however, because the characteristics of Vietnamese vocabulary and sentence structure reduce classification performance.
1.3. Scope of the thesis
In this thesis I address the following:
- A brief introduction to the text classification problem.
- Issues related to text classification, such as word segmentation and text representation.
- A presentation of the text classification algorithms in use.
- Issues specific to classifying Vietnamese text.
- Building a Vietnamese text classification program that uses the SVM algorithm and a statistical method.
Classification determines which of the predefined topics a document belongs to, or reports that the topic cannot be determined. The number of topics can be extended freely. In this thesis I build 5 topics: education, law, health, sports and computing. For each topic I collect 200 sample documents to use as training and test data.
1.4. Research method / approach to the problem
The main problems to be solved in this thesis are:
- Studying the theory and algorithms of text classification: survey the text classification algorithms in use and their effectiveness, and build a program that compares the effectiveness of the algorithms.
- The text classification workflow: study the workflows that have been used (mainly in two documents: Vietnamese text classification with the SVM classifier [5] and Vietnamese text classification with decision trees [6]), then select and adjust a workflow to fit the actual situation.
- Issues related to classification
o Word segmentation: segmenting words in Vietnamese text is extremely difficult because of the characteristics of the language. Many authors have studied this problem and achieved good results. In this thesis I use the word segmentation tool vnTokenizer 4.1.1 [1].
o Feature selection: a document contains many words with no classification meaning, so these words must be removed from the document before it is represented. Feature selection picks out the words that do carry classification meaning. This is done in 2 ways: manually and automatically.
o Text representation: for a computer to capture the meaning of a document and tell one document from another, the document has to be represented in some form. Several effective representations are in use, such as vectors and syntax trees. In this thesis I use the vector representation because it is simple and meets the requirements (details are presented in a later section).
- Building the SVM model: the SVM algorithm has long been studied by many experts with great success, and it has been packaged into libraries for different purposes. In this thesis I use version 1.63 of the SVM.NET library by Matthew Johnson [1].
- Building the SVM classifiers: a document may have features belonging to several different topics. I therefore build several classifiers, each of which distinguishes between 2 topics. With n topics this gives n(n-1)/2 classifiers. A document to be classified is run through all of these classifiers, and each time a classifier assigns the document to a topic, that topic's score is increased. Finally the topic with the highest score is chosen, or the first such topic if several topics share the highest score.
- Building the training data set: in this thesis I build a data set for 5 topics: education, law, health, sports and computing. For each topic I collect 200 documents, mainly from the news sites vnexpress.net, tuoitre.com, thanhnien.com.vn, dantri.com.vn and tienphong.vn. These documents are normalized, feature-selected, represented, and then combined into several data sets, one for each classifier.
o Testing and evaluation: use the methods for evaluating classification performance that are common in this field, mainly the hold-out method.
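The pairwise voting scheme described in the list above (one binary classifier per pair of topics, n(n-1)/2 in total, with ties going to the first topic) can be sketched as follows. This is only a minimal illustration and not the thesis's SVM.NET code: `pairwise_predict` is a hypothetical callback standing in for a trained two-topic SVM, and `toy_pairwise` is an invented keyword counter whose only purpose is to make the sketch runnable.

```python
from itertools import combinations


def one_vs_one_vote(topics, pairwise_predict, document):
    """Classify `document` by majority vote over all topic pairs.

    `pairwise_predict(a, b, document)` is a hypothetical stand-in for a
    trained binary classifier; it must return either `a` or `b`.
    """
    scores = {topic: 0 for topic in topics}
    for a, b in combinations(topics, 2):  # n * (n - 1) / 2 pairs
        scores[pairwise_predict(a, b, document)] += 1
    best = max(scores.values())
    # Ties go to the first topic in the list that reaches the top score.
    return next(topic for topic in topics if scores[topic] == best)


def toy_pairwise(a, b, document):
    """Invented placeholder: prefer the topic whose name occurs more often."""
    return a if document.count(a) >= document.count(b) else b


TOPICS = ["education", "law", "health", "sports", "computing"]
print(one_vs_one_vote(TOPICS, toy_pairwise, "sports news: sports and health"))
```

With 5 topics the loop runs 10 pairwise decisions per document. In a real system each callback call would dispatch to the classifier trained on that pair of topics.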
CHAPTER 2: THEORETICAL BACKGROUND
2.1. Introduction to the Vietnamese text classification problem
Text classification is one of the hard problems in text data processing. Solving it means assigning each document a label from a set of predefined topics.
Figure 1: Labeling text documents (documents VB1, VB2, ..., VBn are each linked to one of Topic 1, Topic 2, ..., Topic n)
Text classification problems fall into two main types:
- Single-label classification: a document is assigned exactly one label.
- Multi-label classification: a document may be assigned several labels.
Text classification plays a very important role in processing text data and is widely applied in many areas, such as search, information extraction, e-mail spam filtering, e-mail classification, and automatic news classification. It is also a foundation and a driving force for the development of other research areas.
We now receive enormous volumes of data from every field. Exploiting this data and discovering knowledge in it is essential and has attracted many researchers. Machine learning is currently one of the most successful approaches to data mining.
Machine learning is divided into 3 categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning methods are currently widely used for text classification and have been very successful.
2.2. Text classification model
Text classification with supervised learning methods is divided into 3 main phases: data preparation, training, and classification with evaluation of the results.
2.1.1. Data preparation phase
This is the first phase of the text classification process. Its result is a vector space that serves as the basis for the learning algorithm. This phase is important and strongly affects the performance of the resulting classifier: if the knowledge it provides is incomplete or inaccurate, no high-performing classifier can be trained.
This phase consists of the following tasks:
- Collecting sample data: this is an important task that takes considerable time and effort. Data can be drawn from many different sources, but the collected documents must be accurately classified and reasonably similar to one another.
- Text preprocessing: convert the documents in the data store into a form that the learning algorithm can understand and analyze.
- Text representation: encode each document with a weighted vector model that suits the chosen learning algorithm.
- Feature selection: select the words with high classification value and remove words or attributes that carry no classification meaning from the data set, in order to improve classification performance and reduce training time.
Figure 2: Model of the data preparation phase (blocks: data set, preprocessing, representation, feature selection, vector space for the learning algorithm)
2.1.2. Training phase
Once the training data set has been built, the chosen learning algorithm (for example SVM, decision trees, kNN, or Naive Bayes) is trained on it. The result of this phase is a set of classifiers.
Figure 3: Model of the training phase (blocks: vector space for the learning algorithm, machine learning, classifier)
2.1.3. Classification and evaluation phase
To classify a document, the steps of the data preparation phase must first be applied to it. The result is a vector, which is then passed to the classifier.
Evaluating a classifier has two aspects: evaluation on the training data set and evaluation on new data. Care should be taken to choose a measure that suits the learning algorithm.
Figure 4: Model of the classification phase (blocks: new document, preprocessing, representation, feature selection, vector space for the learning algorithm, classifier, classified document)
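Evaluation on new data can follow a hold-out protocol like the one used in this thesis: a random 2/3 of the samples for training, the remaining 1/3 for testing, repeated 3 times and averaged. The sketch below illustrates that protocol under stated assumptions: `train_fn` and `predict_fn` are hypothetical callbacks wrapping whatever learner is used, and the majority-label learner is an invented toy, not one of the thesis's classifiers.

```python
import random


def holdout_accuracy(samples, train_fn, predict_fn,
                     repeats=3, train_fraction=2 / 3, seed=0):
    """Average accuracy over `repeats` random train/test splits.

    `samples` is a list of (features, label) pairs.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        correct = sum(1 for features, label in test
                      if predict_fn(model, features) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy learner (invented for illustration): memorise the most common label.
def train_majority(train):
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)


def predict_majority(model, features):
    return model
```

With 1000 documents, each repetition would train on about 667 documents and test on the remaining 333.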
2.3. Main tasks in the classification process
2.3.1. Text normalization
For the classification system to be able to read the documents, they must all follow a common format. Plain text is normally used as the format for both the training files and the new files to be classified.
The data for the classification system is collected from many different sources, so spelling and grammar errors are hard to avoid. These errors strongly affect word segmentation and the construction of the classification system. To improve learning and classification performance, they must be removed or corrected before the documents are fed into the system.
2.3.2. Word segmentation
a. The role of word segmentation
Word segmentation plays a very important role in text classification: it lets the learning algorithm understand and analyze a document. Inaccurate segmentation can lead to misreading the meaning of the document. Every natural language has its own characteristics, so word segmentation differs from language to language. In English text, for example, each word is a single unit separated from the next by whitespace, but this is not true of Vietnamese. A Vietnamese word may consist of several syllables, and it may have different meanings depending on its context in the sentence.
b. The Maximum Matching algorithm
This algorithm has 2 forms:
- Simple form: used to resolve the ambiguity of a single word (Yi-Ru Li, 1995). Suppose we have a string of characters C1, C2, C3, ..., Cn. Scanning from the first character of the string, we check whether _C1_ is a word, then whether _C1C2_ is a word, and so on until the longest word found in the dictionary has been identified. That word is selected, and the process continues on the rest of the string until all words have been identified.
- Complex form: this variant of Maximum Matching was proposed by Chen and Liu (1992) and is considerably more complex than the simple form. They argue that the most plausible segmentation is the one given by the three-word chunk of maximum length. Starting from the beginning of the string, if there is an ambiguity (for example _C1_ is a word but _C1C2_ is also a word), we look for the words that follow each of these two candidates, and continue in this way until all possible three-word chunks have been found. We then choose the chunk with the greatest length. Suppose we have the following three-word chunks:
1. _C1_ _C2_ _C3C4_
2. _C1C2_ _C3C4_ _C5_
3. _C1C2_ _C3C4_ _C5C6_
Of these, the third chunk is the longest, so its first word _C1C2_ is taken as the correct word. We accept this word and repeat the process starting from C3 until the last word has been identified.
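The simple form described above can be sketched as a greedy longest-match scan. This is a minimal sketch under assumptions that go beyond the source: it works on whitespace-separated Vietnamese syllables rather than Chinese characters, the `max_len` limit of 4 syllables is an arbitrary choice, the three-entry dictionary is invented, and any syllable not covered by the dictionary is emitted as a one-syllable word.

```python
def maximum_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest match."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is accepted even if it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Invented mini-dictionary, for illustration only.
DICTIONARY = {"phân loại", "văn bản", "tiếng Việt"}
print(maximum_matching("tôi phân loại văn bản tiếng Việt".split(), DICTIONARY))
```

The complex form differs in that it enumerates three-word chunks at each position and keeps only the first word of the longest chunk.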
c. The MMSEG model
MMSEG is a word identification system for Mandarin Chinese text introduced by Chih-Hao Tsai (1996). It extends the two variants of the Maximum Matching algorithm. What is new in this model is the use of three additional ambiguity resolution rules. Two of them were introduced by Chen and Liu (1992), and the remaining one was proposed by Chih-Hao Tsai.
Chih-Hao Tsai tested the model on Mandarin Chinese and obtained 98% accuracy, which is relatively high compared with other methods. He used a lexicon of 124,499 multi-character entries (a character corresponds to a syllable in Vietnamese), with entry lengths from 2 to 8 characters, together with the usage frequencies of 13,060 single-character entries, which are used in rule four for word ambiguity resolution. The algorithm is described in detail as follows:
- Simple form: for character Cn in the string, match the substring starting at Cn against the dictionary to find all possible matching words.
- Complex form: for character Cn in the string, find all possible three-word chunks starting at Cn, regardless of whether the first word is ambiguous. These chunks are formed only when the segmentation is ambiguous. The following four ambiguity resolution rules are then applied to find the correct word.
o Rule 1: Maximum matching (Chen & Liu, 1992)
• Simple maximum matching: take the word with the greatest length.
• Complex maximum matching: take the first word from the chunk with the greatest length. If more than one chunk has the greatest length, apply the next rule.
o Rule 2: Largest average word length (Chen & Liu, 1992). At the end of a string, chunks with only one or two words are common. For example, the following chunks have the same length and the same variance of word lengths:
1. _C1_ _C2_ _C3_
2. _C1C2C3_
Rule 2 takes the first word of the chunk with the largest average word length. In the example above we take the word _C1C2C3_ from the second chunk. The assumption behind this rule is that multi-character words are more likely than one-character words.
This rule is useful only when one or more word positions in the chunk are empty. When every chunk contains a full three words, the rule cannot decide, because three-word chunks with the same total length necessarily have the same average length. Another solution is therefore needed.
o Rule 3: Smallest variance of word lengths (Chen & Liu, 1992). Suppose we have the following two three-word chunks:
1. _C1C2_ _C3C4_ _C5C6_
2. _C1C2C3_ _C4_ _C5C6_
Rule 3 takes the first word of the chunk with the smallest variance of word lengths. In the example above we take the word _C1C2_ from the first chunk.
o Rule 4: Largest sum of the degree of morphemic freedom of one-character words. As an example, take two chunks with the same length, the same variance and the same average word length:
1. _C1_ _C2_ _C3C4_
2. _C1_ _C2C3_ _C4_
Here we look only at one-character words, and both chunks contain two of them. The sum of the degree of morphemic freedom is computed as the sum of the logarithms of the frequencies of all one-character words in a chunk. The reason for the logarithmic transformation is that the same difference in frequency does not have the same effect across the whole range of frequencies.
Rule 4 takes the first word of the chunk with the largest sum of log frequencies. Since two characters are unlikely to have exactly the same frequency, no ambiguity should remain after this rule is applied.
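The four rules above act as successive filters on the candidate chunks. The sketch below shows one way to express that filtering. It is not the MMSEG reference implementation: chunks are assumed to be tuples of words already produced by the matching step, `char_freq` is a hypothetical frequency table for one-character words, and a character missing from that table is given frequency 1 purely so that the logarithm is defined.

```python
import math
from fractions import Fraction


def select_chunk(chunks, char_freq):
    """Pick one chunk (a tuple of words) by applying the four rules in order."""

    def total_length(chunk):
        return sum(len(word) for word in chunk)

    def average_length(chunk):
        return Fraction(total_length(chunk), len(chunk))

    def variance(chunk):
        mean = average_length(chunk)
        return sum((len(word) - mean) ** 2 for word in chunk) / len(chunk)

    def log_frequency(chunk):
        # Only one-character words contribute; unseen characters count as 1.
        return sum(math.log(char_freq.get(word, 1))
                   for word in chunk if len(word) == 1)

    rules = [
        (total_length, max),    # Rule 1: maximum matching
        (average_length, max),  # Rule 2: largest average word length
        (variance, min),        # Rule 3: smallest variance of word lengths
        (log_frequency, max),   # Rule 4: largest sum of log frequencies
    ]
    candidates = list(chunks)
    for score, choose in rules:
        best = choose(score(chunk) for chunk in candidates)
        candidates = [chunk for chunk in candidates if score(chunk) == best]
        if len(candidates) == 1:
            break
    return candidates[0]
```

The segmenter would then accept `select_chunk(...)[0]`, the first word of the winning chunk, and restart from the character that follows it. Exact fractions are used for the average and the variance so that ties are detected reliably.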
2.3.3. Text representation
For a learning algorithm to understand and analyze documents, the documents must be represented according to some model, and the appropriate model depends on the classification algorithm. One simple and commonly used model is the vector space model, in which each document is represented as a vector
$$\vec{vb} = (w_1, w_2, \ldots, w_n)$$
where $w_i$ (with $i = 1, 2, \ldots, n$) is the weight of the i-th term in the document and n is the number of terms used to represent documents. As an example, consider the representation of the following document (Figure 5):
"A 'grey hat' hacker using the alias D35m0nd142 exploited security vulnerabilities to break into the web servers of three major websites: Skype.com, Oracle.com and UN.org."
Figure 5: Text representation (the sample document is mapped to the weights 1, 0, 1, 1, 0 over the representation terms; the legible term labels are "Vi tính", "mũ", "xám" and "Thể thao")
The result is the vector $\vec{vb} = (1, 0, 1, 1, 0, \ldots)$.
a. Boolean weighting
Boolean weighting is the simplest way to assign term weights. In this approach a term has the value 1 if it occurs in the document and 0 if it does not.
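Boolean weighting amounts to a membership test per representation term. The sketch below illustrates it. The vocabulary and the token list are invented for the example (the first vocabulary entry in particular is a guess, since the source figure is only partly legible), and the document is assumed to have been segmented into words already.

```python
def boolean_vector(document_words, vocabulary):
    """Return 1 for each vocabulary term present in the document, else 0."""
    present = set(document_words)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical representation terms and pre-segmented tokens.
vocabulary = ["hacker", "vi tính", "mũ", "xám", "thể thao"]
tokens = ["hacker", "mũ", "xám", "bí danh", "khai thác", "lỗ hổng",
          "bảo mật", "máy chủ", "website"]
print(boolean_vector(tokens, vocabulary))
```

The order of `vocabulary` fixes the meaning of each vector position, so the same list must be used for every document.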
b. Term frequency weighting
Term frequency, denoted TF, is the number of times a term occurs in a document. This weighting scheme assumes that a term is important to a document if it occurs many times in that document:
$$w_i = TF_i$$
where $w_i$ is the weight of the i-th term and $TF_i$ is the number of occurrences of the i-th term in the document.
c. TF_IDF weighting
TF_IDF is more widely used than the two methods above. The weight is the product of the term frequency TF and the inverse document frequency:
$$w_i = TF_i \times \log\frac{N}{DF_i}$$
where:
$TF_i$: number of occurrences of the i-th term in the document (term frequency)
$N$: total number of documents in the data set
$DF_i$: number of documents in the data set that contain the i-th term
When documents differ in length, raw term frequency favors long documents. For example, a term that occurs 100 times in a 1000-word document has a higher term frequency than a term that occurs 10 times in a 50-word document, even though the second term is more distinctive within its document (10/50 = 0.2 against 100/1000 = 0.1). To correct for this, the term frequency of term $t_i$ in a document is computed as
$$TF_i = \frac{n_i}{\sum_{k} n_k}$$
where the numerator $n_i$ is the number of occurrences of the i-th term and the denominator is the total number of words in the document.
TF_IDF weights are commonly used to find the terms that will represent documents. The larger a term's TF_IDF value, the better that term distinguishes between documents.
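The TF_IDF weighting described above, with the length-normalized term frequency, can be computed directly from token counts. The sketch below assumes the natural logarithm (the source does not state the base) and takes documents as lists of already segmented words. The function name and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term / words in document) * log(N / DF of term)
    """
    n_docs = len(documents)
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return weights


corpus = [["bóng đá", "trận đấu", "bóng đá"], ["trận đấu", "luật"]]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

A term that occurs in every document gets weight 0, because log(N / DF) = log(1) = 0. This is the behavior expected of words that do not discriminate between documents.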
2.3.4. Feature selection
a. Removing stop words
A document contains many words that are not really necessary and carry no meaning for classification. These are called stop words. They include relational words, sentence connectives, digits, punctuation marks, and the like. Such words usually occur very often in a document and do not reflect its classifiable content. They therefore need to be removed so that documents become more distinct from one another, which reduces dimensionality and at the same time improves accuracy and processing speed.
There are several ways to remove stop words. The classical method is to compile a list of the stop words to be removed. This is simple but not general, because it is impossible to list every such word. It is easy to observe that stop words are words that occur either too rarely or too often across documents, so we can also use document frequency and set thresholds to remove them.
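The classical list-based approach can be sketched as a simple filter. The stop list here is a two-word invented sample rather than the thesis's Appendix B list, and treating pure digits and punctuation-only tokens as stop words follows the categories named above.

```python
def remove_stop_words(words, stop_words):
    """Drop listed stop words, pure numbers, and punctuation-only tokens."""
    kept = []
    for word in words:
        if word.lower() in stop_words:
            continue  # explicitly listed stop word
        if word.isdigit():
            continue  # number
        if not any(ch.isalnum() for ch in word):
            continue  # punctuation only
        kept.append(word)
    return kept


# Invented sample stop list, for illustration only.
STOP_WORDS = {"và", "là"}
print(remove_stop_words(["Văn bản", "và", "là", "2012", ",", "phân loại"], STOP_WORDS))
```

A threshold on document frequency complements such a list by catching frequent or rare words that nobody thought to enumerate.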
b. Dimensionality reduction
Once documents have been represented, they form vectors whose number of dimensions equals the number of terms used for representation. For the classifier to work well, a very large number of representation terms has to be used, which makes learning and classification much slower and impractical. In some cases, using too many features to represent documents even reduces the performance of the classifier. Reducing the number of features used to represent documents is therefore essential.
There are two different approaches to dimensionality reduction, depending on whether the reduction is local or global.
- Local dimensionality reduction: for a class $c_i$, choose the attributes or terms that are most informative for that class.
- Global dimensionality reduction: choose the attributes or terms that are informative for classification across all classes $C = \{c_1, c_2, c_3, \ldots, c_k\}$.
Several methods are used to reduce dimensionality in text classification, such as mutual information, the chi-square statistic, and document frequency.
o Mutual information:
In text classification, this method uses the mutual information between each term and each document class to choose the best terms. The mutual information between term t and class c is computed as
$$MI(t, c) = \sum_{t \in \{0,1\}} \sum_{c \in \{0,1\}} P(t, c) \log\frac{P(t, c)}{P(t)\,P(c)}$$
where:
$P(t, c)$ is the probability that term t and class c occur together
$P(t)$ is the probability of occurrence of term t
$P(c)$ is the probability of occurrence of class c.
Dp do MI toan cue (tinh tren toan bp tap tai lieu huin luyen) cho tir t duac tinh
nhu sau:
MIavg(t)=
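As an illustration of the mutual information measure, the sketch below computes MI for one word and one class from a 2x2 table of document counts. The function name `mutual_information`, the argument names, the base-2 logarithm, and the estimation of probabilities from raw document counts are all assumptions made for this example, not details given in the text.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a word and a class from a 2x2 document-count table.

    n11: documents containing the word and belonging to the class
    n10: documents containing the word, not in the class
    n01: documents without the word, in the class
    n00: documents without the word, not in the class
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, word-side marginal, class-side marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, word_margin, class_margin in cells:
        if joint == 0:
            continue  # convention: 0 * log(0) contributes nothing
        p_joint = joint / n
        p_indep = (word_margin / n) * (class_margin / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi
```

A word that is spread evenly over both classes scores 0, while a word that appears in exactly the documents of one class scores highest, so ranking words by this value and keeping the top ones is one way to apply the measure.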
o Chi-Square Statistic ($\chi^2$):
The Chi-Square statistic measures the goodness of fit between observed and expected data.
Let $\chi^2$ denote the fit between the actually observed values (O) and the theoretically
expected values (E). The $\chi^2$ statistic then has the form:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
In text classification, the $\chi^2$ statistic measures the dependence between word t and
class c. The larger the $\chi^2$ value, the more strongly word t depends on class c.
The global $\chi^2$ measure computed over the whole training set is:
$$\chi^2_{avg}(t) = \sum_{i=1}^{k} P(c_i)\, \chi^2(t, c_i)$$
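The following minimal sketch evaluates the $\chi^2$ statistic for one word and one class by comparing observed and expected document counts in a 2x2 table. The function name `chi_square` and the count arguments are hypothetical names chosen for this example, and the expected counts are derived under the usual independence assumption.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a word/class pair from document counts.

    n11: word present, in class      n10: word present, not in class
    n01: word absent, in class       n00: word absent, not in class
    """
    n = n11 + n10 + n01 + n00
    observed = [[n11, n10], [n01, n00]]
    row_totals = [n11 + n10, n01 + n00]  # word present / absent
    col_totals = [n11 + n01, n10 + n00]  # in class / not in class
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            # Expected count if word and class were independent.
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:
                chi2 += (observed[r][c] - expected) ** 2 / expected
    return chi2
```

A value of 0 means the observed counts match what independence would predict. Larger values indicate stronger dependence between the word and the class.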
o Document Frequency:
The document frequency of a word is the number of documents that contain it. In this
method, thresholds are set to remove words whose document frequency is lower or higher
than the predefined thresholds. These are stop words or rarely used words that introduce
noise into classification. Removing them is intended to improve classification accuracy.
The thresholds must be chosen with care, however, because the document frequency of a
word also reflects how important that word is for classification.
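Document-frequency thresholding can be sketched in a few lines. The function name `filter_by_document_frequency` and the parameters `min_df` and `max_df` are hypothetical, and the documents are assumed to be already segmented into word lists.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep only words whose document frequency lies in [min_df, max_df].

    docs is a list of token lists (one list per document).
    Returns the filtered documents and the retained vocabulary.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    vocabulary = {w for w, count in df.items() if min_df <= count <= max_df}
    filtered = [[w for w in tokens if w in vocabulary] for tokens in docs]
    return filtered, vocabulary
```

With three documents and both thresholds set to 2, a word found in every document and words found in only one document are both dropped, leaving only the words of intermediate frequency.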
2.4. Text classification methods
Many learning algorithms have been applied to the text classification problem. Each has
its own characteristics and has achieved a certain degree of success. The methods most
widely used in this field include:
2.4.1. The k-nearest neighbors method (kNN)
kNN is a well-known traditional method in the statistical approach and has been studied
for many years. It is regarded as one of the best methods used since the early days of
text classification research. It is also known as Instance-based, Lazy, or Memory-based
learning. kNN can be applied to two kinds of learning problems: classification and
prediction/regression. It has been applied successfully in most areas of information
retrieval, pattern recognition, and data analysis.
- Algorithm: the kNN classification algorithm has two phases:
o Learning phase: simply store the training data.
o Classification phase: to classify a new data item z, compute the distances from z to
every training item x. Determine the set NB(z) of the nearest neighbors of z according to
the distance function d. The item z is then assigned to the majority class among the
classes of the training items in NB(z).
For kNN, the distance function plays a very important role. It is usually fixed in
advance and does not change during learning and classification. Several distance
functions are available, such as geometric distance functions, the Hamming distance, and
the cosine similarity. Each is suited to a particular kind of problem. In text
classification we use the cosine similarity:
$$\cos(x, z) = \frac{x \cdot z}{\|x\|\,\|z\|} = \frac{\sum_{i=1}^{n} x_i z_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} z_i^2}}$$
Where the documents x and z are represented as feature vectors whose weights are computed
with the TF_IDF method.
-Advantages:
o Learning is simple, the method is easy to apply, and classification results are fairly
good.
o Because it does not need to learn n separate classifiers for n classes, kNN still works
well on problems with a fairly large number of classes.
o kNN (k >= 1) can also cope with erroneous data.
-Disadvantages:
o A suitable distance function must be chosen.
o Searching for the k nearest neighbors is time-consuming, and the optimal k is hard to
find.
o Classification is poor when the documents are noisy.
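The kNN classification phase with cosine similarity can be sketched as below. The names `cosine_similarity` and `knn_classify` are hypothetical, the document vectors are assumed to hold precomputed TF_IDF weights, and ties in the majority vote are resolved arbitrarily, which is a simplification for this example.

```python
import math
from collections import Counter

def cosine_similarity(x, z):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(x, z))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_z = math.sqrt(sum(b * b for b in z))
    if norm_x == 0 or norm_z == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_x * norm_z)

def knn_classify(train_vectors, train_labels, z, k):
    """Assign z to the majority class among its k most similar neighbors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], z),
        reverse=True,  # higher cosine means closer
    )
    neighbor_labels = [label for _, label in ranked[:k]]
    return Counter(neighbor_labels).most_common(1)[0][0]
```

Every query scans the whole training set, which reflects the search cost listed among the disadvantages.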
2.4.2. The Naive Bayes method
The Naive Bayes algorithm is based mainly on Bayes' probability theorem, under the
assumption that the attributes (variables, dimensions) are independent of one another and
equally important. Although this assumption almost never holds for real data, in practice
Naive Bayes gives fairly good results and has been successful in text classification,
spam filtering, and similar tasks.
- The algorithm is stated as follows:
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\,\Pr(Y)}{\Pr(X)}$$
Where:
X, Y are arbitrary variables (discrete, numeric, structured, ...), and Y is predicted
from X
Pr(X): the probability that X occurs
Pr(Y): the probability that Y occurs
Pr(X|Y): the probability that X occurs given that Y occurs
Pr(Y|X): the probability that Y occurs given that X occurs
To apply this to the classification problem, the following are required:
o D: the training data set, vectorized in the form $x = (x_1, x_2, \ldots, x_n)$
o $C_i$ with $i = 1, 2, 3, \ldots$: the set of classes to which the documents of D belong
o The attributes $x_1, x_2, \ldots, x_n$ are pairwise independent.
By Bayes' theorem:
$$\Pr(C_i \mid x) = \frac{\Pr(x \mid C_i)\,\Pr(C_i)}{\Pr(x)}$$
By conditional independence:
$$\Pr(x \mid C_i) = \prod_{k=1}^{n} \Pr(x_k \mid C_i)$$
The classification rule for a new document $x_{new} = (x_1, x_2, \ldots, x_n)$ is then:
$$C(x_{new}) = \arg\max_{C_i} \; \Pr(C_i) \prod_{k=1}^{n} \Pr(x_k \mid C_i)$$
Where:
$\Pr(C_i)$ is computed from the frequency of documents of class $C_i$ in the training set.
$\Pr(x_k \mid C_i)$ is computed from the attribute statistics gathered during training.
- Advantages and disadvantages: Naive Bayes is very effective in some cases. If the
training data is poor and the predictive parameters (such as the feature space) are of
low quality, results will be poor. Nevertheless, it is regarded as a linear classifier
well suited to multi-topic text classification, with several advantages: it is simple to
implement, fast, easy to update with new training data because it is largely independent
of the training set, and able to combine several different training sets. In practice, an
optimal threshold is often added to obtain better classification results. Its drawback
is that the conditional independence assumption on the attributes reduces classification
accuracy.
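The Naive Bayes classification rule can be illustrated with the sketch below. It treats each document as a list of words and estimates $\Pr(x_k \mid C_i)$ from word counts with add-one (Laplace) smoothing. The smoothing, the use of log-probabilities, and the names `train_naive_bayes` and `classify_naive_bayes` are assumptions introduced for this example and are not prescribed by the text.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect the counts needed for Pr(C_i) and Pr(x_k | C_i)."""
    class_doc_counts = Counter(labels)   # documents per class
    word_counts = defaultdict(Counter)   # word occurrences per class
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def classify_naive_bayes(model, tokens):
    """Return the class maximizing log Pr(C_i) + sum_k log Pr(x_k | C_i)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocabulary:
                continue  # words never seen in training are skipped
            # Add-one smoothing (an assumption) avoids zero probabilities.
            p = (word_counts[label][w] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Summing logarithms instead of multiplying probabilities keeps the score numerically stable for long documents and does not change which class attains the maximum.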
2.4.3. The decision tree method
Decision trees are among the top 10 data mining algorithms [11]. Unlike other learning
models such as neural networks or support vector machines, the decision tree model is
simple and fast while still giving good results. In particular, its output is a set of
simple rules that are easy to interpret. Decision tree algorithms can handle both
discrete and continuous data. Decision trees are found in most applications, such as text
classification, spam classification, intrusion detection, and regression.
Decision tree learning consists of two main steps: building the tree (top-down) and
pruning it (bottom-up) to avoid overfitting. The tree is built as follows:
-Start at the root node, which holds all the training data.
-If the data at a node all belong to the same class, the node becomes a leaf labeled with
that class (or with the majority class if the leaf contains elements of different
classes).
-If the data at a node contain elements of very different classes (the node is not
homogeneous), the node becomes an internal node, and the data are partitioned recursively
by choosing the attribute that gives the best possible partition.
Tree construction depends mainly on choosing the best attribute for partitioning the
data. An attribute is considered good, and is used for partitioning, if it leads to the
smallest resulting tree. The choice relies on a heuristic: pick the attribute that
produces the purest nodes. The two most representative decision tree learning algorithms
today are C4.5 by Quinlan [9] and CART by Breiman et al. [7]. To evaluate and select the
partitioning attribute, Quinlan proposed using information gain (choose the attribute
with the largest information gain) and the gain ratio, both based on Shannon's entropy
function. Breiman, by contrast, proposed using the Gini index (choose the attribute with
the smallest Gini index).
The information gain of an attribute is the impurity before partitioning minus the
impurity after partitioning. Let $p_i$ be the probability that an element of the data D
belongs to class $C_i$ ($i = 1, \ldots, k$). Then:
-The C4.5 algorithm:
o The information impurity before partitioning is computed as:
$$Info(D) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
o The impurity after using attribute A to partition the data D into v parts is computed
as:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
o The information gain when attribute A is chosen to partition the data D into v parts
is:
$$Gain(A) = Info(D) - Info_A(D)$$
- The CART algorithm:
$$Gini(D) = 1 - \sum_{i=1}^{k} p_i^2$$
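The two attribute-selection criteria can be compared with the minimal sketch below, which computes entropy, the Gini index, and the information gain of a single categorical attribute. The function names and the representation of the data as parallel lists of attribute values and class labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class labels in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class labels in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(A) = Info(D) - Info_A(D) for one categorical attribute A."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Info_A(D): entropy of each part, weighted by its share of the data.
    info_after = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_after
```

An attribute that splits the data into pure parts recovers the full entropy as gain, whereas an attribute whose parts mirror the original class mix yields a gain of 0.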
2.4.4. The support vector machine method (SVM)
The basic property that determines a classifier's ability is its generalization
performance, that is, its ability to classify new data using the knowledge accumulated
during training. A training algorithm is considered good if the resulting classifier has
high generalization performance. Generalization performance depends on two parameters:
the training error and the capacity of the learning machine. The training error is the
misclassification rate on the training data. The capacity of the learning machine is
measured by the Vapnik-Chervonenkis dimension (VC dimension). The VC dimension is an
important concept for a family of separating functions (that is, classifiers). It is
defined as the maximum number of points that the family can separate completely in the
object space. A good classifier is one with the lowest capacity (that is, the simplest)
that still keeps the training error small. The SVM method is built on this idea.
Consider the simplest classification problem, two-class classification, with the sample
data set:
$$\{(x_i, y_i)\}, \quad i = 1, 2, \ldots, N$$
Where the samples are object vectors classified as positive samples and negative samples.
-Positive samples belong to the domain of interest and are labeled $y_i = 1$;
-Negative samples do not belong to the domain of interest and are labeled $y_i = -1$.
Figure 6: A hyperplane separating the positive samples from the negative samples
In this case, the SVM classifier is the hyperplane that separates the positive samples
from the negative samples with the maximum gap. This gap, called the margin, is the
distance between the positive and the negative samples closest to the hyperplane. This
hyperplane is called the optimal margin hyperplane.
Hyperplanes in the object space have the equation $w^T x + b = 0$, where w is the weight
vector and b is the bias. Changing w and b changes the orientation of the hyperplane and
its distance from the origin. The SVM classifier is defined as follows:
$$f(x) = \mathrm{sign}(w^T x + b) \qquad (1)$$
Where:
$\mathrm{sign}(z) = +1$ if $z \geq 0$,
$\mathrm{sign}(z) = -1$ if $z < 0$.
If f(x) = +1 then x belongs to the positive class (the domain of interest). Conversely,
if f(x) = -1 then x belongs to the negative class (other domains).
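The decision rule (1) amounts to a dot product followed by a sign test, as in the minimal sketch below. The function name `svm_decision` is hypothetical, and treating the boundary case z = 0 as positive is an assumption made so that the function always returns a label.

```python
def svm_decision(w, b, x):
    """Return +1 or -1 according to f(x) = sign(w^T x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # A point lying exactly on the hyperplane (z == 0) is assigned +1
    # here, which is an assumption made for this sketch.
    return 1 if z >= 0 else -1
```

For example, with w = [0.5, 0.5] and b = 0, the point [1, 1] is assigned to the positive class and [-1, -1] to the negative class.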
The SVM learning machine is a family of hyperplanes that depend on the parameters w and
b. The goal of the SVM method is to estimate w and b so as to maximize the margin between
the positive and negative data classes. Different margin values give different families
of hyperplanes, and the larger the margin, the lower the capacity of the learning
machine. Maximizing the margin therefore amounts to finding the learning machine with the
smallest capacity. Classification is optimal when the classification error is minimal.
If the training data set is linearly separable, the following constraints hold:
$$w^T x_i + b \geq +1 \quad \text{if } y_i = +1 \qquad (2)$$
$$w^T x_i + b \leq -1 \quad \text{if } y_i = -1 \qquad (3)$$
The two hyperplanes with equations $w^T x + b = \pm 1$ are called the supporting
hyperplanes (the dashed lines in Figure 6).
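To construct an optimal margin hyperplane, we must solve the following quadratic
programming problem:
Maximize:
$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (4)$$
Subject to the constraints:
$$\alpha_i \geq 0 \qquad (5)$$
$$\sum_{i=1}^{N} \alpha_i y_i = 0 \qquad (6)$$
Where the Lagrange multipliers $\alpha_i$, $i = 1, 2, \ldots, N$, are the variables to be
optimized.
The vector w is computed from the solution of this quadratic problem as follows:
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad (7)$$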
To determine the bias b, we choose a sample $x_i$ with $\alpha_i > 0$ and then use the
Karush-Kuhn-Tucker (KKT) condition:
$$\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0 \qquad (8)$$
The samples $x_i$ corresponding to $\alpha_i > 0$ are the samples closest to the decision
hyperplane (they satisfy (2) and (3) with equality) and are called support vectors. The
support vectors are the most important elements of the training set: with the support
vectors alone, we can still construct the same optimal margin hyperplane as with the
complete training set.
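To show how (7) and (8) are used once the Lagrange multipliers are known, the sketch below recovers w and b from a given dual solution. The function name `recover_hyperplane`, the tolerance `tol`, and the assumption that the multipliers were already obtained from some quadratic programming solver are all hypothetical details of this example.

```python
def recover_hyperplane(alphas, ys, xs, tol=1e-9):
    """Compute w by equation (7) and b by the KKT condition (8).

    alphas: Lagrange multipliers, assumed already optimized
    ys:     labels in {+1, -1}
    xs:     training vectors (lists of floats)
    """
    dim = len(xs[0])
    w = [0.0] * dim
    for a, y, x in zip(alphas, ys, xs):
        for d in range(dim):
            w[d] += a * y * x[d]  # w = sum_i alpha_i * y_i * x_i
    for a, y, x in zip(alphas, ys, xs):
        if a > tol:  # x is a support vector
            # KKT gives y * (w.x + b) = 1. Since y is +1 or -1,
            # multiplying both sides by y yields b = y - w.x.
            b = y - sum(wd * xd for wd, xd in zip(w, x))
            return w, b
    raise ValueError("no support vector found (all alphas are zero)")
```

For the three samples [1, 1] (label +1), [-1, -1] (label -1), and [3, 3] (label +1) with multipliers 0.25, 0.25, and 0, the first two samples are the support vectors and the third plays no part in w or b, which illustrates why the support vectors alone determine the hyperplane.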
If the training data set is not linearly separable, the problem can be handled in two
ways.