Transductive Support Vector Machines
for Cross-lingual Sentiment Classification

Nguyen Thi Thuy Linh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Professor Ha Quang Thuy

A thesis submitted in fulfillment of the requirements for the degree of
Master of Computer Science
December, 2009


Table of Contents
1 Introduction
  1.1 Introduction
  1.2 What might be involved?
  1.3 Our approach
  1.4 Related works
    1.4.1 Sentiment classification
      1.4.1.1 Sentiment classification tasks
      1.4.1.2 Sentiment classification features
      1.4.1.3 Sentiment classification techniques
      1.4.1.4 Sentiment classification domains
    1.4.2 Cross-domain text classification
2 Background
  2.1 Sentiment Analysis
    2.1.1 Applications
  2.2 Support Vector Machines
  2.3 Semi-supervised techniques
    2.3.1 Generate maximum-likelihood models
    2.3.2 Co-training and bootstrapping
    2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
  3.1 The semi-supervised model
  3.2 Review Translation
  3.3 Features
    3.3.1 Words Segmentation
    3.3.2 Part of Speech Tagging
    3.3.3 N-gram model
4 Experiments
  4.1 Experimental set up
  4.2 Data sets
  4.3 Evaluation metric
  4.4 Features
  4.5 Results
    4.5.1 Effect of cross-lingual corpus
    4.5.2 Effect of extraction features
      4.5.2.1 Using stopword list
      4.5.2.2 Segmentation and Part of speech tagging
      4.5.2.3 Bigram
    4.5.3 Effect of features size
5 Conclusion and Future Works
A
B

Abstract
Sentiment classification has received much attention and has many useful applications
in business and intelligence. This thesis investigates the sentiment classification problem
using machine learning techniques. Because Vietnamese sentiment corpora are limited
while many English sentiment corpora are available on the Web, we combine English
corpora as training data with a number of unlabeled Vietnamese data in a semi-supervised
model. Machine learning eliminates the language gap between the training set and the
test set in our model. Moreover, we examine several types of features to obtain the best
performance.
The results show that the semi-supervised classifier leverages the cross-lingual corpus
well compared with the classifier trained without the cross-lingual corpus. In terms of
features, we find that using only the unigram model gives the best performance.



