
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc*
College of Information and Communication Technology – TNU

SUMMARY
In this paper, we present a spam email filtering method based on machine learning, namely the Naïve Bayes classification method, because this approach is highly effective. With its learning ability (self-improving performance), a system applying this method can automatically learn and improve the effectiveness of spam email classification. At the same time, the system's classification ability is updated by new incoming emails, so it is much more difficult for spammers to defeat the classifier than with traditional solutions.
Key words: Machine learning, email spam filtering, Naïve Bayes.

INTRODUCTION
Email classification is in fact a two-class text classification problem: the initial dataset consists of spam and non-spam emails, and the texts to be classified are the emails arriving in the inbox. The output of the classification process is the class label assigned to an email, belonging to one of the two classes: spam or non-spam.

The general model of the spam email classification problem can be described as in Figure 1.

Figure 1. The spam email classification model

The categorization process can be divided into two phases:

The training phase: the input of this phase is the set of spam and non-spam emails; the output is the trained data, produced by applying a suitable classification method, which serves the classification phase.
The classification phase: the input of this phase is an email together with the trained data; the output is the classification result for that email: spam or non-spam.
The rest of this paper is organized as follows. In Sect. 2, we present the Naïve Bayes classification method and our solution. In Sect. 3, we show experimental results that evaluate the efficiency of this method. Finally, in Sect. 4, we conclude and point out possible future directions.
NAÏVE BAYES CLASSIFICATION METHOD [4]

The Naïve Bayes method is a supervised learning classification method based on a probability model: the classification decision is based on the probability values of the likelihoods of the hypotheses. The Naïve Bayes classification technique rests on Bayes' theorem and is particularly suitable for problems whose input size is large. Although Naïve Bayes is quite simple, its classification capability is often much better than that of more complex methods. To relax the statistical dependencies among attributes, the Naïve Bayes method considers the attributes conditionally independent of one another.

Classification problem

Given a training set D, where each training instance x is represented as an n-dimensional attribute vector x = (x1, x2, ..., xn), and a pre-defined set of classes C = {c1, c2, ..., cm}: into which class should a new instance z be classified?

The class assigned to z is the one maximizing P(ci | z), the probability that z belongs to class ci. It is calculated as follows:

c = argmax_{ci ∈ C} P(ci | z)
  = argmax_{ci ∈ C} P(ci | z1, z2, ..., zn)
  = argmax_{ci ∈ C} P(z1, z2, ..., zn | ci) · P(ci) / P(z1, z2, ..., zn)
  = argmax_{ci ∈ C} P(z1, z2, ..., zn | ci) · P(ci)

(the last step holds because P(z1, z2, ..., zn) is the same for all classes).

Assumption in the Naïve Bayes method: the attributes are conditionally independent given the classification:

P(z1, z2, ..., zn | ci) = ∏_{j=1..n} P(zj | ci)

The Naïve Bayes classifier therefore finds the most probable class for z:

cNB = argmax_{ci ∈ C} P(ci) · ∏_{j=1..n} P(zj | ci)

The Naïve Bayes classification algorithm can be described succinctly as follows:

The learning phase (given a training set): for each class label ci ∈ C, estimate the prior probability P(ci); for each attribute value xj, estimate the probability P(xj | ci) of that attribute value given classification ci.

The classification phase (given a new instance): for each classification ci ∈ C, compute P(ci) · ∏_{j=1..n} P(xj | ci), and select the most probable classification:

c* = argmax_{ci ∈ C} P(ci) · ∏_{j=1..n} P(xj | ci)

There are two issues we need to solve:

1. What happens if no training instance associated with class ci has attribute value xj? Then P(xj | ci) = 0, and hence:

P(ci) · ∏_{j=1..n} P(xj | ci) = 0

Solution: use a Bayesian (m-estimate) approach to estimate P(xj | ci):

P(xj | ci) = (n(ci, xj) + m·p) / (n(ci) + m)

where:
n(ci): the number of training instances associated with class ci;
n(ci, xj): the number of training instances associated with class ci that have attribute value xj;
p: a prior estimate for P(xj | ci); assuming uniform priors, p = 1/k if the attribute has k possible values;
m: a weight given to the prior, which augments the n(ci) actual observations with m additional virtual samples distributed according to p.
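To make the learning phase and the m-estimate concrete, here is a minimal Python sketch. It is our illustration, not code from the paper: train_naive_bayes and all other names are hypothetical, emails are assumed to be already represented as tuples of discrete attribute values, and the number of possible values k of each attribute is approximated by the values observed in training.

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels, m=1.0):
    """Learning phase: estimate P(ci) and P(xj | ci) using the m-estimate
    P(xj | ci) = (n(ci, xj) + m*p) / (n(ci) + m) with uniform prior p = 1/k.
    `instances` is a list of attribute-value tuples, `labels` the class of each."""
    n_ci = Counter(labels)                        # n(ci): instances per class
    total = len(labels)
    priors = {c: n_ci[c] / total for c in n_ci}   # P(ci)

    n_attrs = len(instances[0])
    # counts[j][c][v] = n(ci, xj): how often attribute j takes value v in class c
    counts = [defaultdict(Counter) for _ in range(n_attrs)]
    values = [set() for _ in range(n_attrs)]      # observed values of attribute j
    for x, c in zip(instances, labels):
        for j, xj in enumerate(x):
            counts[j][c][xj] += 1
            values[j].add(xj)

    def cond_prob(j, xj, c):
        # k is approximated by the number of values seen in training (assumption)
        p = 1.0 / len(values[j])
        return (counts[j][c][xj] + m * p) / (n_ci[c] + m)

    return priors, cond_prob
```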

2. The limited precision of computer arithmetic: P(xj | ci) < 1 for every attribute value xj and class ci, so when the number of attributes is very large, the product ∏_{j=1..n} P(xj | ci) becomes too small to be represented and underflows to zero.

Solution: use the logarithm of the probability, which turns the product into a sum:

c* = argmax_{ci ∈ C} [ log P(ci) + Σ_{j=1..n} log P(xj | ci) ]
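Continuing the sketch above, the classification phase can then be carried out entirely in log space, exactly as in this formula (again an illustration with hypothetical names, not the paper's code):

```python
import math

def classify_naive_bayes(priors, cond_prob, z):
    """Classification phase: c* = argmax_ci [log P(ci) + sum_j log P(zj | ci)].
    Working with logarithms avoids the floating-point underflow discussed above;
    the m-estimate guarantees cond_prob(...) > 0, so log() is always defined."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for j, zj in enumerate(z):
            score += math.log(cond_prob(j, zj, c))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```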

In the spam email classification problem, each sample that we consider is an email, and the set of classes that each email can belong to is C = {spam, non-spam}.

When we receive an email, if we do not know any information about it, it is very hard to decide whether it is spam or non-spam; the more characteristics or attributes of the email we know, the more we can improve the efficiency of the classification. An email has many characteristics, such as its title, its content, whether or not it carries an attachment, and so on. A simple example: if we know that 95% of HTML emails are spam and we receive an HTML email, we can use this prior knowledge to compute the probability that the received email is spam; if this probability is greater than the probability that it is non-spam, we can conclude that the email is spam, although a conclusion based on a single attribute is not very accurate. The more information we know, the greater the probability of a correct classification. To obtain these probabilities, the Naïve Bayes method is first trained on the set of initial template emails, and the resulting probabilities are then used to classify a new email; the probability calculation is based on the Naïve Bayes formula. With the obtained probability values, we compare them with each other: if the spam probability is greater than the non-spam probability, we conclude that the email is spam; otherwise it is non-spam. [5]
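As a toy illustration of the two sketches above (the attributes and counts here are invented for illustration and are not the paper's data), suppose each email is reduced to two binary attributes, is_html and has_attachment:

```python
# Toy usage of train_naive_bayes / classify_naive_bayes defined above.
emails = [(1, 0), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]  # (is_html, has_attachment)
labels = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]

priors, cond_prob = train_naive_bayes(emails, labels, m=1.0)
print(classify_naive_bayes(priors, cond_prob, (1, 0)))     # -> "spam"
```

With these counts, P(is_html=1 | spam) = (3 + 0.5)/(3 + 1) = 0.875 while P(is_html=1 | non-spam) = 0.125, so the HTML email is classified as spam, mirroring the 95% HTML example above.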

EXPERIMENTAL RESULTS

We have implemented a test that applies the Naïve Bayes method to email classification. The total number of emails in the sample dataset is 4601, including 1813 spam emails (39.4%). This dataset, called Spambase, can be downloaded from the UCI Machine Learning Repository (datasets/Spambase).
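The paper does not say how the Spambase attributes are preprocessed. Each Spambase instance is a vector of 57 continuous attributes with the class label (1 = spam) in the last column; one simple way to feed it to the discrete sketch above, assuming the standard spambase.data file, is to binarize every attribute (a choice of ours, not the paper's):

```python
import csv

def load_spambase(path="spambase.data"):
    """Load Spambase (57 continuous attributes, label 1=spam in the last column)
    and binarize each attribute as 'value > 0' so the discrete sketch applies."""
    instances, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            *attrs, label = map(float, row)
            instances.append(tuple(int(a > 0) for a in attrs))
            labels.append("spam" if label == 1 else "non-spam")
    return instances, labels
```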

This dataset is divided into two disjoint subsets: the training set D_train (66.7% of the data), for training the system, and the test set D_test (33.3%), for evaluating the trained system.
In order to evaluate a machine learning system's performance, we often use measures such as Precision (P), Recall (R), Accuracy rate (Acc), Error rate (Err), and the F1-measure.

The formulas to compute these measures are as follows:

P = nS→S / (nS→S + nN→S)

R = nS→S / (nS→S + nS→N)

Acc = (nN→N + nS→S) / (NN + NS)

Err = (nN→S + nS→N) / (NN + NS)

F1 = 2·P·R / (P + R)

where:
nS→S is the number of spam emails which the filter recognizes as spam;
nS→N is the number of spam emails which the filter recognizes as non-spam;
nN→S is the number of non-spam emails which the filter recognizes as spam;
nN→N is the number of non-spam emails which the filter recognizes as non-spam;
NN is the total number of non-spam emails;
NS is the total number of spam emails.
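These measures follow directly from the four counts. The sketch below (ours, with hypothetical names, not the paper's code) computes them and reproduces the Experiment 1 column of Table 1 below:

```python
def evaluate(n_ss, n_sn, n_nn, n_ns):
    """Compute Precision, Recall, Accuracy, Error rate and F1 from the counts
    defined above (n_ss = spam recognized as spam, n_sn = spam as non-spam,
    n_nn = non-spam as non-spam, n_ns = non-spam as spam)."""
    n_s, n_n = n_ss + n_sn, n_nn + n_ns            # NS and NN
    precision = n_ss / (n_ss + n_ns)
    recall = n_ss / (n_ss + n_sn)
    acc = (n_nn + n_ss) / (n_n + n_s)
    err = (n_ns + n_sn) / (n_n + n_s)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, err, f1

# Experiment 1 counts from Table 1 below:
print(evaluate(486, 119, 726, 204))  # ~ (0.7043, 0.8033, 0.7896, 0.2104, 0.7505)
```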

Experimental results on the Spambase dataset

We present test results for two divisions of the Spambase dataset:
Experiment 1: divide the original Spambase dataset with proportion k1 = 2/3, that is, 2/3 of the dataset for training and the remainder for testing.
Experiment 2: divide the original Spambase dataset with proportion k2 = 9/10, that is, 9/10 of the dataset for training and the remainder for testing.

Table 1. Testing results

              Experiment 1    Experiment 2
nS→S              486             180
nS→N              119               2
nN→N              726             276
nN→S              204               3
Recall          80.33%          98.90%
Precision       70.43%          98.36%
Acc             78.96%          98.92%
Err             21.04%           1.08%
F1-measure      75.05%          98.63%

The testing results in Experiment 2 show very high accuracy (approximately 99%).

Conclusion

In this paper, we have examined the effectiveness of the Naïve Bayes classification method. This is a classifier with a self-learning ability that improves classification performance, and it proved suitable for the email classification problem. Currently, we are continuing to build standard sample sets and to adjust the Naïve Bayes algorithm to improve the accuracy of classification. In the near future, we will build a standard training and testing data system for both English and Vietnamese. This is a big problem and needs further effort.

REFERENCES
[1]. Jonathan A. Zdziarski, Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, No Starch Press, 2005.
[2]. Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz, A Bayesian Approach to Filtering Junk E-Mail, AAAI Workshop on Learning for Text Categorization, 1998.
[3]. Sun Microsystems, JavaMail API Design Specification, Version 1.4.
[4]. T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[5]. Lê Nguyễn Bá Duy, Tìm hiểu các hướng tiếp cận phân loại email và xây dựng phần mềm mail client hỗ trợ tiếng Việt, Đại học Khoa học Tự nhiên, TP. Hồ Chí Minh, 2005.

SUMMARY (TÓM TẮT)
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc*
College of Information and Communication Technology – TNU

In this paper I introduce a spam email filtering method based on machine learning, because this approach is highly effective. With its learning ability (self-improving performance), the system can automatically learn and improve the effectiveness of spam email classification. At the same time, the classification ability of the system is continuously updated with new spam samples, so it is very difficult for spammers to get past it, compared with other traditional approaches.
Key words: Machine learning, spam filtering, Naïve Bayes.

Received: 13/3/2014; Reviewed: 15/3/2014; Accepted: 25/3/2014
Scientific reviewer: Dr. Trương Hà Hải – College of Information and Communication Technology – TNU
* Tel: 0984215060; Email:
