MailRank: Using Ranking for Spam Detection
Paul - Alexandru Chirita
L3S Research Center /
University of Hannover
Deutscher Pavillon, Expo Plaza 1
30539 Hannover, Germany

Jörg Diederich
L3S Research Center /
University of Hannover
Deutscher Pavillon, Expo Plaza 1
30539 Hannover, Germany

Wolfgang Nejdl
L3S Research Center /
University of Hannover
Deutscher Pavillon, Expo Plaza 1
30539 Hannover, Germany

ABSTRACT
Can we use social networks to combat spam? This paper investigates the feasibility of MailRank, a new email ranking and classification scheme exploiting the social communication network created via email interactions. The underlying email network data is collected from the email contacts of all MailRank users and updated automatically based on their email activities to keep maintenance easy. MailRank is used to rate the sender address of arriving emails such that emails from trustworthy senders can be ranked and classified as spam or non-spam. The paper presents two variants: Basic MailRank computes a global reputation score for each email address, whereas in Personalized MailRank the score of each email address is different for each MailRank user. The evaluation shows that MailRank is highly resistant against spammer attacks, which obviously have to be considered right from the beginning in such an application scenario. MailRank also performs well even for rather sparse networks, i.e., where only a small set of peers actually take part in the ranking of email addresses.
Categories and Subject Descriptors
G.2.2 [Discrete Mathematics]: Graph Theory; H.3.4 [Information
Systems]: Information Storage and Retrieval—Systems and Soft-
ware; H.2.7 [Information Systems]: Database Management—Se-
curity, Integrity and Protection
General Terms
Algorithms, Experimentation, Measurements
Keywords
Email Reputation, SPAM, MailRank, Personalization
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM ’05 Bremen, Germany
Copyright 200X ACM X-XXXXX-XX-X/XX/XX $5.00.
1. INTRODUCTION
While scientific collaboration without email is almost unthinkable, the tremendous increase of unsolicited email (spam) over the past years [5] has rendered email communication without spam filtering almost impossible. Currently, spam emails already outnumber non-spam ones, so-called ‘ham emails’. Existing spam filters such as the SpamAssassin System, SpamBouncer, or Mozilla Junk Mail Control still exhibit some problems, which can be classified in two main categories:
1. Maintenance, for both the initialization and the adaptation
of the ﬁlter during operation, since all spam ﬁlters rely on
a certain amount of input data to be maintained: Content-
based ﬁlters require keywords and rules for spam recogni-
tion, blacklists have to be populated with IP addresses from
known spammers, and Bayesian ﬁlters need a training set
of spam / ham messages. This input data has to be created
when the ﬁlter is used ﬁrst (the ‘cold-start’ problem), and
it also has to be adapted continuously to counter attacks of
spammers [7, 19].
2. Residual error rates, since current spam ﬁlters cannot elim-
inate the spam problem completely. First, a non-negligible
number of spam emails still reaches the end user, so-called
false negatives. Second, some ham messages are discarded
because the anti-spam system considers them as spam. Such
false positives are especially annoying if the sender of the
email is from the recipient’s community and thus already
known to the user, or at least known by somebody else the
user knows directly. Therefore, there is a high probability
that an email received from somebody within the social network
of the receiver is a ham message. This implies that a
social network formed by email communication can be used
as a strong foundation for spam detection.
Even if there existed a perfect anti-spam system, an additional
problem would arise for high-volume email users, some of which
simply get too many ham emails. In these cases, an automated
support for email ranking would be highly desirable. Reputation
algorithms are useful in this scenario, because they provide a rat-
ing for each email address, which can subsequently be used to sort
incoming emails. Such ratings can be gained in two ways, glob-
ally or personally. The main idea of a global scheme is that people
share their personal ratings such that a single global rating (called
reputation) can be inferred for each email address. The implemen-
tation of such a scheme can, for example, be based on network rep-
utation algorithms [6] or on collaborative ﬁltering techniques [17].
In case of a personalized scheme, the ratings (called trust in this
case) are typically different for each email user and depend on her
personal social network. Such a scheme is reasonable since some people with a presumably high global reputation (e.g., Linus Torvalds) might not be very important in the personal context of a user, compared to other persons (e.g., the project manager).
In this paper we propose MailRank, a new approach to ranking and classifying emails according to the address of email senders. The central procedure is to collect data about trusted email addresses from different sources and to create a graph for the social network, derived from each user’s communication circle [1]. There are two MailRank variants, which both apply a power-iteration algorithm on the email network graph: Basic MailRank results in a global reputation for each known email address, and Personalized MailRank computes a personalized trust value. MailRank makes it possible to classify email addresses into ‘spammer address’ and ‘non-spammer address’ and additionally to determine the relative rank of an email address with respect to other email addresses. This paper analyzes the performance of MailRank under several scenarios, including sparse networks, and shows its resilience against spammer attacks.
The paper is organized as follows: Section 2 provides informa-
tion about existing anti-spam approaches, trust and reputation algo-
rithms, as well as a description of PageRank and some approaches
to personalizing it. In Sect. 3, we describe our proposed variants of
MailRank, which we then evaluate in Sect. 4. Finally, our results
are summarized in Sect. 5.
2. BACKGROUND AND RELATED WORK
2.1 Anti-Spam Approaches
Because of the high relevance of the spam problem, many attempts to counter spam have been started in the past, including some legal initiatives. Technical anti-spam approaches comprise one or several of the following basic approaches [16]:
• Content-based approaches
• Header-based approaches
• Protocol-based approaches
• Approaches based on sender authentication
• Approaches based on social networks
Content-based approaches [7] analyze the subject of an email or the email body for certain keywords (statically provided or dynamically learned using a Bayesian filter) or patterns that are typical for spam emails (e.g., URLs with numeric IP addresses in the email body). The advantage of content-based schemes is their ability to filter quite a high number of spam messages. For example, SpamAssassin can recognize 97% of the spam if an appropriately trained Bayesian filter is used together with the available static rules [10]. The main drawback is that the filter inputs (e.g., the set of static keywords) have to be adapted continuously, since otherwise the high spam recognition rate will decrease [10].
Header-based approaches examine the headers of email mes-
sages to detect spam. Whitelist schemes collect all email addresses
of known non-spammers in a whitelist to decrease the number of
false positives from content-based schemes. In contrast, black-
list schemes store the IP addresses (email addresses can be forged
easily) of all known spammers and refuse to accept emails from
them. A manual creation of such lists is typically highly accurate
but puts quite a high burden on the user to maintain it. PGP key
servers could be considered a manually created global whitelist.
An automatic creation can be realized, for instance, based on previous results of a content-based filter, as is done with so-called autowhitelists in SpamAssassin. Both blacklists and whitelists are rather difficult to maintain, especially when faced with attacks from spammers who want to get their email addresses on the list (whitelist) or off the list (blacklist).
Protocol-based approaches propose changes to the utilized email protocol. Challenge-response schemes [16] require a manual effort to send the first email to a particular recipient. For example, the sender has to go to a certain web page and activate the email manually, which might involve answering a simple question (such as solving a simple mathematical equation). Afterwards, the sender will be added to the recipient’s whitelist such that further emails can be sent without the activation procedure. The activation task is considered too complex for spammers, who usually try to send millions of spam emails at once. An automatic scheme is used in the greylisting approach, where the receiving email server requires each unknown sending email server to resend the email later.
To prevent spammers from forging their identity (and allow for
tracking them), several approaches for sender authentication [5]
have been proposed. They basically add another entry to the DNS
server, which announces the designated email servers for a partic-
ular domain. A server can use a reverse lookup to verify if a re-
ceived email actually came from one of these email servers. Sender
authentication is a requirement for whitelist approaches since oth-
erwise spammers can just use well-known email addresses in the
‘From:’ line. Though it is already implemented by large email providers (e.g., AOL, Yahoo), it also requires further mechanisms, such as a blacklist or a whitelist, for effective spam filtering, since spammers can easily set up their own domains and DNS servers.
Recent approaches have started to exploit information from so-
cial networks for spam detection. Such social network based ap-
proaches construct a graph, whose vertices represent email ad-
dresses. A directed edge is added between two nodes A and B,
if A has sent an email to B. Boykin and Roychowdhury [1] initially classify email addresses based on the clustering coefficient of the graph subcomponent: For spammers, this coefficient is very low because they typically do not exchange emails with each other. In contrast, the clustering coefficient of the subgraph representing the actual social network of a non-spammer (colleagues, friends, etc.) is rather high. The scheme can classify 53% of the emails correctly as ham or spam, leaving the remaining emails for further examination by other approaches. Spammers can attack the scheme by cooperating and building their own social networks. Golbeck and Hendler propose another scheme to rank email addresses, based on exchange of reputation values [6]. The main problem of this approach is that its attack resilience has not been verified.
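To illustrate the clustering-coefficient idea described above, the following is a minimal sketch and not the classifier of [1]. It assumes that edges are treated as undirected and uses a made-up toy graph; the function name `clustering_coefficient` and all addresses are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(neighbors, node):
    """Fraction of pairs of a node's neighbours that are connected to each other.

    `neighbors` maps each address to the set of addresses it exchanged mail with
    (edges treated as undirected, which is an assumption of this sketch).
    """
    nbrs = neighbors.get(node, set())
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for a, b in pairs if b in neighbors.get(a, set()))
    return linked / len(pairs)


# Toy data: a tight circle of colleagues and a spammer mailing unrelated strangers.
graph = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"alice", "bob", "carol"},
    "spammer": {"x1", "x2", "x3"},
    "x1": {"spammer"},
    "x2": {"spammer"},
    "x3": {"spammer"},
}

print(clustering_coefficient(graph, "alice"))
print(clustering_coefficient(graph, "spammer"))
```

In this toy graph the colleagues all know each other, while none of the spammer's recipients do, which is the contrast the scheme relies on.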
2.2 Trust and Reputation Algorithms
Trust and reputation algorithms have become increasingly pop-
ular to rank a set of items, such as web pages (web reputation) or
people (social reputation), for example, when selling products in
online auctions. Their main advantage is that most of them are de-
signed for high attack resilience.
Web reputation schemes result in a single score for each Web
page. PageRank [15] computes these scores by means of link anal-
ysis, i.e., based on the graph inferred from the link structure of the
Web. The main idea is that “a page has a high rank if the sum of the
ranks of its backlinks is high”. Given a page p, the set of its input
links I(p) and output links O(p), the PageRank score is computed
according to the formula:
$$PR(p) = c \cdot \sum_{q \in I(p)} \frac{PR(q)}{|O(q)|} + (1 - c) \cdot E(p) \qquad (1)$$
The damping factor c < 1 (usually 0.85) is necessary to guar-
antee convergence and to limit the effect of rank sinks, one very
simple attack on PageRank. Intuitively, a random surfer will follow an outgoing link from the current page with probability c or will get bored and select a random page with probability (1 − c) (i.e., the E vector has all entries equal to 1/N, where N is the number of pages in the Web graph). To achieve personalization,
the random surfer must be redirected towards the preferred pages
by modifying the entries of E. Several distributions for this vector
have been proposed since: TrustRank [8] biases towards a set of
ham pages in order to identify Web spam, HubRank [2] gives an
additional importance to hubs, pages collecting links to many other
important pages on the Web, etc.
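The PageRank formula and the role of the E vector can be made concrete with a small power iteration. This is a sketch under stated assumptions: the function name `pagerank`, the toy graph, and the treatment of pages without outgoing links (their rank is redistributed according to E) are choices made here for illustration and are not taken from the paper.

```python
def pagerank(out_links, c=0.85, bias=None, tol=1e-10, max_iter=1000):
    """Power iteration for PR(p) = c * sum_{q in I(p)} PR(q)/|O(q)| + (1 - c) * E(p)."""
    nodes = sorted(set(out_links) | {p for targets in out_links.values() for p in targets})
    n = len(nodes)
    # Uniform E (all entries 1/N) unless a biased distribution is supplied.
    e = bias if bias is not None else {p: 1.0 / n for p in nodes}
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(max_iter):
        new = {p: (1 - c) * e.get(p, 0.0) for p in nodes}
        for q in nodes:
            targets = out_links.get(q, [])
            if targets:
                share = c * rank[q] / len(targets)
                for p in targets:
                    new[p] += share
            else:
                # Assumption of this sketch: a page without out-links spreads its rank via E.
                for p in nodes:
                    new[p] += c * rank[q] * e.get(p, 0.0)
        delta = sum(abs(new[p] - rank[p]) for p in nodes)
        rank = new
        if delta < tol:
            break
    return rank


web = {"a": ["c"], "b": ["c"], "c": ["a"]}
uniform = pagerank(web)                   # classic random surfer
biased = pagerank(web, bias={"b": 1.0})   # surfer always restarts at page b
```

Changing only the `bias` argument is what turns the global score into a personalized or trust-biased one; the iteration itself stays the same.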
Personalized PageRank [11] uses a new approach: it focuses on
user proﬁles. One Personalized PageRank Vector (PPV) is com-
puted for each user. The personalization aspect stems from a set
of hubs (H), each user having to select her preferred pages from it.
For each page of H, an auxiliary PPV called basis vector is precom-
puted. Then, PPVs for any preference set P are expressed as a lin-
ear combination of basis vectors. To avoid the massive storage re-
sources the basis hub vectors would use, they are decomposed into
partial vectors (encoding the part unique to each page, computed
at run-time) and the hubs skeleton (capturing the interrelationships
among hub vectors, stored off-line). Section 3.3 discusses how this
can be adapted to our email ranking and classiﬁcation scenario.
Social reputation schemes are usually designed for use within P2P networks. However, they provide a useful insight into utilizing link analysis to construct reputation systems, as well as into identifying different attack scenarios. [21] presents a categorization of trust metrics, as well as a fixed-point personalized trust algorithm inspired by spreading activation models. It can be viewed as an application of PageRank on a sub-graph of the social network. [18] builds a Web of trust by asking each user to maintain trust values on a small number of other users. The algorithm presented is also based on a power iteration, but designed for an application within the context of the Semantic Web, composed of logical assertions. Finally, EigenTrust [12] is a pure fixed-point PageRank-like distributed computation of reputation values for P2P environments. This algorithm is also used in the MailTrust approach [13].
3. MAILRANK
In order to compute a rank for each email address, MailRank
collects data about the social networks derived from email commu-
nication of all MailRank users and aggregates them into a single
email network. Figure 1 depicts an example email network graph.
Node $U_1$ represents the email address of $U_1$, node $U_2$ the email address of $U_2$, and so on. $U_1$ has sent emails to $U_2$, $U_4$, and $U_3$; $U_2$ has sent emails to $U_1$ and $U_4$, etc. These communication acts are then interpreted as trust votes, e.g., from $U_1$ towards $U_2$, $U_4$ and $U_3$, and depicted in the figure using arrows.
Figure 1: Sample email network
Building upon the email network graph, we can use a power iteration algorithm to compute a score for each email address. This score can subsequently be used for at least two purposes, namely: (1) classification into spam and ham emails, and (2) building up a ranking among the remaining ham emails.
The computation includes the email addresses of all voters (i.e., the ‘actively participating’ MailRank users) and the email addresses specified in the votes. Therefore, it is not necessary that all email users participate in MailRank to benefit from it: For example, $U_3$ does not specify any vote but still receives a vote from $U_1$ and will, thus, achieve some score (if $U_1$ is not a spammer itself).
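The way sent emails become trust votes, and the difference between voters and merely known addresses, can be sketched as follows. The log reproduces only the edges named in the text for Figure 1; the helper names and the decision to ignore self-addressed mail are assumptions of this sketch.

```python
from collections import defaultdict


def build_email_network(sent_log):
    """Turn (sender, recipient) pairs into a vote graph: sender -> set of recipients."""
    votes = defaultdict(set)
    for sender, recipient in sent_log:
        if sender != recipient:  # assumption: a mail to oneself is not a vote
            votes[sender].add(recipient)
    return dict(votes)


def known_addresses(votes):
    """All addresses in the graph: voters plus everyone who received a vote."""
    return set(votes) | {r for targets in votes.values() for r in targets}


# Edges named in the text for Figure 1 (one duplicate send included on purpose).
sent_log = [
    ("U1", "U2"), ("U1", "U4"), ("U1", "U3"),
    ("U2", "U1"), ("U2", "U4"),
    ("U1", "U2"),
]
network = build_email_network(sent_log)
voters = set(network)
non_voting = known_addresses(network) - voters
```

Here `non_voting` contains U3 and U4: they cast no votes in this log but still appear in the graph and can therefore receive a score.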
MailRank has the following advantages:
• Shorter individual cold-start phase. If a MailRank user does not know an email address X, MailRank can provide a rank for X as long as at least one other MailRank user has provided information about it. Thus, the so-called “cold-start” phase, i.e., the time a system has to learn until it becomes functional, is reduced: While most successful anti-spam approaches (e.g., Bayesian filters) have to be trained for each single user (in case of an individual filter) or a group of users (for example, in case of a company-wide filter), MailRank requires only a single global cold-start phase when the system is bootstrapped. In this sense it is similar to globally managed whitelists, but it requires less administrative effort to manage the list and it can additionally provide information about “how good” an email address is, and not only a classification into “good” or “bad”.
• High attack resilience. MailRank is based on a power it-
eration algorithm, which is typically highly resistant against
attacks. This will be discussed for MailRank in particular in
Section 4.3.
• Partial participation. Building on the power-law nature of
email networks, MailRank can compute a rank for a high
number of email addresses even if only a subset of email
users actively participates in MailRank.
• Stable results. Social networks are typically very stable, so the computed ratings of the email addresses will also change only slowly over time. Hence, spammers need to behave well for quite some time to achieve a high rank. Though this cannot resolve the spam problem entirely (in the worst case, a spammer could, for example, buy email addresses from people who have behaved well for some time), it will increase the cost for using new email addresses.
• Can reduce load on email servers. Email servers don’t have to process the email body to detect spam. This significantly reduces the computational power needed for spam detection compared to, for example, content-based approaches or collaborative filters [13].
• Personalization. In contrast to spam classification approaches that distinguish only between ‘spam’ and ‘non-spam’, ranking approaches more easily enable personalization features. This is important since there are certain email addresses (e.g., newsletters), which some people consider to be spammers while others don’t. To deal with such cases, a MailRank user can herself decide about the score threshold, below which all email addresses are considered spammers. Moreover, she could use two thresholds to determine ‘spammers’, ‘don’t know’, and ‘non-spammers’. Furthermore, she might want to give more importance to her relatives or to her manager, than to other unrelated persons with a globally high reputation (e.g., Linus Torvalds).
• Scalable computation. Power iteration algorithms have been shown to be computationally feasible for very large graphs, even in the presence of personalization [11].
• Can also counter other forms of spam. When receiving spam phone calls (SPIT, Spam over Internet Telephony), for example, it is not possible to analyze the content of the call before accepting / rejecting it. At best, only the caller identifier is available, which is similar to the sender email address. MailRank can be used to analyze the caller identifier to decide whether a caller is a spammer or not.
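The two-threshold variant mentioned under Personalization in the list above can be sketched as a small function. The function name, the threshold names and the numeric values are hypothetical; the paper does not prescribe concrete values here.

```python
def classify_sender(score, spam_below, ham_above):
    """Three-way classification of a sender score using two user-chosen thresholds."""
    if spam_below > ham_above:
        raise ValueError("spam_below must not exceed ham_above")
    if score < spam_below:
        return "spammer"
    if score > ham_above:
        return "non-spammer"
    return "don't know"


# Hypothetical thresholds picked by one user.
labels = [classify_sender(s, spam_below=0.001, ham_above=0.01) for s in (0.0002, 0.005, 0.2)]
```

Because the thresholds are per user, two users can label the same newsletter address differently without changing the underlying scores.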
The following sections provide more information about each central aspect of MailRank: what data are used by the algorithm, where these data are stored, how the ranks are generated and how we can finally use them for computing global or personalized reputation scores.
3.1 Bootstrapping the email network
As for all trust / reputation algorithms, it is necessary to collect
as many personal votes as possible in order to compute relevant
ratings. Collecting the personal ratings should require few or no
manual user interactions in order to achieve a high acceptance of
the system. Similarly, the system should be maintained with lit-
tle or no effort at all, thus having the rating of each email address
computed automatically.
To achieve these goals, we use already existing data inferred
from the communication dynamics, i.e., who has exchanged emails
with whom. This results in a global email social network. We dis-
tinguish three information sources as best serving our purposes:
1. Email Address Books. If A has the addresses $B_1, B_2, \ldots, B_n$ in its Address Book, then A can be considered to trust them all, or to vote for them.
2. The ‘To:’ Fields of outgoing emails (i.e., ‘To:’, ‘Cc:’ and ‘Bcc:’). If A sends emails to B, then it can be regarded as trusting B, or voting for B. This input data is typically very accurate since it is manually selected (i.e., it does not contain spammer addresses), and it is more accurate than data from address books, since address books can contain old or outdated information and there is normally no information available about when an address book entry was created or last modified. Furthermore, address books are private and would have to be released manually by the owner to be accessible for the MailRank system. In contrast, data based on the ‘To:’ fields can also be extracted automatically via a light-weight email proxy deployable on any machine.
3. Autowhitelists created by anti-spam tools (e.g., SpamAssassin) contain a list of all email addresses from which emails have been received recently, plus one score for each email address which determines if mainly spam or ham emails have been received from the associated email address. All email addresses with a high score can be regarded as being trusted.
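Extracting votes from the recipient fields of an outgoing message (the second source in the list above) can be sketched with Python's standard `email` package. The function name `extract_votes`, the lowercasing of addresses and the sample message are assumptions of this sketch; the paper does not describe the proxy at this level of detail.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_votes(raw_message):
    """Return (sender, recipients) for one outgoing message.

    Every address in 'To:', 'Cc:' and 'Bcc:' counts as one trust vote by the sender.
    """
    msg = message_from_string(raw_message)
    senders = getaddresses(msg.get_all("From", []))
    if not senders or not senders[0][1]:
        raise ValueError("message has no usable From: address")
    sender = senders[0][1].lower()
    recipients = set()
    for header in ("To", "Cc", "Bcc"):
        for _name, addr in getaddresses(msg.get_all(header, [])):
            if addr:
                recipients.add(addr.lower())
    recipients.discard(sender)  # assumption: a copy to oneself is not a vote
    return sender, recipients


raw = (
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.org>, carol@example.org\n"
    "Cc: ALICE@example.org\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
sender, recipients = extract_votes(raw)
```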
3.2 Basic MailRank
The main goal of MailRank is to assign a rank to each email address known to the system and to use this rank (1) to decide whether each email is coming from a spammer or not, and (2) to build up a ranking among the filtered non-spam emails. Its basic version comprises two main steps:
1. Determine a set of email addresses with a very high reputation in the social network.
2. Run the power iteration algorithm on the email network graph, biased on the above determined set to compute the final MailRank score for each email address.
Regarding the attack resilience, it is important for the biasing set not to include any spammer. This is a very efficient way to counter malicious collectives of spammers trying to attack the ranking system [8, 12]. In principle, there are three possible methods to determine the biasing set: manually, automatically, or semi-automatically. A manual selection guarantees that no spammers will be in the biasing set and can in this way counter malicious collectives entirely. An automatic selection can avoid the (possibly costly) manual selection of the biasing set. A semi-automatic selection of the biasing set can use the above described automatic selection to propose a biasing set for being verified manually to be free of spammers. We propose the following heuristics to determine the biasing set automatically:
We first determine the size p of the biasing set by adding up the ranks of the R highest-ranked nodes such that the sum of the ranks of these R nodes is equal to 20% of the total rank in the system. We additionally limit p to the minimum of R and 0.25% of the total number of email addresses in the graph (see footnote 6). In this manner we limit the biasing set to the few most reputable members of the social network, which is possible because of the power-law distribution of email addresses [4, 9]. Thus, we can exclude spammers effectively even if the spammer email addresses constitute the majority in the graph.
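The automatic heuristic above can be sketched as follows. This is one possible reading of the text, not the authors' code: the function name `select_biasing_set`, the rule of stopping as soon as the cumulative rank reaches the 20% share, and the floor of one address for very small graphs are assumptions. The initial ranks are assumed to come from an earlier, unbiased ranking run.

```python
import math


def select_biasing_set(ranks, rank_share=0.20, size_share=0.0025):
    """Pick the top-ranked addresses that will form the biasing set.

    R = number of top addresses whose ranks add up to `rank_share` of the total rank;
    p = min(R, `size_share` of all addresses); the top p addresses are returned.
    """
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    total = sum(ranks.values())
    cumulative, r = 0.0, 0
    for addr in ordered:
        if cumulative >= rank_share * total:
            break
        cumulative += ranks[addr]
        r += 1
    # Assumption: keep at least one address so tiny graphs still get a biasing set.
    cap = max(1, math.floor(size_share * len(ranks)))
    return ordered[:min(r, cap)]


# Synthetic, roughly power-law shaped ranks for 2000 hypothetical addresses.
ranks = {f"u{i}": 1.0 / (i + 1) for i in range(2000)}
biasing = select_biasing_set(ranks)
```

With a heavy-tailed rank distribution, very few addresses already hold 20% of the total rank, so the resulting set stays small.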
The result of the overall MailRank algorithm, the final vector of MailRank scores, can be used to tag an incoming email on the email proxy as (1) non-spammer, if the final score of the sender email address is larger than a threshold T, (2) spammer, if the final score of the sender email address is smaller than T, or (3) unknown, if the email address is not yet known to the system (see footnote 7).
Each user can adjust T according to her preferred ﬁltering level.
If T = 0, the algorithm is effectively used to compute the transitive
closure of the email network graph starting from the biasing set.
This is sufﬁcient to detect all those spammers for which no user
reachable from the biasing set has issued a vote. With T > 0, it
becomes possible to detect spammers even if some non-spammers
vote for spammers (e.g., because the computer of a non-spammer
is infected by a virus). However, in this case some non-spammers
with a very low rank are at risk of being counted as spammers.
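For the case T = 0, the transitive-closure reading can be illustrated with a breadth-first search from the biasing set. This is a sketch of the reachability idea only, assuming that rank flows exclusively along explicit votes; the function name and the toy votes are hypothetical.

```python
from collections import deque


def reachable_from(votes, biasing_set):
    """Addresses reachable from the biasing set by following votes (transitive closure)."""
    seen = set(biasing_set)
    queue = deque(biasing_set)
    while queue:
        voter = queue.popleft()
        for target in votes.get(voter, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


votes = {
    "hub": {"alice"},
    "alice": {"bob"},
    "spam1": {"spam2", "alice"},  # spammers may vote for real users ...
    "spam2": {"spam1"},           # ... and for each other
}
trusted = reachable_from(votes, {"hub"})
```

The spammer addresses vote for each other and even for a legitimate user, but nobody reachable from the biasing set votes for them, so they stay outside `trusted`.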
The Basic MailRank algorithm is summarized in Alg. 3.1.
Algorithm 3.1. The Basic MailRank Algorithm.
Client Side:
  Each vote sent to the MailRank server comprises:
    Addr(u): The hashed version of the email address of the voter u.
    TrustVotes(u): Hashed versions of all email addresses u votes for (i.e., she has sent an email to).
Server Side:
  1: Combine all received data into a global email network graph. Let $T$ be the Markov chain transition probability matrix, computed as:
       ForEach known email address i
         If i is a registered address, i.e., user i has submitted her votes
           ForEach trust vote from i to j
             $T_{ji} = 1 / \mathrm{NumOfVotes}(i)$
         Else ForEach known address j
           $T_{ji} = 1/N$, where N is the number of known addresses.
  3: Determine the biasing set B (i.e., the most popular email addresses)
       3a: Manual selection or
       3b: Automatic selection or
       3c: Semi-automatic selection
  4: Let $T' = c \cdot T + (1 - c) \cdot E$, with $c = 0.85$ and
       $E[i] = [1/\|B\|]_{N \times 1}$ if $i \in B$, or $E[i] = [0]_{N \times 1}$ otherwise
  5: Initialize the vector of scores $x = [1/N]_{N \times 1}$ and the error $\delta = \infty$
  6: While $\delta > \epsilon$, $\epsilon$ being the precision threshold
       $x' = T' \cdot x$
       $\delta = \|x' - x\|$
       $x = x'$
  7: Output $x'$, the global MailRank vector.
  8: Classify each email address in the MailRank network into ‘spammer’ / ‘non-spammer’ based on the score threshold T
Footnote 6: Both values, the ‘20%’ and the ‘0.25%’, have been determined in extensive simulations that are not shown here.
Footnote 7: To allow new, unknown users to participate in MailRank, an automatically generated email could be sent to the unknown user encouraging her to join MailRank (challenge-response scheme), thus bringing her into the non-spammer area of reputation scores.
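The server side of Algorithm 3.1 can be sketched in plain Python as follows. This is an illustrative reading, not the authors' implementation: the function name `basic_mailrank`, the L1 norm for the error, treating an address with no submitted votes as unregistered, and the toy data are assumptions. The biasing set is assumed to be a non-empty subset of the known addresses.

```python
def basic_mailrank(votes, biasing_set, c=0.85, eps=1e-9, max_iter=1000):
    """Biased power iteration over the email network graph.

    votes: voter -> set of addresses voted for; biasing_set: trusted seed addresses.
    """
    known = sorted(set(votes) | {j for targets in votes.values() for j in targets})
    n = len(known)
    bias = {i: (1.0 / len(biasing_set) if i in biasing_set else 0.0) for i in known}
    x = {i: 1.0 / n for i in known}
    for _ in range(max_iter):
        new = {i: (1 - c) * bias[i] for i in known}
        for i in known:
            targets = votes.get(i)
            if targets:
                # Registered voter: column i of T spreads 1/NumOfVotes(i) to each vote.
                share = c * x[i] / len(targets)
                for j in targets:
                    new[j] += share
            else:
                # Address without votes: column i of T is 1/N for every known address.
                share = c * x[i] / n
                for j in known:
                    new[j] += share
        delta = sum(abs(new[i] - x[i]) for i in known)
        x = new
        if delta < eps:  # stop once the change falls below the precision threshold
            break
    return x


votes = {
    "hub": {"alice", "bob"},
    "alice": {"hub", "bob"},
    "bob": {"hub"},
    "spam1": {"spam2"},
    "spam2": {"spam1"},
}
scores = basic_mailrank(votes, biasing_set={"hub"})
```

In this toy network the two spammer addresses only vote for each other and receive nothing from the biasing set, so their scores decay towards zero while the legitimate addresses share the total rank.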
3.3 MailRank with Personalization
As shown in the experiments section, Basic MailRank performs
very well in spam detection, while being highly resistant against
spammer attacks. However, it still has the limitation of being too
general with respect to user ranking. More speciﬁcally, it does not
address that:
• Users generally communicate with persons ranked average
with respect to the overall rankings.
• Users prefer to have their acquaintances ranked higher than other unknown users, even if these latter ones achieve a higher overall reputation from the network.
• There should be a clear difference between a user’s commu-
nication partners, i.e., the ones with a higher rank should be
easily recognizable.
Personalizing on each user’s acquaintances tackles these aspects. Its main effect is boosting the weight of the user’s votes, while decreasing this influence for all the other votes. Thus, the direct communication partners will achieve much higher ranks, even though initially they were not among the highest ones. Moreover, due to the rank propagation, their votes will have a high influence as well.
Now that we have captured the user requirements mentioned, we should also focus our attention on a final design issue of our system: scalability. Simply biasing MailRank on each user’s acquaintances will not scale well, because it must be computed for each preference set, that is, for every registered user.
Jeh and Widom [11] have proposed an approach to calculate Personalized PageRank vectors, which can also be adapted to our scenario, and which can be used with millions of subscribers. To achieve scalability, the resulting personalized vectors are divided in two parts: one common to all users, precomputed and stored off-line (called “partial vectors”), and one which captures the specifics of each preference set, generated at run-time (called “hubs skeleton”). We will have to define a restricted set of users on which rankings can be biased (we shall call this set “hub set”, and denote it by H). There is one partial vector and one hub skeleton for each user from H. Once an additional regular user registers, her personalized ranking vector will be generated by reading the already precomputed partial vectors corresponding to her preference set (step 1), by calculating their hubs skeleton (step 2), and finally by tying these two parts together (step 3). Both the algorithm from step 1 (called “Selective Expansion”) and the one from step 2 (named “Repeated Squaring”) can be mathematically reduced to biased PageRank. The latter decreases the computation error much faster along the iterations and is thus more efficient, but works only with the output of the former one as input. In the final phase, the two sub-vectors resulting from the previous steps are combined into a global one. The algorithm is depicted in the following lines. To make it clearer, we have also collected the most important definitions it relies on in Table 1.
Term                                  Description
Set V                                 The set of all users.
Hub Set H                             A subset of users.
Preference Set P                      Set of users on which to personalize.
Preference Vector p                   Preference set with weights.
Personalized PageRank Vector (PPV)    Importance distribution induced by a preference vector.
Basis Vector $r_u$                    PPV for a preference vector with a single nonzero entry at u.
Hub Vector $r_u$                      Basis vector for a hub user u ∈ H.
Partial Vector $r_u - r_u^H$          Used with the hubs skeleton to construct a hub vector.
Hubs Skeleton $r_u(H)$                Used with partial vectors to construct a hub vector.
Table 1: Terms specific to Personalized MailRank.
Finally, we should note that the original algorithm has been proven by [11] to be equivalent to a biased PageRank. Thus, it preserves all the useful properties of the PageRank algorithm (e.g., convergence in the presence of loops in the voting graph, resistance against malicious attacks, etc.), while being much more scalable.
Algorithm 3.2. Personalized MailRank.

0: (Initializations) Let $u$ be a user from $H$, for which we compute the partial vector,
and the hubs skeleton. Also, let $D[u]$ be the approximation of the basis
vector corresponding to user $u$, and $E[u]$ the error of its computation.

Initialize $D_0[u]$ with:
\[
D_0[u](q) =
\begin{cases}
c = 0.15, & q \in H \\
0, & \text{otherwise}
\end{cases}
\]
Initialize $E_0[u]$ with:
\[
E_0[u](q) =
\begin{cases}
1, & q \in H \\
0, & \text{otherwise}
\end{cases}
\]

1: (Selective Expansion) Compute the partial vectors using
$Q_0(u) = V$ and $Q_k(u) = V \setminus H$, for $k > 0$, in the formulas below:
\[
D_{k+1}[u] = D_k[u] + \sum_{q \in Q_k(u)} c \cdot E_k[u](q) \, x_q
\]
\[
E_{k+1}[u] = E_k[u] - \sum_{q \in Q_k(u)} E_k[u](q) \, x_q
+ \sum_{q \in Q_k(u)} \frac{1-c}{|O(q)|} \sum_{i=1}^{|O(q)|} E_k[u](q) \, x_{O_i(q)}
\]
Under this choice, $D_k[u] + c \cdot E_k[u]$ will converge to $r_u - r_u^H$,
the partial vector corresponding to $u$.

2: (Repeated squaring) Having the results from the first step as input, one can now
compute the hubs skeleton ($r_u(H)$). This is represented by the final $D[u]$ vectors
calculated using $Q_k(u) = H$ into:
\[
D_{2k}[u] = D_k[u] + \sum_{q \in Q_k(u)} E_k[u](q) \cdot D_k[q]
\]
\[
E_{2k}[u] = E_k[u] - \sum_{q \in Q_k(u)} E_k[u](q) \, x_q
+ \sum_{q \in Q_k(u)} E_k[u](q) \, E_k[q]
\]
As this step refers to hub-users only, the computation of $D_{2k}[u]$ and $E_{2k}[u]$
should consider only the components regarding users from $H$,
as it significantly decreases the computation time.

3: Let $p = \alpha_1 u_1 + \cdots + \alpha_z u_z$ be a preference vector,
where $u_i$ are from $H$ and $i$ is between 1 and $z$, and let:
\[
r_p(h) = \sum_{i=1}^{z} \alpha_i \left( r_{u_i}(h) - c \cdot x_{p_i}(h) \right), \quad h \in H
\]
which can be computed from the hubs skeleton.
The PPV $v$ for $p$ can then be constructed as:
\[
v = \sum_{i=1}^{z} \alpha_i \left( r_{u_i} - r_{u_i}^H \right)
+ \frac{1}{c} \sum_{h \in H,\; r_p(h) > 0} r_p(h) \cdot \left[ \left( r_u - r_u^H \right) - c \cdot x_h \right]
\]
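Step 3 relies on the linearity property from [11]: the personalized PageRank vector (PPV) of a weighted combination of preference vectors equals the same weighted combination of the individual PPVs. The sketch below checks this numerically with a plain power iteration rather than with the partial vectors and hubs skeleton of Algorithm 3.2. The helper `ppv`, the toy graph, and the weights are hypothetical, and every node is assumed to cast at least one vote.

```python
def ppv(out_links, preference, c=0.15, iterations=200):
    """Personalized PageRank vector for a preference distribution.

    out_links:  dict node -> non-empty list of nodes it votes for.
    preference: dict node -> weight (weights sum to 1).
    """
    nodes = list(out_links)
    rank = {n: preference.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: c * preference.get(n, 0.0) for n in nodes}
        for n in nodes:
            share = (1 - c) * rank[n] / len(out_links[n])
            for t in out_links[n]:
                nxt[t] += share
        rank = nxt
    return rank


# Toy vote graph; every node has at least one outgoing vote (assumption).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": ["a"]}

r_a = ppv(graph, {"a": 1.0})  # basis vector for hub a
r_b = ppv(graph, {"b": 1.0})  # basis vector for hub b
alpha = {"a": 0.7, "b": 0.3}  # preference vector p = 0.7*a + 0.3*b

direct = ppv(graph, alpha)
combined = {n: alpha["a"] * r_a[n] + alpha["b"] * r_b[n] for n in graph}
max_gap = max(abs(direct[n] - combined[n]) for n in graph)
```

Here `max_gap` stays at the level of floating-point rounding, which is why basis vectors can be precomputed once per hub and blended at query time.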

3.4 MailRank System Architecture
MailRank is composed of a server, which collects all user votes
and delivers a score for any known email address, and an email
proxy on the client side, which interacts with the MailRank server.
The MailRank Server collects the input data (i.e., the votes)
from all MailRank users to run the MailRank algorithm. Each vote
is assigned a lifetime in order to (1) identify and delete email
addresses which have not been used for a long time, and (2) detect
spammers which behave well for some time to get a high rank
and start sending spam emails afterwards.
The MailRank Proxy resides between the user's email client and
her regular local email server. It performs two tasks. When
receiving an outgoing email, it first extracts the user's votes from the
available input data (e.g., by listening to ongoing email activities or
by analyzing existing sent-mail folders). Then, it sends the votes
to the MailRank server and forwards the email to the local email
server. To increase efficiency, only those votes that have not been
submitted yet (or that would expire otherwise) are sent. Also, for
privacy reasons, votes are encoded using hashed versions of email
addresses. Upon receiving an email, the proxy queries the MailRank
server about the ranking of the sender address (if not cached
locally) and classifies / ranks the email accordingly.
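The outgoing path of the proxy can be sketched as follows. The paper does not name a hash function or a wire format, so the use of SHA-256, the address normalization, and the helper names `hash_address` and `votes_to_submit` are assumptions. Re-sending votes that are about to expire is left out for brevity.

```python
import hashlib


def hash_address(address):
    """Hash a normalized email address so the server never sees it in clear text."""
    normalized = address.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def votes_to_submit(sender, recipients, already_submitted):
    """Return hashed (voter, target) pairs that were not sent to the server yet."""
    voter = hash_address(sender)
    pending = []
    for recipient in recipients:
        vote = (voter, hash_address(recipient))
        if vote not in already_submitted and vote not in pending:
            pending.append(vote)
    return pending


# The vote alice -> bob was submitted earlier, so only alice -> carol is new.
submitted = {(hash_address("alice@example.org"), hash_address("bob@example.org"))}
pending = votes_to_submit(
    "Alice@example.org", ["bob@example.org", "carol@example.org"], submitted
)
```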
Further extensions of our prototype will make use of secure signing
schemes to enable us to analyze both outgoing and incoming
emails for extracting the ‘votes’ and submitting them to the
MailRank server.⁸ This not only helps to bootstrap the system initially,
but also introduces the votes of spammers into MailRank. Such
votes have a very positive aspect, since they increase the score of
the spam recipients (i.e., non-spammers). Thus, spammers face
more difficulties when attacking the system to increase their own rank.
3.5 MailRank Under Spammer Attacks
By definition, spammers send the same or a very similar message
to a very large number of recipients (typically millions). However,
they can follow two different strategies to choose the sender address.
First, they use a new (random) email address for each spam message, even if
they send the same message to millions of recipients (from an analysis
we performed on the autowhitelists of several large university
institutions in Germany, we found that 95% of the spammer
addresses were used only once). In this manner, they try to
circumvent blacklists of email addresses. Furthermore, they use
these addresses only for sending spam emails to non-spammers.
Second, they use email addresses of well-known non-spammers
(forging the sender address), assuming that these addresses are in
the whitelists of many spam detection tools. Sender authentication
schemes such as those listed in Sect. 2 already prevent forging the
sender address (when installed on the email server) and are actually
required for any whitelist-based scheme. However, sender
authentication cannot counteract the much more common former
spamming strategy.
As soon as the MailRank service becomes widespread, spammers
will surely try to attack it in order to increase the rank of their
own address(es). We identified and simulated several ways of
attacking MailRank.⁹ For example, spammers could issue votes from
one or several spammer addresses to one or several non-spammer
addresses. However, the algorithm ensures that it is not possible to
change your own score through the votes you issue towards others.
Therefore, such attacks are only reasonable if the spammers vote
for another spammer address to increase its rank, forming a
malicious collective (cf. Fig. 2). This is comparable to link farming on
the Web in order to attack PageRank. However, there has recently
been an extensive amount of work on identifying and neutralizing
such attacks on power iteration algorithms (see for example [20]),
and thus the threat they represent to social reputation schemes has
been significantly reduced.
Figure 2: Malicious collective: nodes 2–N vote for node 1 to
increase the rank of node 1, and node 1 itself votes for node 0,
the email address that is finally used for sending spam emails.
Another possible attack is to make non-spammers vote for spammers.
To counter incidental votes for spammers (e.g., because of a
misconfigured vacation daemon), an additional confirmation process
could be required if a vote for one particular email address
would move that address from ‘spammer’ to ‘non-spammer’. However,
spammers could still pay non-spammers to send spam on their
behalf. Such an attack can be successful initially, but the rank of
the non-spammer addresses will decrease after some time to that
of spammers due to the limited lifetime of votes. We will discuss
simulations based on such attack scenarios in the next section.

⁸ Analyzing incoming votes raises more security issues since we
need to ensure that the sender did indeed vote for the recipient, i.e.,
the vote / email is not faked. This can be achieved by relying on /
extending current sender authentication solutions.
⁹ We refer the reader to [3, 12] for a discussion about attacks in
other environments, such as P2P networks, which were also useful
as a starting point for analyzing attacks in the MailRank scheme.
4. EXPERIMENTAL RESULTS
Real-world data about email networks is almost unavailable for
privacy reasons. Yet some small studies do exist, using
data gathered from the log files of a student email server [4], or of
a company-wide server [9], etc. In all cases, the analyzed email
network graph exhibits a power-law distribution of in-going (exponent
1.49) and out-going (exponent 1.81) links.
To be able to vary certain parameters such as the number of
spammers, we evaluated MailRank using an extensive set of
simulations, based on a power-law model of an email network, following
the characteristics presented in the above-mentioned literature
studies. Additionally, we used an exponential cut-off at both tails
to ensure that a node has at least five and at most 1500 links to other
nodes, which reflects the nature of true social contacts [9]. If not
noted otherwise, the graph consisted of 100,000 non-spammers¹⁰
and the threshold T was set to 0. In a scenario without virus
infections, this is sufficient to detect spammers and to ensure that
non-spammers are not falsely classified. Furthermore, we repeated all
simulations at least three times with different randomly
generated email networks to determine average values. Finally, as
personalization brought a significant improvement only in creating
user-specific rankings of email addresses (i.e., it resulted only in
minor improvements for spam detection), we omitted it here due to
space limitations. Therefore, our analysis focuses on three
issues: effectiveness in case of very sparse MailRank networks
(i.e., only few nodes submit votes, the others only receive votes),
exploitation of spam characteristics, and attacks on MailRank.
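The degree distributions of such a synthetic email network can be sketched as below. This is a simplification, not the generator used for the simulations: it applies a hard truncation to the range 5 to 1500 instead of the exponential cut-off mentioned above, it assumes the convention P(k) proportional to k^(-exponent), and the function name `sample_degrees` is hypothetical. It only draws degree sequences and does not wire them into a graph.

```python
import random


def sample_degrees(n, exponent, k_min=5, k_max=1500, seed=0):
    """Draw n degrees with P(k) proportional to k**(-exponent) on [k_min, k_max]."""
    rng = random.Random(seed)
    ks = list(range(k_min, k_max + 1))
    weights = [k ** (-exponent) for k in ks]
    return rng.choices(ks, weights=weights, k=n)


# Exponents as reported for real email networks in the studies cited above.
out_degrees = sample_degrees(10_000, exponent=1.81)
in_degrees = sample_degrees(10_000, exponent=1.49, seed=1)
```

Most sampled nodes end up close to the lower bound of five links, while a few reach degrees in the hundreds, which is the heavy-tailed shape the simulations rely on.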
4.1 Very Sparse MailRank Networks
In sparse MailRank networks, a certain number of email addresses
only receive votes, but do not provide any because their
owners do not participate in MailRank. In this case, some
non-spammers in the graph could be regarded as spammers, since they
achieve a very low score.
To simulate sparse MailRank networks, we created a full graph
as described above and subsequently deleted votes of a certain set
of email addresses. We used several removal models:
• All: Votes can be deleted from all nodes, chosen at random
(this model is labeled ‘Random’ in Fig. 3).
• Bottom99.9%: Nodes from the top 0.1% are protected from
vote deletion.
• Avg: Nodes having more than the average number of outgoing
links are protected from vote deletion.
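The three removal models can be sketched as a choice of protected nodes followed by a random deletion among the rest. The function names are hypothetical, and ranking the "top 0.1%" by out-degree is an assumption, since the text does not say which ordering defines the top nodes.

```python
import random


def protected_nodes(out_degree, model):
    """Return the nodes whose votes may not be deleted under a removal model."""
    nodes = list(out_degree)
    if model == "all":
        return set()
    if model == "bottom99.9":
        keep = max(1, round(0.001 * len(nodes)))
        # Assumption: "top" means highest out-degree.
        ranked = sorted(nodes, key=lambda n: out_degree[n], reverse=True)
        return set(ranked[:keep])
    if model == "avg":
        average = sum(out_degree.values()) / len(nodes)
        return {n for n in nodes if out_degree[n] > average}
    raise ValueError(f"unknown removal model: {model}")


def delete_votes(votes, out_degree, model, fraction, seed=0):
    """Delete all votes of a random `fraction` of the non-protected nodes."""
    rng = random.Random(seed)
    candidates = sorted(set(votes) - protected_nodes(out_degree, model))
    victims = set(rng.sample(candidates, round(fraction * len(candidates))))
    return {n: ([] if n in victims else list(t)) for n, t in votes.items()}


votes = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a", "b"]}
out_degree = {n: len(t) for n, t in votes.items()}
# Average out-degree is 1.75, so a and d are protected under the Avg model.
sparse = delete_votes(votes, out_degree, model="avg", fraction=1.0)
```

Note that `fraction` is measured against the non-protected nodes only, so a value of 1.0 under the Avg model still leaves the votes of the well-connected nodes in place.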
The first model is rather theoretical, as we expect the highly-connected
non-spammers to register with the system first.¹¹ Therefore, we
protected the votes of the top nodes in the other two methods from
being deleted.¹² Figure 3 depicts the percentage of non-spammers
regarded as spammers, depending on the percentage of nodes with
deleted votes, with the error bars at each point showing the
minimum / maximum over five simulation runs. Non-spammers
registered to the system will be classified as spammers only when very
few, non-reputable MailRank users send them emails. As studies
have shown that people usually exchange emails with at least
five partners, such a scenario is rather theoretical. However, as the
power-law distribution of email communication is expected only
after the system has run for a while, we intentionally allowed such
temporary anomalies in the graph. Even though for high deletion
rates (70–90%) they resulted in some non-spammers being classified
as spammers, MailRank still performed well, especially in the
more realistic ‘avg’ scenario (the bigger error observed in the
theoretical ‘Random’ scenario was expected, since random removal
may result in the deletion of high-rank nodes contributing many
links to the social network). Finally, the error rate decreases fast
when the removal approaches 100%, as the number of nodes known
to the system also decreases.¹³

[Figure 3 plot. x-axis: Nodes with deleted outlinks [%]; y-axis: Non-spammers considered as spammers [%]; number of non-spammers: 100000; curves: Bottom99.9%, Random, Avg]
Figure 3: Very sparse MailRank networks

¹⁰ We also simulated using 10,000 and 1,000,000 non-spammers and
obtained very similar results.
¹¹ Such behavior was also observed in real-life systems, e.g., in the
Gnutella P2P network (…).
¹² The 100% from ‘Bottom99.9%’ and ‘avg’ actually refer to 100%
of the non-protected nodes.
4.2 Exploitation of Spam Characteristics
If we monitor current spammer activities (i.e., sending emails to
non-spammers), the emails / votes from spammers towards
non-spammers can be introduced into the system as well. This way,
spammers actually contribute to improving the spam detection
capabilities of MailRank: the more new spammer email addresses and
emails are introduced into the MailRank network, the more they
increase the score of the receiving non-spammers. This can be seen
in a set of simulations with 20,000 non-spammer addresses and a
varying number of spammers (up to 100,000, cf. Fig. 4), where the
rank of the top 0.25% non-spammers increases linearly with the
number of spammer addresses included in the MailRank graph.
4.3 Attacking MailRank
In order to attack MailRank, spammers must receive
votes from other MailRank users to increase their rank. As long
as nobody votes for spammers, they will achieve a score of zero and
will thus be easily detected. This leaves only two ways of attack:
the formation of malicious collectives and virus infections.
Malicious collectives. The goal of a malicious collective (cf.
Fig. 2) is to aggregate enough score into one node to push it into the
biasing set. If no manually selected biasing set can be used to
prevent this, one of the many existing techniques to identify web link
farms could be employed (see for example [20]). Furthermore, we
require MailRank users willing to submit their votes to manually
register their email address(es). This prevents spammers from
automatically registering millions of email addresses in MailRank and also
increases the cost of forming a malicious collective. To actually
determine the cost of such a manual registration, we have simulated
a set of malicious users as shown in Fig. 2. The resulting position
of node 1, the node that should be pushed into the biasing set, is
depicted in Fig. 5 for an email network of 20,000 non-spammers,
malicious collectives of 1000 nodes each, and an increasing number
of collectives on the x-axis. When there are few large-scale
spammer collectives, the system could be attacked relatively easily.
However, as users must manually register to the system, forming a
collective of sufficient size is practically infeasible. Moreover, in
a real scenario there will be more than one malicious collective, in
which case pushing a node into the biasing set is almost impossible:
as shown in Fig. 5, the more collectives exist in the network, the
more difficult it becomes for a malicious collective to push one node
into the biasing set. This is because the spammers registered
to the system implicitly vote for the non-spammers upon sending
them (spam) emails.

¹³ When all users pointing to a non-registered user have been deleted,
the non-registered user is no longer known to the system.

[Figure 4 plot. x-axis: Number of spammers [*1000]; y-axis: Cumulative rank of the top 0.25% non-spammers; number of non-spammers: 20000]
Figure 4: Rank increase of non-spammer addresses

[Figure 5 plot: Using malicious collectives to push spammer into biasing set. x-axis: Number of collectives; y-axis: Rank; curves: Rank of the highest spammer, Size of the biasing set]
Figure 5: Automatic creation of the biasing set
Virus infections. Another possible attack on MailRank is to
use virus / worm technology to infect non-spammers and make
them vote for spammers. We simulated such an attack according
to Newman’s studies [14], which showed that when the 10% most
connected members of a social network are not immunized (e.g.,
using anti-virus applications), worms would spread too fast.
Simulation results are shown in Fig. 6 with a varying amount of
non-spammers voting for 50% of all spammers. If up to about 25%
of the non-spammers are infected and vote for spammers, there is
still a significant difference between the ranks of non-spammers
and spammers, and no spammer manages to get a higher rank than
the non-spammers. If more than 25% of the non-spammers are infected,
the spammer with the highest rank starts to move up in the rank list
(the upper line from Fig. 6 descends towards rank 1). Along with
this, there will be no clear separation between spammers and
non-spammers, and two threshold values must be employed: one
MailRank score $T_1$ above which all users are considered non-spammers
and another one $T_2 < T_1$ beneath which all are considered
spammers, the members having a score within $(T_2, T_1)$ being classified
as unknown.

[Figure 6 plot: Evaluation of virus attack (20000 non-spammer, 10000 spammer). x-axis: % of email addresses voting for spammers; y-axis: Position on rank list / # of non-spammers; curves: Highest position of spammer, Number of non-spammers with rank > 20000]
Figure 6: Simulation results: Virus attack
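The two-threshold classification described for the virus scenario can be written as a small function. The threshold values used below are hypothetical, and treating a score exactly equal to a threshold as unknown is an assumption, since the text only defines the open ranges above $T_1$ and beneath $T_2$.

```python
def classify(score, t1, t2):
    """Three-way classification with two thresholds, where t2 < t1."""
    if t2 >= t1:
        raise ValueError("expected t2 < t1")
    if score > t1:
        return "non-spammer"
    if score < t2:
        return "spammer"
    return "unknown"  # score lies between t2 and t1


# Hypothetical thresholds and scores, for illustration only.
labels = [classify(s, t1=0.6, t2=0.2) for s in (0.9, 0.4, 0.05)]
```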
5. CONCLUSIONS AND FUTURE WORK
This paper investigated the feasibility of MailRank, a new email
ranking and classification scheme, which intelligently exploits the
social communication network created via email interactions. On
the resulting email network graph, a power-iteration algorithm is
used to rank trustworthy senders and to detect spammers. MailRank
performs well even in the presence of very sparse networks:
even in case of a low participation rate, it can effectively
distinguish between spammer email addresses and non-spammer ones,
including for those users not participating actively. MailRank is also
very resistant against spammer attacks and, in fact, has the
property that when more spammer email addresses are introduced into
the system, the performance of MailRank increases.
Based on these encouraging results we are currently investigat-
ing several future improvements for our algorithms. We intend to
move from a centralized system to a distributed one to make the
system scalable for a large-scale deployment. We are currently in-
vestigating a DNS-like system, in which the computation is han-
dled in a distributed manner by several servers. Finally, another
approach would be to consider each email client as a peer in a P2P
network, and run a distributed approach to MailRank as such.
6. REFERENCES
[1] P.O. Boykin and V. Roychowdhury. Leveraging social
networks to fight spam. IEEE Computer, 38(4):61–68, 2005.
[2] Paul-Alexandru Chirita, Daniel Olmedilla, and Wolfgang
Nejdl. Finding related pages on the link structure of the
www. In Proceedings of the 3rd IEEE/WIC/ACM
International Web Intelligence Conference, Sep 2004.
[3] A. Clausen. The Cost of Attack of PageRank. In Proc. of the
International Conference on Agents, Web Technologies and
Internet Commerce (IAWTIC), Gold Coast, 2004.
[4] H. Ebel, L. I. Mielsch, and S. Bornholdt. Scale-free topology
of email networks. Physical Review E 66, 2002.
[5] D. Geer. Will new standards help curb spam? IEEE
Computer, pages 14–16, February 2004.
[6] J. Golbeck and J. Hendler. Reputation Network Analysis for
Email Filtering. In Proc. of the Conference on Email and
Anti-Spam (CEAS), Mountain View, CA, USA, July 2004.

[7] A. Gray and M. Haahr. Personalised, Collaborative Spam
Filtering. In Proc. of the Conference on Email and
Anti-Spam (CEAS), Mountain View, CA, USA, July 2004.
[8] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating
web spam with trustrank. In Proceedings of the 30th
International VLDB Conference, 2004.
[9] B. A. Huberman and L. A. Adamic. Information dynamics in
the networked world. Complex Networks, Lecture Notes in
Physics, 2003.
[10] Isode. Benchmark and comparison of spamassassin and
m-switch anti-spam. Technical report, Isode, April 2004.
[11] G. Jeh and J. Widom. Scaling personalized web search. In
Proc. of the 12th Intl. WWW Conference, 2003.
[12] S. Kamvar, M. Schlosser, and H. Garcia-Molina. The
EigenTrust Algorithm for Reputation Management in P2P
Networks. In Proc. of the 12th Intl. WWW Conference, 2003.
[13] J.S. Kong, P.O. Boykin, B.A. Rezaei, N. Sarshar, and
V. Roychowdhury. Let your CyberAlter Ego Share
Information and Manage Spam. Technical report, University
of California, USA, 2005. Preprint.
[14] M. E. J. Newman, S. Forrest, and J. Balthrop. Email
networks and the spread of computer viruses. Physical
Review E 66, 2002.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The
pagerank citation ranking: Bringing order to the web.
Technical report, Stanford University, 1998.
[16] M. Perone. An overview of spam blocking techniques.
Technical report, Barracuda Networks, 2004.

[17] P. Resnick and H.R. Varian. Recommender Systems.
Communications ACM, 40(3):56–58, 1997.
[18] M. Richardson, R. Agrawal, and P. Domingos. Trust
management for the semantic web. In Proceedings of the 2nd
International Semantic Web Conference, 2003.
[19] G.L. Wittel and S.F. Wu. On Attacking Statistical Spam
Filters. In Proc. of the Conference on Email and Anti-Spam
(CEAS), Mountain View, CA, USA, July 2004.
[20] B. Wu and B. Davison. Identifying link farm spam pages. In
Proc. of the 14th Intl. WWW Conference. ACM Press, 2005.
[21] C. Ziegler and G. Lausen. Spreading activation models for
trust propagation. In Proc. of the IEEE Intl. Conference on
e-Technology, e-Commerce, and e-Service, 2004.
