Research on developing solutions for ensuring information security in the data mining process (English summary)

B GIÁO DC VÀ ÀO TO B QUC PHÒNG
VIN KHOA HC VÀ CÔNG NGH QUÂN S








LNG TH DNG





DISTRIBUTED SOLUTIONS IN PRIVACY
PRESERVING DATA MINING
(Nghiên cu xây dng mt s gii pháp đm bo an
toàn thông tin trong quá trình khai phá d liu)


Chuyên ngành: Bo đm toán hc cho máy tính và h thng tính toán.

Mã s : 62 46 35 01






TÓM TT LUN ÁN TIN S TOÁN HC





Hà Ni, 2011
Chapter 1
INTRODUCTION
1.1 Privacy-preserving data mining: An overview
Data mining plays an important role in the current world and provides a
powerful tool for efficiently discovering valuable information in large databases
[Han and Kamber, 2006]. However, the process of mining data can result in a
violation of privacy. As a result, a large number of studies have been produced
on the topic of privacy-preserving data mining (PPDM) [Verykios et al., 2004].
These studies deal with the problem of learning data mining models from
databases while protecting data privacy at the individual or organizational level.
PPDM can be divided into the following three areas [Charu and Yu, 2008].
The first area is privacy-preserving data publishing. Studies in this area
allow an organization (party) to publish its data to miners, and are concerned with
how to publish the data so that the anonymized data remain useful for data mining
applications. The second area is privacy-preserving distributed data mining.
The model in this area usually consists of several parties, each holding a private
data set. The general purpose is to enable the parties to mine cooperatively on
their joint data sets without revealing the private information of any party. Here,
the data may be partitioned either vertically or horizontally. The
third area is a scenario in which a data miner surveys a large number of users to
learn some results from the user data, or collects the user data, while protecting
the sensitive attributes of these users.
1.2 Contributions
Many solutions to PPDM problems are already available
[Kargupta et al., 2003], [Dowd et al., 2005], [Vaidya et al., 2008]. The quality
of each solution is evaluated on three basic characteristics: degree of privacy,
accuracy, and efficiency. The problem is that each solution has been used
only in a particular distributed scenario or for a specific data mining algorithm.
Some of them can be applied to more than one scenario or algorithm,
but their accuracy falls below acceptable requirements. Other solutions reach
the required accuracy, but their privacy is poor. In addition, PPDM solutions are
clearly lacking for various practical contexts as well as for well-known data
mining techniques. In summary, the key contributions of the thesis are as follows:
1. The first contribution is a new scenario for privacy-preserving user data
mining, called the 2-part fully distributed setting (2PFD), together with a
solution for a family of frequency-based learning algorithms in the 2PFD setting.
2. The second contribution is a set of novel privacy-preserving protocols for
frequent itemset mining in vertically distributed data. The security of our
protocols improves on that of previous protocols in that we achieve full
privacy protection for each party. This property does not require the
existence of any trusted party. In addition, no collusion of parties can cause
a privacy breach.
3. The third contribution is, first, a privacy-preserving EM-based clustering
protocol for the multi-party model. Our protocol is more secure than the
existing ones because it resists collusion. In addition, it works not only
for three or more parties but also for two parties. Second, we propose a
better protocol for the case in which the data set is horizontally partitioned
into only two parts. This case requires protecting the privacy of the cluster
centers.
4. The fourth contribution is a technique for designing protocols for
privacy-preserving multivariate outlier detection in both horizontally and
vertically distributed data.
The developed solutions are evaluated in terms of the degree of privacy
protection, correctness, efficiency, and scalability. The contributions of this
thesis are solutions to four problems in PPDM. Each problem is stated
independently of the others, but they share a common framework: we need to
find knowledge in a distributed dataset while guaranteeing the privacy of the
parties involved. The problems differ in the way the dataset is obtained from
the distributed parties and in the function proposed to protect the users'
private information.
1.3 Organization of thesis
The thesis consists of six chapters and 109 A4 pages. Chapter 1 gives an
overview of PPDM and related work. Chapter 2 presents the basic definitions
of secure multi-party computation and the techniques used frequently in the
thesis. Chapter 3 proposes privacy-preserving frequency-based learning protocols
in the 2PFD setting. Chapter 4 presents two privacy-preserving protocols for
distributed mining of frequent itemsets. Chapter 5 discusses privacy-preserving
EM-based clustering protocols. Chapter 6 presents the technique for designing
privacy-preserving outlier detection protocols for both vertically and
horizontally distributed data. The conclusion is given in the last section of
the thesis.
Chapter 2
METHODS FOR SECURE MULTI-PARTY COMPUTATION

In this thesis, we use secure multi-party computation (SMC) and cryptographic
tools as the building blocks for designing privacy-preserving data mining protocols.
Before discussing the details, this chapter first reviews some important
definitions of SMC. We then summarize the techniques that will be used in the
following chapters.
2.1 Definitions
In this section, we review basic definitions from computational complexity theory
and SMC that will be used in this thesis [Goldreich, 2004].
Definition 2.1. Let $\mathbb{N}$ be the set of natural numbers. We say the function
$\epsilon(\cdot) : \mathbb{N} \to (0, 1]$ is negligible in $n$ if for every positive integer
polynomial $\mathrm{poly}(\cdot)$ there exists an integer $n_0 > 0$ such that for all $n > n_0$,
$$\epsilon(n) < \frac{1}{\mathrm{poly}(n)}.$$
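As a quick illustration of Definition 2.1 (a worked example added for orientation, not taken from the thesis), the function $2^{-n}$ is negligible while $n^{-2}$ is not:

```latex
% Example 1: \epsilon(n) = 2^{-n} is negligible.
% For any positive polynomial poly(n), the exponential 2^n eventually dominates it:
\[
  \lim_{n \to \infty} \frac{\mathrm{poly}(n)}{2^{n}} = 0
  \quad\Longrightarrow\quad
  \exists\, n_0 :\ \forall n > n_0,\quad 2^{-n} < \frac{1}{\mathrm{poly}(n)} .
\]
% Example 2: \epsilon(n) = n^{-2} is not negligible.
% Take poly(n) = n^{3}; then for every n > 1
\[
  \frac{1}{n^{2}} > \frac{1}{n^{3}} = \frac{1}{\mathrm{poly}(n)} ,
\]
% so no threshold n_0 can satisfy the definition for this polynomial.
```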
Computational indistinguishability is another important concept when
discussing the security properties of distributed protocols [Goldreich, 2004]. Let
$X = \{X_n\}_{n \in \mathbb{N}}$ be an ensemble indexed by a security parameter $n$ (which usually
refers to the length of the input), where the $X_i$ are random variables.
Definition 2.2. Two ensembles, $X = \{X_n\}_{n \in \mathbb{N}}$ and $Y = \{Y_n\}_{n \in \mathbb{N}}$, are
computationally indistinguishable in polynomial time if for every probabilistic
polynomial-time algorithm $A$,
$$|\Pr(A(X_n) = 1) - \Pr(A(Y_n) = 1)|$$
is a negligible function in $n$. In this case, we write $X \stackrel{c}{\equiv} Y$, where
$\stackrel{c}{\equiv}$ denotes computational indistinguishability.
Secure multi-party computation function: Consider a distributed network with $n$
participating parties. A secure $n$-party computation problem can generally be
considered as the computation of a function
$$f(x_1, x_2, \ldots, x_n) \to (f_1(x_1, x_2, \ldots, x_n), \ldots, f_n(x_1, x_2, \ldots, x_n)),$$
where each party $i$ knows only its private input $x_i$. For security, it is required
that the privacy of any honest party's input is protected, in the sense that each
dishonest party $i$ learns nothing except its own output $y_i = f_i(x_1, x_2, \ldots, x_n)$. If
there is any malicious party that may deviate from the protocol, it is also required
that each honest party gets a correct result whenever possible.
Privacy in the semi-honest model: In the distributed setting, let $\pi$ be an
$n$-party protocol for computing $f$. Let $\bar{x}$ denote $(x_1, \ldots, x_n)$. The view of the
$i$-th party ($i \in [1, n]$) during an execution of $\pi$ on $\bar{x}$ is denoted by
$\mathrm{view}^{\pi}_{i}(\bar{x})$, which includes $x_i$, all received messages, and all internal coin
flips. For every subset $I$ of $[1, n]$, namely $I = \{i_1, \ldots, i_t\}$, let $f_I(\bar{x})$ denote
$(y_{i_1}, \ldots, y_{i_t})$ and let
$$\mathrm{view}^{\pi}_{I}(\bar{x}) = (I, \mathrm{view}^{\pi}_{i_1}(\bar{x}), \ldots, \mathrm{view}^{\pi}_{i_t}(\bar{x})).$$
Let $\mathrm{OUTPUT}(\bar{x})$ denote the output of all parties during the execution of $\pi$.
Definition 2.3. An $n$-party computation protocol $\pi$ for computing $f(\cdot, \ldots, \cdot)$ is
secure with respect to semi-honest parties if there exists a probabilistic
polynomial-time algorithm, denoted by $S$, such that for every $I \subset [1, n]$ we have
$$\{(S(x_{i_1}, \ldots, x_{i_t}, f_I(\bar{x})), f(\bar{x}))\} \stackrel{c}{\equiv} \{(\mathrm{view}^{\pi}_{I}(\bar{x}), \mathrm{OUTPUT}(\bar{x}))\}.$$
This definition states that the view of the parties in $I$ can be simulated from
only those parties' inputs and outputs. If the function is privately computed by the
protocol, then the privacy of each party's input data is protected. In this thesis, we
focus on designing privacy-preserving protocols in the semi-honest model. The
formal definition of security in the malicious model can be found
in [Goldreich, 2004]. We also use the composition theorem for the semi-honest
model, whose discussion and proof can be found in [Goldreich, 2004].
Theorem 2.1 (Composition theorem). Suppose that $g$ is privately reducible to $f$,
and that there exists a protocol for privately computing $f$. Then there exists a
protocol for privately computing $g$.
2.2 Secure computation
Secret sharing: Secret sharing refers to any method by which a secret can be
shared among multiple parties in such a way that no single party knows the secret,
but the secret is easy to reconstruct by combining the shares of some of the
parties. Examples are the Shamir secret sharing scheme [Shamir, 1979] and the
secure mean sharing protocol described in Chapter 5.
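The following minimal sketch illustrates $(n, k)$ Shamir secret sharing as cited above. The field modulus `PRIME`, the function names, and the use of Python's `secrets` module are illustrative assumptions and are not taken from the thesis.

```python
import secrets

# Assumed field size for illustration: the Mersenne prime 2^61 - 1.
PRIME = 2**61 - 1

def split(secret, n, k):
    """Split `secret` into n shares so that any k of them reconstruct it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):      # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(i, poly(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Any $k$ shares suffice, for example `reconstruct(split(1234, 5, 3)[:3])`; fewer than $k$ shares leave the constant term undetermined.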
Variant ElGamal cryptosystem: Our protocols in Chapters 3 and 4 are based
on a standard variant of the ElGamal encryption scheme. ElGamal encryption
is semantically secure under the decisional Diffie-Hellman (DDH) assumption
[Boneh, 1998]. The computations are carried out in $\mathbb{Z}_p$ and the message space is
$\mathbb{Z}_q$, where $p$ and $q$ are primes and $q \mid (p - 1)$. We briefly review the variant of
the ElGamal encryption scheme as follows.
Let $G$ be a cyclic group of order $q$ ($G$ is a subgroup of $\mathbb{Z}_p^*$). Let $g$ be a generator
of $G$, let $f \in G$ be randomly selected, and let $x$ be chosen uniformly from $[1, q - 1]$.
In the ElGamal encryption scheme, $x$ is the private key and the public key is
$h = g^x$. Each user keeps their own private key secret, whereas public keys are
publicly known.
To encrypt a message $m$ using the public key $h$, one randomly chooses $k$ from
$\{1, \ldots, q - 1\}$ and then computes the ciphertext $C = (C_1 = f^m h^k, C_2 = g^k)$. The
ciphertext $C$ is decrypted with the private key $x$ by computing
$f^m = C_1 (C_2^x)^{-1}$ and then finding $m$ from $f^m$.
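The scheme above can be sketched in a few lines of Python. The toy parameters ($p = 2039$, $q = 1019$, $g = 4$), the choice of $f$, and the brute-force search for $m$ are assumptions made only to keep the example small and runnable; they are far too small to be secure. The `add` helper shows the additive homomorphism that Chapter 3 relies on.

```python
import secrets

# Toy parameters (assumed): p = 2q + 1 with p and q prime.
P, Q = 2039, 1019
G = 4                 # generates the subgroup of order q in Z_p^*
F = pow(G, 7, P)      # an arbitrary subgroup element playing the role of f

def keygen():
    x = secrets.randbelow(Q - 1) + 1          # private key in [1, q-1]
    return x, pow(G, x, P)                    # (x, h = g^x)

def encrypt(m, h):
    k = secrets.randbelow(Q - 1) + 1
    return (pow(F, m, P) * pow(h, k, P) % P,  # C1 = f^m h^k
            pow(G, k, P))                     # C2 = g^k

def decrypt(c, x, max_m):
    c1, c2 = c
    fm = c1 * pow(pow(c2, x, P), -1, P) % P   # f^m = C1 (C2^x)^(-1)
    for m in range(max_m + 1):                # m is found by a small search
        if pow(F, m, P) == fm:
            return m
    raise ValueError("message outside the search range")

def add(c, d):
    """Component-wise product of ciphertexts encrypts the sum of the messages."""
    return (c[0] * d[0] % P, c[1] * d[1] % P)
```

Because $m$ is recovered by searching over $f^m$, this variant is only practical when the message range is small, which is the case for the counts computed later.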
Decisional Diffie-Hellman assumption: For uniformly random $a, b, c \in [0, q - 1]$,
the DDH assumption is that $\{g^a, g^b, g^{ab}\} \stackrel{c}{\equiv} \{g^a, g^b, g^c\}$.
Oblivious polynomial evaluation (OPE) [Naor and Pinkas, 1999]: This
problem involves a sender (Alice) and a receiver (Bob). The sender's input is
a polynomial $P(y) = \sum_{i=0}^{k} a_i y^i$ of degree $k$ over some finite field $F$, and the
receiver's input is an element $x \in F$ (the degree $k$ of $P$ is public). The protocol
is such that the receiver obtains $P(x)$ without learning anything else about the
polynomial $P$, and the sender learns nothing. In other words, an OPE protocol
computes the following function:
$$(P(y), x) \to (\emptyset, P(x))$$
Secure scalar product sharing (SSP): Assume that two vectors $A = (a_1, \ldots, a_n)$
and $B = (b_1, \ldots, b_n)$ are owned by two parties, Alice and Bob, respectively. A
privacy-preserving scalar product sharing protocol allows Alice to learn $r_A$
and Bob to learn $r_B$, where $r_A$ and $r_B$ are random integers, called shares, between
$0$ and $M - 1$ such that $r_A + r_B \bmod M = A \cdot B$ (where $A \cdot B \in [0, M - 1]$). In other
words, an SSP protocol computes the following function:
$$(A, B) \to (r_A, r_B) \mid r_A + r_B \bmod M = A \cdot B$$
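To make the input/output behaviour concrete, the sketch below computes the SSP functionality in the clear, as a trusted third party would. It is not a secure protocol: it only shows what shares a real SSP protocol must output. The function name and the example modulus are assumptions.

```python
import secrets

def scalar_product_shares(a, b, modulus):
    """Ideal functionality only: compute A.B in the clear and split it additively."""
    dot = sum(x * y for x, y in zip(a, b))
    if not 0 <= dot < modulus:
        raise ValueError("A.B must lie in [0, M - 1]")
    r_a = secrets.randbelow(modulus)     # Alice's share, uniform in [0, M - 1]
    r_b = (dot - r_a) % modulus          # Bob's share completes the sum mod M
    return r_a, r_b
```

Each share on its own is uniformly distributed and therefore carries no information about $A \cdot B$; only the sum modulo $M$ does.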
Privately computing ln x [Kantarcioglu, 2005]: In secure multi-party mean
computation, we need to be able to privately share $\ln x$, where $x = x_1 + x_2$ with
$x_1$ known to $P_1$ and $x_2$ known to $P_2$. Thus, $P_1$ should get $y_1$ and $P_2$ should get
$y_2$ such that $y_1 + y_2 = \ln x = \ln(x_1 + x_2)$. In other words, a protocol for
computing $\ln(x)$ implements the following function:
$$(x_1, x_2) \to (y_1, y_2) \mid y_1 + y_2 = \ln(x_1 + x_2)$$
Chapter 3
PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN
2PFD SETTING
3.1 Introduction
In this chapter, we consider privacy-preserving frequency-based learning in the
so-called 2-part fully distributed setting (2PFD). In this scenario, the dataset is
distributed across a large number of users, and each record is owned by two
different users: one user knows only the values of a subset of the attributes, while
the other knows the values of the remaining attributes. A miner aims to learn
frequency-based models from their data while preserving each user's sensitive
attributes. Some solutions based on randomization techniques can address this
problem, but they suffer from a tradeoff between privacy and accuracy. In this
chapter, we develop a cryptographic method that ensures each user's privacy without
loss of accuracy. Our key contribution is a privacy-preserving frequency
computation method for the 2-part fully distributed setting. To illustrate the
applicability of this method, we use it to build a privacy-preserving protocol for
naive Bayes classifier learning and show its other applications. The experimental
results show that our protocol is efficient in practice.
3.2 Privacy preserving frequency mining in 2PFD setting
3.2.1 Problem formulation
The frequency computation problem in 2PFD can be reduced to the following
simpler problem.
Assume that there are $n$ pairs of users $(U_i, V_i)$; each $U_i$ has a binary number
$u_i$ and each $V_i$ has a binary number $v_i$. The privacy-preserving frequency
computation problem is to allow a miner to compute $f = \sum_{i=1}^{n} u_i v_i$ without
disclosing any information about $u_i$ and $v_i$. In other words, we need a
privacy-preserving protocol for computing the following function:
$$(u_1, v_1, \ldots, u_n, v_n) \to \sum_{i=1}^{n} u_i v_i$$
This notation implies that each pair $U_i$ and $V_i$ provides the inputs $u_i$ and $v_i$
to the protocol, and the miner receives the output $\sum_{i=1}^{n} u_i v_i$ without any other
information.
3.2.2 Definition of privacy
The definition of privacy given below can be viewed as a simplification of the
general definition in the semi-honest model [Goldreich, 2004]. Basically, the
definition states that the computation is secure if the joint view of the miner and
the corrupted users (the $t_1$ users $U_i$ and the $t_2$ users $V_i$) during the execution of
the protocol can be effectively simulated by a simulator that uses only the
result $f$, the corrupted users' knowledge, and the public keys. Therefore, the
miner and the corrupted users cannot learn anything beyond what follows from $f$
and their own knowledge.
3.2.3 Frequency mining protocol
Our protocol is based on the homomorphic property of a variant of
ElGamal encryption [Hirt and Sako, 2000]. The privacy of our protocol rests on
the semantic security of the ElGamal encryption scheme under the DDH assumption,
which was introduced in the previous chapter.
Let $p$ and $q$ be two primes such that $q \mid (p - 1)$, let $G$ be a subgroup of $\mathbb{Z}_p^*$ of
order $q$, and let $g$ be a generator of $G$. In the proposed protocol, we assume that
each user $U_i$ has private keys $x_i, y_i$ chosen uniformly from $\{1, \ldots, q - 1\}$ and public
keys $X_i = g^{x_i}$, $Y_i = g^{y_i}$. Each user $V_i$ has private keys $p_i, q_i$ and public keys
$P_i = g^{p_i}$, $Q_i = g^{q_i}$. We note that computations in this thesis always take place
in $\mathbb{Z}_p$. We define
$$X = \prod_{i=1}^{n} X_i P_i = g^{x} \quad \text{and} \quad Y = \prod_{i=1}^{n} Y_i Q_i = g^{y},$$
where $x = \sum_{i=1}^{n} (x_i + p_i)$ and $y = \sum_{i=1}^{n} (y_i + q_i)$. In the proposed protocol, $X$ and $Y$
are known to all users. Our protocol is presented in Figure 3.1.
3.2.4 Correctness and Privacy Analysis
In the thesis, we prove the correctness of the protocol and show that, under
the semantic security of the ElGamal encryption scheme, our protocol
preserves each user's privacy in the semi-honest model.
Theorem 3.1. The protocol presented in Figure 3.1 correctly computes the
frequency value $f = \sum_{i=1}^{n} u_i v_i$ as defined in Subsection 3.2.1.
Theorem 3.2. Assuming that $f < n$, the protocol in Figure 3.1 preserves the
privacy of the honest users against the miner and up to $2n - 2$ corrupted users.
In cases with only two honest users, the conclusion remains correct as long as the
two honest users do not hold the attribute values of the same record.
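The correctness claim of Theorem 3.1 can be checked directly from the messages of Figure 3.1. The derivation below is a sketch reconstructed from the figure's notation, not the proof given in the full thesis.

```latex
\begin{align*}
R_1^{(i)} \bigl(R_2^{(i)}\bigr)^{-x_i}
  &= g^{u_i v_i}\, X^{q_i}
  && \text{(for both values of } v_i \text{, since } X_i = g^{x_i}\text{)} \\
K_1^{(i)} &= g^{u_i v_i}\, X^{q_i + y_i},
  \qquad K_2^{(i)} = Y^{p_i + x_i} \\
d = \prod_{i=1}^{n} \frac{K_1^{(i)}}{K_2^{(i)}}
  &= g^{\sum_{i} u_i v_i} \cdot
     \frac{X^{\sum_{i} (q_i + y_i)}}{Y^{\sum_{i} (p_i + x_i)}}
   = g^{f} \cdot \frac{g^{xy}}{g^{yx}} = g^{f}
\end{align*}
```

The masks cancel only after all $n$ pairs have contributed, which is why the miner learns $g^{f}$ but no individual product $u_i v_i$.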
• Phase 1. Each user $U_i$ does as follows:
– Randomly choose $k_i$ from $\{1, \ldots, q - 1\}$.
– Compute $C^{(i)} = (C_1^{(i)}, C_2^{(i)}) = (g^{u_i} X_i^{k_i}, g^{k_i})$.
– Send $C^{(i)}$ to the miner.
• Phase 2. Each user $V_i$ does as follows:
– Get $C^{(i)}$ from the miner.
– Randomly choose $r_i$ from $\{1, \ldots, q - 1\}$.
– If $v_i = 0$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (X_i^{r_i} X^{q_i}, g^{r_i}, Y^{p_i})$.
– If $v_i = 1$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (g^{u_i} X_i^{r_i + k_i} X^{q_i}, g^{r_i + k_i}, Y^{p_i})$.
– Send $R^{(i)}$ to the miner.
• Phase 3. Each user $U_i$ does as follows:
– Get $R^{(i)}$ from the miner.
– Compute $K(u_i, v_i) = (K_1^{(i)}, K_2^{(i)}) = (R_1^{(i)} (R_2^{(i)})^{-x_i} X^{y_i}, R_3^{(i)} Y^{x_i})$.
– Send $K(u_i, v_i)$ to the miner.
• Phase 4. The miner does as follows:
– Compute $d = \prod_{i=1}^{n} \frac{K_1^{(i)}}{K_2^{(i)}}$.
– Find $f$ from $\{0, 1, \ldots, n\}$ that satisfies $g^f = d$.
– Output $f$.

Figure 3.1: Frequency mining protocol
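The four phases of Figure 3.1 can be simulated in a single process as sketched below. This is an illustrative simulation under assumed toy parameters ($p = 2039$, $q = 1019$, $g = 4$); the function names are hypothetical, all parties run in one loop, and no networking or adversary is modelled. In Phase 2 the user $V_i$ obtains the $v_i = 1$ message by multiplying the received ciphertext by fresh randomness, since $V_i$ does not know $u_i$.

```python
import secrets

P, Q, G = 2039, 1019, 4   # toy group (assumed): p = 2q + 1, g has order q

def rand_exp():
    return secrets.randbelow(Q - 1) + 1           # uniform in {1, ..., q-1}

def inv(a):
    return pow(a, -1, P)

def frequency_mining(u, v):
    n = len(u)
    # Key setup: U_i holds (x_i, y_i), V_i holds (p_i, q_i).
    x = [rand_exp() for _ in range(n)]
    y = [rand_exp() for _ in range(n)]
    p = [rand_exp() for _ in range(n)]
    q = [rand_exp() for _ in range(n)]
    X = pow(G, sum(x) + sum(p), P)                # X = prod X_i P_i
    Y = pow(G, sum(y) + sum(q), P)                # Y = prod Y_i Q_i
    d = 1
    for i in range(n):
        Xi = pow(G, x[i], P)
        # Phase 1 (U_i): C = (g^u X_i^k, g^k).
        k = rand_exp()
        c1, c2 = pow(G, u[i], P) * pow(Xi, k, P) % P, pow(G, k, P)
        # Phase 2 (V_i): start from (1, 1) if v_i = 0, from C if v_i = 1.
        r = rand_exp()
        b1, b2 = (c1, c2) if v[i] == 1 else (1, 1)
        r1 = b1 * pow(Xi, r, P) * pow(X, q[i], P) % P
        r2 = b2 * pow(G, r, P) % P
        r3 = pow(Y, p[i], P)
        # Phase 3 (U_i): K1 = R1 R2^(-x_i) X^(y_i), K2 = R3 Y^(x_i).
        k1 = r1 * inv(pow(r2, x[i], P)) * pow(X, y[i], P) % P
        k2 = r3 * pow(Y, x[i], P) % P
        # Phase 4 (miner): accumulate d = prod K1 / K2.
        d = d * k1 * inv(k2) % P
    # Miner: find f in {0, ..., n} with g^f = d.
    return next(f for f in range(n + 1) if pow(G, f, P) == d)
```

With realistic key sizes (the experiments in Section 3.2.5 use a 1024-bit $p$ and a 160-bit $q$), the final search over $\{0, \ldots, n\}$ remains cheap because $f$ is at most $n$.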
3.2.5 Efficiency of frequency mining protocol
The computational cost for each user $U_i$ is 2 modular exponentiations in the
first phase and 3 in the third phase. The computational cost for each user $V_i$ in
the second phase is at most 3 modular exponentiations. The miner
uses $2n$ modular multiplications and at most $n$ comparisons. To evaluate the
efficiency of the protocol in practice, we conducted an experiment implemented in
C# on a PC. We measured the computation cost of the protocol for $n$ from
1000 to 5000. Before executing the protocol, we generated the key pairs for each
user, with the sizes of $p$ and $q$ set to 1024 bits and 160 bits, and computed the
values $X$ and $Y$. The results show that the average time used by each $U_i$ to compute
the first-phase and third-phase messages is about 21 ms and 29 ms,
respectively. Each $V_i$ needs about 32 ms on average to compute her messages. The
miner's computation time is nearly linear in $n$; for example, when $n = 5000$,
the miner needs only about 460 ms.
3.3 Frequency-based Learning in 2PFD Setting
The frequency mining method is useful in privacy-preserving data mining
applications whose learning is based on frequencies, such as naive Bayes,
association rule mining, ID3 learning, and Pearson correlation analysis. In this
thesis, we demonstrate the usefulness of the frequency mining method by using it
as a primitive to design a privacy-preserving protocol for naive Bayes learning.
3.4 An improvement of frequency mining protocol
3.4.1 Improved frequency mining protocol
A problem of the frequency mining protocol is that a single client may be able
to disrupt the system. Our purpose is therefore to improve the frequency mining
protocol so that a set $S$ of $t$ user pairs can obtain the frequency without
requiring the presence of all users, where $t \ge k$ and $k$ is a predefined threshold.
We extend the idea of the threshold decryption system [Noack and Spitz, 2009] to
solve this problem. For an $(n, k)$ threshold scheme, the basic idea is that a private
key is shared among $n$ users by $(n, k)$-Shamir secret sharing, so that when a set
$T$ of only $k$ users takes part in the protocol, the miner can decrypt a ciphertext
by using Lagrange interpolation without explicitly reconstructing the private key.
In the proposed protocol, we assume that two key seeds $x_0, p_0 \in [1, q - 1]$ are
shared among the $n$ users $U_i$ and the $n$ users $V_i$ by $(n, k)$-Shamir secret sharing.
The shares owned by $U_i$ and $V_i$ are $x_i = f(i)$ and $p_i = h(i)$, respectively, where
$f(x)$ and $h(x)$ are random polynomials of degree $k - 1$ over $\mathbb{Z}_q$ such that
$f(0) = x_0$ and $h(0) = p_0$. Thus, each user $U_i$ has the key pair $(x_i, X_i = g^{x_i})$ and
each $V_i$ has $(p_i, P_i = g^{p_i})$. In our protocol, $H = g^{x_0 + p_0}$ is announced as the
general public key. The detailed phases of the improved frequency mining protocol
are presented in Figure 3.7.
3.4.2 Protocol Analysis
In contrast to the previous protocol, the private keys $y_i$ and $q_i$ of the improved
protocol are temporary keys chosen at encryption time. The general key $Y$ is
replaced by $g$ and $X$ is replaced by $H$. This protocol preserves the privacy of each
user against up to $2k - 2$ corrupted users. In the improved protocol, the
computational cost of each user increases by one modular exponentiation. The
computational cost for the miner is nearly the same as in the previous protocol.
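The improved protocol of Figure 3.7 can be simulated in the same style as the basic one. The sketch below is an illustration under stated assumptions: toy parameters ($p = 2039$, $q = 1019$, $g = 4$), all $n$ pairs taking part in Phases 1 to 3, and the first $k$ pairs forming the decryption set $T$. The function names are hypothetical.

```python
import secrets

P, Q, G = 2039, 1019, 4   # toy group (assumed): p = 2q + 1, g has order q

def rand_exp():
    return secrets.randbelow(Q - 1) + 1

def shamir_shares(secret, n, k):
    """Shares at points 1..n of a random degree-(k-1) polynomial over Z_q."""
    coeffs = [secret] + [secrets.randbelow(Q) for _ in range(k - 1)]
    return [sum(c * pow(i, e, Q) for e, c in enumerate(coeffs)) % Q
            for i in range(1, n + 1)]

def lagrange_at_zero(t, T):
    """Coefficient prod_{j in T, j != t} (-j) / (t - j), computed mod q."""
    num = den = 1
    for j in T:
        if j != t:
            num, den = num * (-j) % Q, den * (t - j) % Q
    return num * pow(den, -1, Q) % Q

def improved_frequency_mining(u, v, k):
    n = len(u)
    x0, p0 = rand_exp(), rand_exp()
    x, p = shamir_shares(x0, n, k), shamir_shares(p0, n, k)
    H = pow(G, x0 + p0, P)                         # general public key
    prod_k1, K = 1, 1
    for i in range(n):
        Xi = pow(G, x[i], P)
        ki, ri, qi, yi = (rand_exp() for _ in range(4))
        # Phase 1 (U_i).
        c1, c2 = pow(G, u[i], P) * pow(Xi, ki, P) % P, pow(G, ki, P)
        # Phase 2 (V_i): temporary key q_i, masks H^(q_i) and g^(q_i).
        b1, b2 = (c1, c2) if v[i] == 1 else (1, 1)
        r1 = b1 * pow(Xi, ri, P) * pow(H, qi, P) % P
        r2, r3 = b2 * pow(G, ri, P) % P, pow(G, qi, P)
        # Phase 3 (U_i): temporary key y_i.
        k1 = r1 * pow(pow(r2, x[i], P), -1, P) * pow(H, yi, P) % P
        k2 = r3 * pow(G, yi, P) % P
        # Phase 4 (miner): K = prod K2.
        prod_k1, K = prod_k1 * k1 % P, K * k2 % P
    # Phases 5-6: k pairs return a_t = K^(x_t), b_t = K^(p_t); the miner
    # interpolates in the exponent to obtain K' = K^(x0 + p0).
    T = list(range(1, k + 1))
    K_prime = 1
    for t in T:
        a_t, b_t = pow(K, x[t - 1], P), pow(K, p[t - 1], P)
        K_prime = K_prime * pow(a_t * b_t % P, lagrange_at_zero(t, T), P) % P
    d = prod_k1 * pow(K_prime, -1, P) % P
    return next(f for f in range(n + 1) if pow(G, f, P) == d)
```

The key point is that neither $x_0$ nor $p_0$ is ever reconstructed: the Lagrange coefficients are applied as exponents to the values $a_t b_t$.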
• Phase 1. Each user $U_i$ does as follows:
– Randomly choose $k_i$ from $\{1, \ldots, q - 1\}$.
– Compute $C^{(i)} = (C_1^{(i)}, C_2^{(i)}) = (g^{u_i} X_i^{k_i}, g^{k_i})$.
– Send $C^{(i)}$ to the miner.
• Phase 2. Each user $V_i$ does as follows:
– Get $C^{(i)}$ from the miner.
– Randomly choose $r_i$ and $q_i$ from $\{1, \ldots, q - 1\}$.
– If $v_i = 0$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (X_i^{r_i} H^{q_i}, g^{r_i}, g^{q_i})$.
– If $v_i = 1$, compute $R^{(i)} = (R_1^{(i)}, R_2^{(i)}, R_3^{(i)}) = (g^{u_i} X_i^{r_i + k_i} H^{q_i}, g^{r_i + k_i}, g^{q_i})$.
– Send $R^{(i)}$ to the miner.
• Phase 3. Each user $U_i$ does as follows:
– Get $R^{(i)}$ from the miner.
– Randomly choose $y_i$ from $\{1, \ldots, q - 1\}$.
– Compute $K^{(i)} = (K_1^{(i)}, K_2^{(i)}) = (R_1^{(i)} (R_2^{(i)})^{-x_i} H^{y_i}, R_3^{(i)} g^{y_i})$.
– Send $K^{(i)}$ to the miner.
• Phase 4. The miner computes $K = \prod_{i \in S} K_2^{(i)}$.
• Phase 5. The users do as follows:
– Each $U_i$ computes $a_i = K^{x_i}$ and sends $a_i$ to the miner.
– Each $V_i$ computes $b_i = K^{p_i}$ and sends $b_i$ to the miner.
• Phase 6. The miner does as follows:
– Compute $K' = \prod_{t \in T} (a_t b_t)^{\prod_{j \in T, j \ne t} \frac{-j}{t - j}}$.
– Compute $d = \frac{\prod_{i=1}^{n} K_1^{(i)}}{K'}$.
– Find $f$ from $\{0, 1, \ldots, n\}$ that satisfies $g^f = d$.
– Output $f$.
Figure 3.7: Improved frequency mining protocol
3.5 Conclusion
In this chapter, we proposed a method for privacy-preserving frequency-based
learning in the 2PFD setting, which had not been investigated previously.
The proposed method is based on the ElGamal encryption scheme, and it provides
strong privacy without loss of accuracy. We illustrated the applicability of the
method by applying it to the design of a privacy-preserving protocol for naive
Bayes learning in the 2PFD setting. We conducted experiments to evaluate the cost
of the protocols, and the results showed that the protocols are efficient and
practical. We also discussed an improvement of the protocol that uses the Shamir
secret sharing scheme to allow the miner to obtain the frequency without requiring
the full participation of all $n$ user pairs.
Chapter 4
ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN
VERTICALLY DISTRIBUTED DATA
4.1 Introduction
In this chapter, we present protocols for vertically partitioned data: a portion
of each transaction is present at each party, but no party holds complete
information for any transaction. Several protocols have been proposed for
this problem [Zhong, 2007, Vaidya and Clifton, 2005, Han and Ng, 2007]. However,
some of them resist collusion of at most $n - 2$ corrupted parties among
$n$ participants, while others require at least one non-colluding party. We
propose protocols for privacy-preserving frequent itemset mining that do not
require any trusted party and that protect the privacy of each party against
collusion of any group of corrupted parties. In addition, we give two protocols
so that the parties can choose between two privacy levels: one protocol reveals
only the support count, and the other reveals nothing.
4.2 Problem statement
The association rule and frequent itemset mining problem is formally stated in
[Cheung et al., 1996]. Given a database $D$ with $m$ transactions, the problem is to
find the association rules, which are implications of the form $X \Rightarrow Y$, where $X$
and $Y$ are subsets of the set of items of $D$ and $X \cap Y = \emptyset$. An itemset $X$ is
frequent if its support count (the number of transactions that contain $X$) is not
less than the minimum support count $t$. The main technical problem in association
rule mining is to find the frequent itemsets.
Assume that $D$ is vertically distributed over $n$ parties $P_1, \ldots, P_n$. The parties wish
to find the frequent itemsets from $D$, where $D$ is called the joint data set of all
parties. Our aim is to design distributed protocols that obtain the frequent itemsets
while preserving the privacy of each party's data. We consider privacy to mean
protecting individual data records as well as protecting information about the
local support count of the frequent itemsets of each party, and even the global
support count of the joint database. The problem of identifying a frequent itemset
can be formulated as follows.
In a distributed setting with $n$ parties, each party $P_i$ has a private vector
$U_i = (u_{i1}, \ldots, u_{im})$, where each $u_{ij} \in \{0, 1\}$, $i = 1, \ldots, n$, and $j = 1, \ldots, m$. For a
public threshold $t$, the privacy-preserving frequent itemset identification problem
is to check whether $s = \sum_{j=1}^{m} \prod_{i=1}^{n} u_{ij} \ge t$ without disclosing any private
information of the participants.
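For reference, the quantity the protocols must evaluate privately is easy to state in the clear. The sketch below computes $s$ and the threshold test without any privacy protection; the function names and the list-of-lists layout (`vectors[i][j]` holds $u_{ij}$) are assumptions for illustration.

```python
def support_count(vectors):
    """Return s = sum over transactions j of the product over parties i of u_ij."""
    # zip(*vectors) yields, for each transaction j, the tuple (u_1j, ..., u_nj);
    # the product of 0/1 values is 1 exactly when all of them are 1.
    return sum(1 for column in zip(*vectors) if all(column))

def is_frequent(vectors, t):
    """Plaintext version of the check s >= t."""
    return support_count(vectors) >= t
```

A transaction contributes to $s$ only if every party's portion of the itemset is present in it, which is why no single party can compute $s$ from its own vector.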
4.3 Privacy definition
The privacy preservation of the proposed protocol is based on the semi-honest
security model [Goldreich, 2004]. We give the privacy definition for the
proposed protocol under the following model. There are $n$ parties involved
in the protocol. Each party $P_i$ has the private input $U_i$, where $U_i$ is a binary
vector. We assume that, prior to the protocol, each party has obtained a key
pair for the ElGamal encryption scheme: the private key $x_i$ and the public key $y_i$.
Each party's public key is known to all members of the system, while the private
key is kept secret. This definition is similar to Definition 2.3, except that
the view of each party includes the public keys of the other parties, and each
party's private key is a component of its input.
4.4 Support count preserving protocol
4.4.1 Overview
Assuming that $X$ is a frequent itemset, we have $t \le s \le m$. Thus, there exists
a 0 in the list $\lambda = \{\lambda_1 = s - 1 - t, \lambda_2 = s - 2 - t, \ldots, \lambda_k = s - k - t\}$, where
$k = m - t$. If $s$ were known to all parties, this problem could be solved immediately.
However, for our purpose of strong privacy, this list cannot be revealed.
The basic idea of the protocol is therefore as follows. Let $p$ and $q$ be two primes
such that $q \mid (p - 1)$, let $G$ be a subgroup of $\mathbb{Z}_p^*$ of order $q$, and let $g$ be a
generator of $G$. All computations in this chapter take place in $\mathbb{Z}_p$. The proposed
protocol implements the following function:
$$(U_1, U_2, \ldots, U_n) \to (g^{r_1 \lambda_{\pi(1)}}, \ldots, g^{r_k \lambda_{\pi(k)}}),$$
where $(\lambda_{\pi(1)}, \ldots, \lambda_{\pi(k)})$ is a random permutation of $(\lambda_1, \ldots, \lambda_k)$ and
$r_j = \sum_{i=1}^{n} r_{ij}$, with $r_{ij}$ generated uniformly from $[1, q - 1]$ by party $i$. The
parties can then check whether some $\lambda_j = 0$ exists, which is equivalent to
$g^{r_j \lambda_j} = g^0 = 1$. When $\lambda_j \ne 0$, $g^{r_j \lambda_j}$ is a random number, so the protocol does
not leak any information other than the final result. To achieve this goal, we
make extensive use of the following two techniques.
The joint decryption technique [Hirt and Sako, 2000]: We assume that
each party has a key pair $(x_i, y_i = g^{x_i})$. We define $y = \prod_{i=1}^{n} y_i = g^x$, where
$x = \sum_{i=1}^{n} x_i$. In our protocol, the parties use $y$ as the public key to encrypt
their data, and each message $m$ is mapped to $g^m$ before encryption. Decryption
needs to be performed jointly by all parties.
Rerandomization technique [Markus and Patrick, 1996]: Rerandomization
is a multi-party protocol that involves several mix servers. The input to the
protocol is a list of ciphertext items $\{(a_1, h_1), \ldots, (a_m, h_m)\}$ and the output is a
re-encrypted, permuted list of those ciphertext items
$\{(a'_{\pi(1)}, h'_{\pi(1)}), \ldots, (a'_{\pi(m)}, h'_{\pi(m)})\}$.
The security of this technique is characterized as follows: by looking at these two
sequences of ciphertexts, the adversary cannot determine any information about
the correspondence between the new ciphertexts and the old ciphertexts. In
the proposed protocol, we use a rerandomization technique based on ElGamal
encryption, in which each party plays the role of a mix server.
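Both techniques can be sketched together in a few lines. The code below is a single-process illustration under assumed toy parameters ($p = 2039$, $q = 1019$, $g = 4$); the function names are hypothetical and the ciphertext layout $(a, h) = (g^m y^{\alpha}, g^{\alpha})$ follows the notation used in this chapter.

```python
import secrets

P, Q, G = 2039, 1019, 4   # toy group (assumed): p = 2q + 1, g has order q

def rand_exp():
    return secrets.randbelow(Q - 1) + 1

def joint_keygen(n):
    """Each party picks x_i; the joint public key is y = prod g^(x_i)."""
    xs = [rand_exp() for _ in range(n)]
    y = 1
    for x in xs:
        y = y * pow(G, x, P) % P
    return xs, y

def encrypt(m, y):
    alpha = rand_exp()
    return (pow(G, m, P) * pow(y, alpha, P) % P, pow(G, alpha, P))

def rerandomize(c, y):
    """Multiply by a fresh encryption of g^0; the plaintext is unchanged."""
    a, h = c
    delta = rand_exp()
    return (a * pow(y, delta, P) % P, h * pow(G, delta, P) % P)

def mix(ciphertexts, y):
    """One mix server: re-encrypt every item, then permute the list."""
    out = [rerandomize(c, y) for c in ciphertexts]
    secrets.SystemRandom().shuffle(out)
    return out

def joint_decrypt(c, xs):
    """Every party contributes h^(x_i); the result is g^m."""
    a, h = c
    mask = 1
    for x in xs:
        mask = mask * pow(h, x, P) % P
    return a * pow(mask, -1, P) % P
```

After each party has applied `mix` once, the list decrypts to the same multiset of values $g^m$, but the link between input and output positions is hidden as long as at least one party keeps its permutation and randomness secret.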
4.4.2 Protocol design
The protocol is presented in Figure 4.1.
4.4.3 Correctness Analysis
Theorem 4.1. If all participants follow the protocol and there exists a plaintext
"1" in the decryption list $\{d_1, \ldots, d_k\}$, then $t \le s \le m$. If there is no plaintext
"1" in the decryption list, then $s < t$.
4.4.4 Privacy Analysis
The important security feature of our protocol, and its advantage over previous
methods, is that we do not assume the existence of any kind of trusted party.
Moreover, no collusion of parties can lead to the revelation of any private
information, unless all parties together form a single collusion, which is not
a meaningful case.
Theorem 4.2. The protocol in Subsection 4.4.2 preserves the privacy of the honest
parties against collusion of up to $n - 1$ corrupted parties.
4.4.5 Performance analysis
Let the size of each party's key be $K$ bits. The upper bound on the total
communication cost of the protocol is $O(nmK)$, which is equivalent to that of
Zhong's protocol [Zhong, 2007]. The computational cost of the protocol is bounded
by $O(mn)$ exponentiations and $O(mn)$ inversions. However, these operations can be
computed concurrently. Therefore, the overall computational complexity is $O(m)$,
which is equivalent to that of Zhong's protocol [Zhong, 2007].
Input: There are $n$ parties; each party $P_i$ has $U_i = (u_{i1}, \ldots, u_{im})$ ($u_{ij} \in \{0, 1\}$).
Output: Check $\in \{True, False\}$.
• Phase 1. Encryption and connection.
For $j = 1, \ldots, m$:
– For $i = 1, \ldots, n$: $P_i$ computes $C_i(j) \stackrel{def}{=} (a_{ij}, h_{ij}) = (y^{\alpha_{ij}}, g^{\alpha_{ij}})$, where $\alpha_{ij}$ is
randomly picked from $[1, q - 1]$.
– $P_1$ computes $C_j \stackrel{def}{=} (a_j, h_j) = (g^{u_{1j}} a_{1j}, h_{1j})$ and sends $C_j$ to $P_2$.
– For $i = 2, \ldots, n$: $P_i$ computes $C_j = (a_j, h_j) = (a_j^{u_{ij}} a_{ij}, h_j^{u_{ij}} h_{ij})$ and sends $C_j$ to
$P_{i+1 \pmod n}$.
$P_1$ computes $C = (a, h) = (\prod_{j=1}^{m} a_j, \prod_{j=1}^{m} h_j)$ and broadcasts it.
• Phase 2. Encryption randomization. For $j = 1, \ldots, k$:
– For $i = 1, \ldots, n$: $P_i$ computes $C_i(j) = (a_{ij}, h_{ij}) = ((a / g^{j+t})^{r_{ij}}, h^{r_{ij}})$, where $r_{ij}$ is
uniformly chosen from $[1, q - 1]$.
– $P_1$ sets $C_j = (a_j, h_j) = (a_{1j}, h_{1j})$ and sends $C_j$ to $P_2$.
– For $i = 2, \ldots, n$: $P_i$ computes $C_j = (a_j, h_j) = (a_j a_{ij}, h_j h_{ij})$ and sends $C_j$ to
$P_{i+1 \pmod n}$.
• Phase 3. Randomization and permutation. For $i = 1, \ldots, n$:
– $P_i$ computes, for $j = 1, \ldots, k$, $R_j = (R_j^{(1)}, R_j^{(2)}) = (a_{\pi_i(j)} y^{\delta_{\pi_i(j)}}, h_{\pi_i(j)} g^{\delta_{\pi_i(j)}})$, sets
$C_j = R_j$, and sends $C_j$ to $P_{i+1 \pmod n}$. Here $\pi_i$ is a permutation on $\{1, \ldots, k\}$ and
$\delta_{\pi_i(j)}$ is uniformly chosen from $[1, q - 1]$.
• Phase 4. Decryption.
For $j = 1, \ldots, k$:
– For $i = 1, \ldots, n$: $P_i$ computes $h_{ij} = (h_j)^{x_i}$.
– $P_1$ sets $h_j = h_{1j}$ and sends it to $P_2$.
– For $i = 2, \ldots, n$: $P_i$ computes $h_j = h_{ij} h_j$ and sends $h_j$ to $P_{i+1 \pmod n}$.
– $P_1$ computes $d_j = a_j / h_j$; if $d_j = 1$ then Check = True, else Check = False.
$P_1$ outputs Check.
Figure 4.1: Support count preserving protocol.
4.5 Support count computation-based protocol
4.5.1 Overview
Recall that the problem is to decide whether $s = \sum_{j=1}^{m} \prod_{i=1}^{n} u_{ij} \geq t$, where each $U_i = (u_{i1}, \ldots, u_{im})$ ($u_{ij} \in \{0, 1\}$) is owned by party $P_i$. Let $\lambda_j = \sum_{i=1}^{n} u_{ij} - n$ ($j = 1, \ldots, m$), and note that $\prod_{i=1}^{n} u_{ij} = 1$ exactly when $\lambda_j = 0$. Therefore, $s$ is the number of indices $j$ with $\lambda_j = 0$. We design a protocol to compute this value. Our basic idea is that the parties should obtain the list $(g^{\lambda_{\pi(1)}}, \ldots, g^{\lambda_{\pi(m)}})$, where $(\lambda_{\pi(1)}, \ldots, \lambda_{\pi(m)})$ is a random permutation of $(\lambda_1, \ldots, \lambda_m)$. In other words, we need a protocol to implement the following function:
$(U_1, U_2, \ldots, U_n) \to (g^{\lambda_{\pi(1)}}, \ldots, g^{\lambda_{\pi(m)}})$
Obtaining this result achieves two goals. First, the parties can count the number of $\lambda_j = 0$, which equals the number of entries with $g^{\lambda_j} = g^0 = 1$. After obtaining the support count, we can compare it to the threshold $t$. Second, when $\lambda_j \neq 0$, $g^{\lambda_j}$ is a random number, so we achieve the privacy goal as well, because the parties cannot know which $\lambda_j$ a given $g^{\lambda_j}$ was generated from.
4.5.2 Protocol Design
The detailed protocol is given in Figure 4.2.
Input: There are $n$ parties; each party $P_i$ has $U_i = (u_{i1}, \ldots, u_{im})$ with $u_{ij} \in \{0, 1\}$.
Output: $s = \sum_{j=1}^{m} \prod_{i=1}^{n} u_{ij}$.
• Phase 1. Encryption and connection. For $j = 1, \ldots, m$:
– For $i = 1, \ldots, n$: each $P_i$ computes $C_i(j) = (a_{ij}, h_{ij}) = (g^{u_{ij}} y^{\alpha_{ij}}, g^{\alpha_{ij}})$, where $\alpha_{ij}$ is randomly chosen from $[1, q-1]$.
– $P_1$ computes $C_j = (a_j, h_j) = (g^{-n} a_{1j}, h_{1j})$ and sends $C_j$ to $P_2$.
– For $i = 2, \ldots, n$: each $P_i$ computes $C_j = (a_j, h_j) = (a_j a_{ij}, h_j h_{ij})$ and sends it to $P_{i+1 \pmod n}$.
• Phase 2. Randomization and permutation. For $i = 1, \ldots, n$:
– $P_i$ computes, for $j = 1, \ldots, m$, $R_j = (R_j^{(1)}, R_j^{(2)}) = (a_{\pi_i(j)} y^{\delta_{\pi_i(j)}}, h_{\pi_i(j)} g^{\delta_{\pi_i(j)}})$; after that, $P_i$ sets $C_j = R_j$ and sends $C_j$ to $P_{i+1 \pmod n}$. Here $\pi_i$ is a permutation on $\{1, \ldots, m\}$ and $\delta_{\pi_i(j)}$ is uniformly chosen from $[1, q-1]$.
• Phase 3. The key component computation. For $j = 1, \ldots, m$:
– For $i = 1, \ldots, n$: each $P_i$ computes $h_{ij} = (h_j)^{x_i}$.
– $P_1$ sets $h_j = h_{1j}$ and sends $h_j$ to $P_2$.
– For $i = 2, \ldots, n$: each $P_i$ computes $h_j = h_j h_{ij}$ and sends $h_j$ to $P_{i+1 \pmod n}$.
• Phase 4. Decryption. $P_1$ does:
– $s = 0$
– For $j = 1, \ldots, m$: $d_j = a_j / h_j$; if $d_j = 1$ then $s = s + 1$.
– Output $s$
Figure 4.2: The support count computation protocol.
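The four phases of Figure 4.2 can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the group parameters (p = 2039, q = 1019, g = 4) are toy values chosen here, the joint key y = g^(x_1 + ... + x_n) is assumed to be set up beforehand, the function name `support_count` is hypothetical, and message passing between parties is replaced by plain loops. The sketch requires n < q so that g^λ = 1 only when λ = 0.

```python
import random

# Toy parameters, assumed only for this sketch: p = 2q + 1 with p and q prime,
# and g = 4 generating the subgroup of order q. Real use needs large primes.
P, Q, G = 2039, 1019, 4

def support_count(U, rng=random):
    """Simulate the four phases in one process.

    U[i][j] is the bit u_ij held by party i for transaction j.
    Returns the number of j for which every party holds a 1.
    """
    n, m = len(U), len(U[0])
    x = [rng.randrange(1, Q) for _ in range(n)]   # private key share x_i
    y = pow(G, sum(x) % Q, P)                     # joint public key

    # Phase 1: start from g^(-n); each party multiplies in its ciphertext
    # (g^u_ij * y^alpha_ij, g^alpha_ij), so C_j encrypts g^lambda_j.
    a = [pow(G, (-n) % Q, P)] * m
    h = [1] * m
    for i in range(n):
        for j in range(m):
            alpha = rng.randrange(1, Q)
            a[j] = a[j] * pow(G, U[i][j], P) * pow(y, alpha, P) % P
            h[j] = h[j] * pow(G, alpha, P) % P

    # Phase 2: each party in turn permutes the list and re-randomizes it.
    for _ in range(n):
        perm = list(range(m))
        rng.shuffle(perm)
        new_a, new_h = [], []
        for j in range(m):
            delta = rng.randrange(1, Q)
            new_a.append(a[perm[j]] * pow(y, delta, P) % P)
            new_h.append(h[perm[j]] * pow(G, delta, P) % P)
        a, h = new_a, new_h

    # Phase 3: the parties jointly raise h_j to x_1 + ... + x_n.
    key = [1] * m
    for i in range(n):
        for j in range(m):
            key[j] = key[j] * pow(h[j], x[i], P) % P

    # Phase 4: d_j = a_j / key_j equals g^lambda_j; count the entries equal to 1.
    return sum(1 for j in range(m) if a[j] * pow(key[j], -1, P) % P == 1)
```

For three parties with bit vectors (1,1,0,1), (1,0,1,1) and (1,1,1,1), only the first and last transactions contain the itemset at every party, so the simulation returns 2 regardless of the random choices.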
4.5.3 Correctness Analysis
Theorem 4.3. If all participants follow the protocol, the number of plaintexts “1” in the decryption list $\{d_1, d_2, \ldots, d_m\}$ is the support count.
4.5.4 Privacy Analysis

Theorem 4.4. The protocol in Subsection 4.5.2 preserves the privacy of the honest parties against collusion of up to n − 1 corrupted parties.
4.5.5 Performance analysis
The communication overhead of this protocol is lower than that of the support count preserving protocol. To evaluate the efficiency of the protocol in practice, we implemented the proposed protocols in C# and ran them on a PC. As communication cost depends on the network performance and the physical distance between the parties, we simply modeled the parties as threads that exchange data directly through shared memory. The sizes of the parameters p and q are 1024 and 160 bits, respectively. We measured the computation cost of the protocol for different numbers of parties, from 2 to 10. In the thesis, we report the computation time of all parties as a function of m and n. For a typical scenario, where m = 1000 and n = 10, the computation time of all parties is about 15.1 seconds.
4.6 Using binary tree communication structure
In [Vaidya and Clifton, 2003], the authors show that any function that can be represented as $y = f(x_1, \ldots, x_n) = x_1 \otimes x_2 \otimes \cdots \otimes x_n$, with $\otimes$ being an associative operation, can be securely computed using a more efficient tree structure. In the thesis, we show that our protocols can be used in the binary tree communication structure.
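The benefit of the tree structure can be seen in a small, non-cryptographic sketch: an associative operation over n inputs is combined pairwise, level by level, so the number of sequential rounds is about log2(n) instead of the n − 1 hops of a ring. The function name `tree_reduce` and the example operation are assumptions made for illustration; the secure version would apply the same schedule to ciphertexts.

```python
def tree_reduce(values, op):
    """Combine a non-empty list pairwise, level by level.

    Returns (result, rounds). Only associativity of op is required,
    because the left-to-right order of the operands is preserved.
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # the unpaired value moves up unchanged
        level, rounds = nxt, rounds + 1
    return level[0], rounds


# Example: multiplying five group elements modulo 23 takes 3 rounds.
result, rounds = tree_reduce([3, 5, 7, 2, 4], lambda a, b: a * b % 23)
```

With eight inputs the same schedule also finishes in three rounds, whereas a ring would need seven sequential hops.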
4.7 Privacy-preserving distributed Apriori algorithm

The main contribution of this chapter is two protocols for identifying frequent itemsets. In this section of the thesis, we show that these protocols can be easily incorporated into the Apriori algorithm following the general idea of [Vaidya and Clifton, 2002, Zhong, 2007]. This allows the parties to cooperate in mining all frequent itemsets in the joint data set without disclosing each party's private data.
4.8 Conclusion
In this chapter, we proposed protocols for privacy-preserving frequent itemset mining in vertically distributed data. We showed that the proposed protocols are more secure than the most secure existing protocols with collusion resistance. Moreover, this holds even when there are up to n − 1 corrupted parties among the n parties in the protocol. In addition, we gave two protocols so that the parties can select one of two privacy levels: one protocol reveals only the support count, and the other reveals nothing.
Chapter 5
PRIVACY PRESERVING CLUSTERING
5.1 Introduction
This work first develops a privacy preserving clustering protocol for the multi-party model. Unlike the existing protocol [Lin et al., 2005], our protocol allows the number of participating parties to be arbitrary. Moreover, it does not reveal the numerators and denominators used in calculating the parameters, so the parties cannot learn extra information about the others. Second, we propose a better protocol for the case in which the dataset is horizontally partitioned into only two parts. This protocol protects the privacy of intermediate global information, in particular the intermediate candidate cluster centers.
5.2 Problem statement
The EM algorithm is presented in [Dempster et al., 1977]. Let D be a data set that has m objects $\{x_1, \ldots, x_m\}$ described by d attributes. Assume that there exist k classes in the data set D, each following some Gaussian distribution. The parameters of class i are $\psi_i = \{\mu_i, \Sigma_i, \pi_i\}$, in which $\mu_i$ is the center of the Gaussian distribution, $\Sigma_i$ is the covariance matrix of the distribution and $\pi_i$ is the probability of class i. The EM algorithm estimates the parameter set $\psi$ that maximizes the data likelihood $\log(L(\psi))$. To estimate $\psi$, it starts with a randomly chosen initial parameter configuration $\psi_0$. Then, it keeps invoking iterations to recompute $\psi_{t+1}$ based on $\psi_t$. Every iteration consists of two steps. E-step: compute the expected value of $z_{ij}$, where $z_{ij}$ is the posterior probability that $x_j$ comes from class i. M-step: update the parameters $\psi_{t+1}$. Convergence happens when $\delta = |\log(L(\psi^{(t+1)})) - \log(L(\psi^{(t)}))| \leq \epsilon$, where $\epsilon$ is a predetermined threshold.
Assume that the data set D is horizontally partitioned among n parties, where each party i holds some of the objects of D, and that the parties want to cluster the joint data set. Each party should learn the cluster to which each of its data objects belongs, while nothing about the objects and the local statistical parameters of each party is revealed.
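For readers unfamiliar with EM, the E-step and M-step described above can be sketched for the simplest case of one-dimensional data held in a single place. This is a plain, non-private illustration: the function name `em_1d`, the variance floor and the iteration cap are assumptions added here, and the initial parameters are passed in rather than chosen at random.

```python
import math

def em_1d(data, mu, var, pi, eps=1e-6, max_iter=200):
    """EM for a 1-D Gaussian mixture; mu, var, pi are initial per-class lists."""
    mu, var, pi = list(mu), list(var), list(pi)
    k, m = len(mu), len(data)
    prev = None
    for _ in range(max_iter):
        # E-step: z[i][j] is the posterior probability that x_j is from class i.
        dens = [[pi[i] * math.exp(-(x - mu[i]) ** 2 / (2 * var[i]))
                 / math.sqrt(2 * math.pi * var[i]) for x in data]
                for i in range(k)]
        total = [sum(dens[i][j] for i in range(k)) for j in range(m)]
        z = [[dens[i][j] / total[j] for j in range(m)] for i in range(k)]
        # Log-likelihood of the parameters used in this E-step.
        loglik = sum(math.log(t) for t in total)
        # M-step: weighted mean, weighted variance and class probability.
        for i in range(k):
            c = sum(z[i])
            mu[i] = sum(w * x for w, x in zip(z[i], data)) / c
            var[i] = max(sum(w * (x - mu[i]) ** 2 for w, x in zip(z[i], data)) / c,
                         1e-9)  # floor added here to avoid a zero variance
            pi[i] = c / m
        # Convergence test: delta <= eps between successive iterations.
        if prev is not None and abs(loglik - prev) <= eps:
            break
        prev = loglik
    return mu, var, pi


mu, var, pi = em_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1],
                    mu=[0.0, 6.0], var=[1.0, 1.0], pi=[0.5, 0.5])
```

In the distributed setting of this chapter, the sums inside the M-step are exactly the quantities that the parties must combine without revealing their local contributions.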
5.3 Privacy preserving clustering for multi-party model
The goal of the clustering algorithm is to compute $z_{ij}$. To obtain $z_{ij}$, each party needs to know the covariance matrix $\Sigma_i$, the vector of means $\mu_i$ and $\pi_i$ in each iteration of the algorithm. These parameters can be expressed by the formulae $\mu_i^{(t+1)} = \sum_{l=1}^{n} A_{il} / \sum_{l=1}^{n} C_{il}$; $\Sigma_i^{(t+1)} = \sum_{l=1}^{n} B_{il} / \sum_{l=1}^{n} C_{il}$; $\pi_i^{(t+1)} = \sum_{l=1}^{n} C_{il} / \sum_{l=1}^{n} m_l$. Each parameter therefore has the form $A/B$, called the mean value, where $A = \sum_{i=1}^{n} x_i$, $B = \sum_{i=1}^{n} m_i$, and each pair $(x_i, m_i)$ is owned by party $P_i$. The problem is thus to privately compute the multi-party mean. Assuming that $A$ and $B$ belong to the range $[0, M]$, the multi-party mean computation protocol is presented in Figure 5.1.
Input: Each $P_i$ ($1 \leq i \leq n$) has $(x_i, m_i)$.
Output: The parties obtain the value $A/B$.
1: Each $P_i$ uniformly chooses $r_i$ and $k_i$ from $[0, M]$, then computes $u_i = x_i + r_i \bmod M$ and $v_i = m_i + k_i \bmod M$, and sends $u_i$ and $v_i$ to $P_n$.
2: $P_n$ computes $u = \sum_{i=1}^{n} u_i \bmod M$ and $v = \sum_{i=1}^{n} v_i \bmod M$.
3: Each $P_i$ randomly splits $r_i$ into $n - i + 1$ parts $\{r_{ij} \mid j = i, \ldots, n\}$, and $k_i$ into $n - i + 1$ parts $\{k_{ij} \mid j = i, \ldots, n\}$. Then, $P_i$ sends $r_{ij}$ and $k_{ij}$ ($j = i + 1, \ldots, n$) to $P_j$.
4: Each $P_i$ ($1 < i \leq n$) computes $r'_i = r_{ii} + \sum_{j=1}^{i-1} r_{ji} \bmod M$ and $k'_i = k_{ii} + \sum_{j=1}^{i-1} k_{ji} \bmod M$. Next, it sends $r'_i$ and $k'_i$ to $P_1$.
5: $P_1$ computes $r = r_{11} + \sum_{i=2}^{n} r'_i \bmod M$ and $k = k_{11} + \sum_{i=2}^{n} k'_i \bmod M$.
6: $P_1$ and $P_n$ use the approximate $\ln(x)$ protocol given in [Lindell and Pinkas, 2000]. $P_1$ obtains $a_1$ and $b_1$, and $P_n$ obtains $a_n$ and $b_n$, such that
$a_1 + a_n = \alpha \ln(u - r \bmod M) \bmod M$
$b_1 + b_n = \alpha \ln(v - k \bmod M) \bmod M$
where $\alpha$ is a public constant used to make all elements integers.
7: $P_n$ computes $s_n = a_n - b_n \bmod M$ and sends it to $P_1$; $P_1$ computes $s_1 = s_n + a_1 - b_1 \bmod M$ and broadcasts it to all parties.
8: Finally, all parties can calculate $\mu = \exp(s_1 / \alpha)$.
Figure 5.1: Privacy preserving multi-party mean computation
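The arithmetic of Figure 5.1 can be checked with a single-process sketch. Several simplifications are assumptions made here, not part of the protocol: the approximate ln(x) protocol of step 6 is replaced by a local `math.log` call followed by random additive sharing, the share difference is taken as ln A minus ln B so that the final exponent yields A/B, the result modulo M is read as a signed number, and the names `private_mean` and `split` and the default values of M and alpha are hypothetical.

```python
import math
import random

def private_mean(xs, ms, M=2**31, alpha=10**6, rng=random):
    """Simulate the masking, mask splitting and log-share steps in one process."""
    n = len(xs)

    def split(value, parts):
        """Split value into `parts` random pieces that sum to value mod M."""
        pieces = [rng.randrange(M) for _ in range(parts - 1)]
        pieces.append((value - sum(pieces)) % M)
        return pieces

    # Steps 1-2: each party masks its inputs; P_n only sees the masked sums.
    r = [rng.randrange(M) for _ in range(n)]
    k = [rng.randrange(M) for _ in range(n)]
    u = sum(x + ri for x, ri in zip(xs, r)) % M
    v = sum(m + ki for m, ki in zip(ms, k)) % M

    # Steps 3-4: party i keeps one piece of its mask and sends the rest to the
    # parties after it; held[j] accumulates what party j ends up holding.
    r_held, k_held = [0] * n, [0] * n
    for i in range(n):
        for j, piece in zip(range(i, n), split(r[i], n - i)):
            r_held[j] = (r_held[j] + piece) % M
        for j, piece in zip(range(i, n), split(k[i], n - i)):
            k_held[j] = (k_held[j] + piece) % M

    # Step 5: P_1 collects the total masks.
    r_sum, k_sum = sum(r_held) % M, sum(k_held) % M

    # Step 6 (stand-in, not the real ln protocol): additive shares of alpha*ln(.).
    ln_a = round(alpha * math.log((u - r_sum) % M))
    ln_b = round(alpha * math.log((v - k_sum) % M))
    a1, b1 = rng.randrange(M), rng.randrange(M)
    an, bn = (ln_a - a1) % M, (ln_b - b1) % M

    # Steps 7-8: combine the shares and exponentiate.
    s1 = ((an - bn) + a1 - b1) % M
    if s1 > M // 2:          # assumed signed reading of the residue
        s1 -= M
    return math.exp(s1 / alpha)


mean = private_mean([10, 20, 30], [2, 3, 5])   # A = 60, B = 10
```

With the inputs shown, A = 60 and B = 10, so the returned value is approximately 6, up to the rounding introduced by the scaling constant alpha.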
The multi-party mean computation protocol is used to design the privacy preserving EM-based clustering protocol given in Figure 5.2.
5.4 Privacy preserving clustering for two-party model
When the data set is horizontally partitioned into only two parts, we decompose the privacy preserving clustering problem into the following three subproblems: secure mean sharing, secure covariance matrix computation, and secure posterior probability computation. We solve these problems by using the OPE and SPP protocols. The most important problem is secure mean sharing, presented in Figure 5.4.
Input: Each party $P_l$ ($l = 1, \ldots, n$) has the data set $D_l$ with $m_l$ objects.
Output: The cluster of each object.
1: The parties randomly initialize $z_{ij}$ to 0 or 1 ($i = 1, \ldots, k$; $j = 1, \ldots, m_l$).
2: $t := 0$
3: while $\delta > \epsilon$ do
4:   for $i = 1, \ldots, k$ do
5:     For each $l \in \{1, \ldots, n\}$, $P_l$ locally computes $A_{il}$ and $C_{il}$.
6:     The parties jointly compute $\mu_i^{(t+1)}$ and $\pi_i^{(t+1)}$ by using Protocol 5.1.
7:     For each $l \in \{1, \ldots, n\}$, $P_l$ locally computes $B_{il}$.
8:     The parties jointly compute $\Sigma_i^{(t+1)}$ by using Protocol 5.1.
9:     For each $l \in \{1, \ldots, n\}$, $P_l$ locally computes $z_{ijl}$.
10:   end for
11:   $t = t + 1$
12:   The parties jointly compute $\delta = |\log(L(\psi^{(t+1)})) - \log(L(\psi^{(t)}))|$
13: end while
Figure 5.2: Privacy preserving multi-party clustering
Input: Two parties, Alice and Bob, have $(n, x)$ and $(m, y)$, respectively.
Output: Alice obtains $r_1$, Bob obtains $r_2$.
1: Alice uniformly chooses $p$ from $F$ and defines $Q_1(z) = pz + pn$.
2: Alice and Bob privately evaluate $Q_1$; Bob obtains $b_1 = Q_1(m) = pm + pn$.
3: Bob randomly chooses $q \in F$ and defines $Q_2(z) = yz - (pm + pn)q$.
4: Alice and Bob privately evaluate $Q_2$; Alice obtains $a_1 = Q_2(p) = py - (pn + pm)q$.
5: Alice randomly chooses $r \in F$ and defines $Q_3(z) = -rz + py + px - (pn + pm)q$.
6: Alice and Bob privately evaluate $Q_3$; Bob obtains $b_2 = Q_3(pn + pm) = -r(pn + pm) + py + px - (pn + pm)q$.
7: Alice has $r_1 = r$ and Bob computes $r_2 = \frac{b_2}{b_1} + q = -r + \frac{x + y}{n + m}$.
So, the respective outputs of Alice and Bob are $r_1$ and $r_2$, giving us that $r_1 + r_2 = \frac{x + y}{n + m}$.
Figure 5.4: Secure mean sharing
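The algebra of Figure 5.4 can be verified with a short sketch over a prime field. The assumptions here are loud ones: the private polynomial evaluations (OPE) are replaced by direct evaluation in one process, so nothing is hidden; the field is taken to be the integers modulo the Mersenne prime 2^61 − 1; and the function name `secure_mean_shares` is hypothetical. The sketch only confirms that the two outputs add up to (x + y)/(n + m) in the field.

```python
import random

PRIME = 2**61 - 1   # field modulus assumed for this sketch

def secure_mean_shares(n, x, m, y, rng=random):
    """Return (r1, r2) with r1 + r2 = (x + y) / (n + m) in the field.

    Alice holds (n, x) and Bob holds (m, y). Each 'privately evaluate' step
    of the protocol is replaced here by evaluating the polynomial directly.
    """
    p = rng.randrange(1, PRIME)              # Alice's nonzero random mask
    b1 = (p * m + p * n) % PRIME             # Bob learns Q1(m) = pm + pn
    q = rng.randrange(PRIME)                 # Bob's random value
    a1 = (y * p - b1 * q) % PRIME            # Alice learns Q2(p)
    r = rng.randrange(PRIME)                 # Alice's output share
    b2 = (-r * b1 + a1 + p * x) % PRIME      # Bob learns Q3(b1)
    r1 = r
    r2 = (b2 * pow(b1, -1, PRIME) + q) % PRIME
    return r1, r2


r1, r2 = secure_mean_shares(n=3, x=40, m=2, y=60)
mean = (r1 + r2) % PRIME                     # (40 + 60) / (3 + 2) = 20
```

Each share on its own is uniformly distributed because r is random, which is what lets later steps of the two-party protocol work on the mean without either party seeing it.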
5.5 Conclusion
We have presented a privacy preserving EM-based clustering method for horizontally distributed data. First, a protocol for multi-party distributed data was proposed. We showed that the proposed protocol is more secure than the previous protocol, since it does not reveal extra information, and that it allows the number of participating parties to be arbitrary. Second, we proposed a better protocol for the two-party case. This protocol allows computing the final results without revealing private information or even the cluster centers.
Chapter 6
PRIVACY PRESERVING OUTLIER DETECTION
6.1 Introduction
Currently, there are a number of different definitions of outliers as well as techniques to find them, but only a few methods have been developed in a privacy preserving fashion, and these target distance-based outlier detection. In addition, there are statistical techniques for detecting outliers [Alqallaf et al., 2002, Hodge and Austin, 2004], but there is still no work on applying them in a privacy preserving fashion. Our contribution in this chapter is therefore the development of a privacy preserving solution for detecting outliers based on the statistical method.
6.2 Technical preliminaries
6.2.1 Problem statement
A data set X consists of N objects and n attributes that take real values. Denote the sample mean vector of X by $\bar{X}$, the sample covariance matrix of X by $C(X)$, and the $i$-th row of X by $X(i)$. Statistical methods [Hodge and Austin, 2004] for multivariate outlier detection compute the Mahalanobis distance:
$d_i^2 = (X(i) - \bar{X})^T C^{-1}(X) (X(i) - \bar{X})$
for $i = 1, \ldots, N$. A large distance indicates that the observation is an outlier. Therefore, in order to detect outliers, the main task is to effectively compute $C^{-1}(X)$ and $\bar{X}$.
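As a concrete reference point, the squared Mahalanobis distance can be computed in the clear for a small data set with two attributes. This is a non-private sketch: the function name `mahalanobis_sq` is hypothetical, and the restriction to two attributes is made here only so that the matrix inverse can be written out by hand.

```python
def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every row of a two-attribute data set."""
    N = len(rows)
    mean = [sum(r[c] for r in rows) / N for c in range(2)]
    centered = [[r[0] - mean[0], r[1] - mean[1]] for r in rows]
    # Sample covariance matrix C(X), normalized by N - 1.
    c = [[sum(v[a] * v[b] for v in centered) / (N - 1) for b in range(2)]
         for a in range(2)]
    # Closed-form inverse of a 2 x 2 matrix.
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det],
           [-c[1][0] / det, c[0][0] / det]]
    return [sum(v[a] * inv[a][b] * v[b] for a in range(2) for b in range(2))
            for v in centered]


# The four corners of a square are all equally far from the centre.
distances = mahalanobis_sq([(0, 0), (2, 0), (0, 2), (2, 2)])
```

For the square above, every point has squared distance 1.5. The distances always sum to (N − 1) times the number of attributes, which gives a quick sanity check on any implementation.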
Assume that X is vertically or horizontally distributed among K parties. The target of this work is to find solutions for conducting multivariate outlier detection under these circumstances, so that the parties learn which data objects are outliers while information about each object and the local statistical parameters is preserved.

6.2.2 Linear transformation
Let M be an $n \times n$ invertible matrix whose entries $m_{ij}$ take random real values, and let $Y = (Y_1, Y_2, \ldots, Y_n) = XM$ be the random $N \times n$ matrix obtained from the linear transformation of X by the matrix M.
Lemma 6.1. Let $G(X)$ be the Gram matrix of X, that is, $G(X) = X^T X$. Then the Gram matrix of Y is given by $G(Y) = M^T G(X) M$, and the inverse matrix of $G(X)$ is given by $G^{-1}(X) = M G^{-1}(Y) M^T$.
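The lemma follows directly from the definition of the Gram matrix. The short derivation below is a sketch that assumes, in addition to the invertibility of M stated above, that $G(X)$ itself is invertible.

```latex
\begin{align*}
G(Y) &= Y^{T} Y = (XM)^{T}(XM) = M^{T} X^{T} X M = M^{T} G(X) M, \\
G^{-1}(Y) &= M^{-1} G^{-1}(X) \,(M^{T})^{-1}
\;\Longrightarrow\;
M \, G^{-1}(Y) \, M^{T} = G^{-1}(X).
\end{align*}
```

The second line is what makes the transformation useful for privacy: a party that sees only $G(Y)$ can invert it, and the holder of M can then recover $G^{-1}(X)$ without $G(X)$ being disclosed to the inverting party.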
6.2.3 Privacy model
The privacy of the proposed protocols in this chapter is based on both the semi-honest model and the extension model of [Du et al., 2004]. Currently, the extension model is still considered a heuristic model, and its theoretical analysis is still being investigated. In addition, this model is weaker in terms of security than the semi-honest model, but using it can lead to solutions that are much more efficient than solutions based on the SMC security model. Theoretically, a protocol that satisfies the extension model might still disclose significant information. However, this does not happen in the situations considered in this chapter.
6.2.4 Private matrix product sharing
In this section, we present the private matrix product sharing (PMPS) protocol [Du et al., 2004], which is used as a building block for incorporating privacy preservation into the subsequent protocols.
Let $A = (a_{ij})_{m \times q}$ and $B = (b_{ij})_{q \times n}$ be two private matrices of the parties Alice and Bob, respectively. The goal of this protocol is to privately compute the function:
$(A, B) \to (S_a, S_b) \mid S_a + S_b = AB$.
6.3 Protocols for the horizontally distributed data
Assume that the data set X is horizontally distributed among K parties, where each party k has a set $X_k$ of $N_k$ objects; note that $N = \sum_{i=1}^{K} N_i$. Each element $C_{ij}$ of $C(X)$ is computed by
$C_{ij} = \frac{g_{ij}}{N - 1} = \frac{\sum_{k=1}^{K} g_{ij}^{(k)}}{\sum_{k=1}^{K} N_k - 1}$
where $g_{ij}^{(k)}$ can be locally computed by $P_k$.
In the next sections, we present two protocols for privacy preserving outlier detection in horizontally distributed data. Two important computations need to be implemented: computing $\bar{X}$ and $C^{-1}(X)$. Here $\bar{X}$ can be directly obtained by the multi-party division protocol presented in the previous chapter, without disclosing the raw data of each party. Consequently, the main work is to compute $C^{-1}(X)$.
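The decomposition of $C_{ij}$ into per-party terms can be illustrated in the clear. This sketch assumes that $g_{ij}^{(k)}$ is the sum, over the rows held by party k, of the products of deviations from the global mean; that reading, the function name `pooled_covariance`, and the fact that the global mean is computed openly here are all assumptions made for illustration.

```python
def pooled_covariance(parts):
    """Covariance of the joint data from per-party contributions.

    parts[k] is the list of rows held by party k; all rows have equal length.
    """
    dim = len(parts[0][0])
    N = sum(len(rows) for rows in parts)
    # Global mean vector (computed in the clear in this sketch).
    mean = [sum(r[c] for rows in parts for r in rows) / N for c in range(dim)]

    def local_terms(rows):
        """Party-local matrix of summed deviation products g^(k)."""
        return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
                 for j in range(dim)] for i in range(dim)]

    locals_ = [local_terms(rows) for rows in parts]
    # C_ij = (sum over parties of g_ij^(k)) / (N - 1)
    return [[sum(g[i][j] for g in locals_) / (N - 1) for j in range(dim)]
            for i in range(dim)]


C = pooled_covariance([[(0, 0), (2, 0)], [(0, 2), (2, 2)]])
```

For the two parties above, the result is the same covariance matrix that would be obtained from the four rows held together, with 4/3 on the diagonal and 0 elsewhere.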
6.3.1 Two-party protocol
Assume that the data set X is horizontally distributed between two parties, Alice and Bob. Alice has a set $X_1$ of $N_1$ objects and Bob has a set $X_2$ of $N_2$ objects. The protocol is given in Figure 6.2.
Input: Alice and Bob have the data sets $X^{(1)}$ and $X^{(2)}$, respectively.
Output: The Mahalanobis distance of each object.
1. The parties use the secure mean sharing protocol to compute $\bar{X}$.
2. The parties share the matrix $C(X)$ by using the secure mean sharing protocol. Alice obtains $C^{(1)}$ and Bob obtains $C^{(2)}$.
3. Alice generates a random matrix M. Alice and Bob use PMPS to share $C^{(2)} M$; Alice obtains $M^{(1)}$ and Bob obtains $M^{(2)}$.
4. Alice sends $C^{(1)} M + M^{(1)}$ to Bob.
5. Bob computes $C(Y) = C^{(1)} M + M^{(1)} + M^{(2)}$, then computes $C^{-1}(Y)$ and sends it to Alice.
6. Alice computes $C^{-1}(X) = M C^{-1}(Y) M^T$.
7. Each party uses $C^{-1}(X)$ and $\bar{X}$ to locally compute the Mahalanobis distance for each of its objects.
Figure 6.2: Protocol for two-party horizontally distributed data.
6.3.2 Multi-party protocol
The protocol is given in Figure 6.3.
6.4 Protocol for two-party vertically distributed data
Assume that the data set X is vertically distributed between two parties, Alice and Bob, where Alice has a subset $X_1$ and Bob has a subset $X_2$. In order to detect outliers, the first task is to compute $C^{-1}(X)$ and $\bar{X}$. Second, we use these parameters to compute the Mahalanobis distance for outlier detection. $\bar{X}$ can be directly obtained by local computation, so no raw data is disclosed while computing $\bar{X}$. Thus, the protocol consists of the following three tasks.
1. Using PMPS to compute $(\hat{X}_1, \hat{X}_2) \to (C^{(1)}, C^{(2)}) \mid C^{(1)} + C^{(2)} = (\hat{X})^T (\hat{X})$
2. Using the linear transformation to compute $(C^{(1)}, C^{(2)}) \to C^{-1}(X)$
3. Using the SPP protocol to compute the Mahalanobis distances
6.5 Experiments
We provide an experiment to evaluate the performance of the proposed protocols, which were implemented in C# and run on a PC. As communication cost depends on the network performance and the physical distance between the two parties, we simply modeled the parties as threads that exchange data directly through shared memory. The dataset used is the Breast Cancer Database from the UCI Machine Learning Repository. There are 569 data samples and 32 numeric attributes. In the thesis, we report the computation time for horizontally distributed data, where the data instances are uniformly distributed between two parties: the time is linear in N, and its dependence on n is negligible. For a typical scenario where n = 20 and N = 500, the computation time of the protocol is about 10.47 seconds.
Input: K parties; each party i has the data set $X^{(i)}$.
Output: The Mahalanobis distance of each object.
1. The parties use the secure sum protocol to compute $N = \sum_{i=1}^{K} N_i$.
2. Each party k locally computes the matrix $C^{(k)}$.
3. Party 1 generates a random matrix M; then each party i ($i = 2, \ldots, K$) and party 1 use the PMPS protocol to share $C^{(i)} M$. Party 1 obtains $M_i^{(1)}$ and party i obtains $M_i^{(2)}$.
4. Party 1 computes $C^{(1)} M + \sum_{i=2}^{K} M_i^{(1)}$; then the parties follow a communication round to compute $C(Y) = \sum_{i=1}^{K} C^{(i)} M$. At the end, Party K obtains $C(Y)$.
5. Party K computes $C^{-1}(Y)$ and sends it to Party 1.
6. Party 1 computes $C^{-1}(X) = M C^{-1}(Y) M^T$ and broadcasts this matrix to all other parties.
7. Each party uses $C^{-1}(X)$ and $\bar{X}$ to locally compute the Mahalanobis distance for each of its objects.
Figure 6.3: Protocol for multi-party horizontally distributed data
6.6 Conclusions
We have proposed a solution for privacy-preserving multivariate outlier detection on both vertically and horizontally distributed data models with two parties, and we extended the solution to the K-party horizontally distributed model. Our method is based on the following techniques: linear transformation, private matrix product sharing, secure mean computation and secure sum. We proved the privacy of the protocols based on both the semi-honest and extension models, and proved the correctness of the protocols based on Lemma 6.1. We provided experiments to show that our protocol is linear in the number of data attributes and the size of the database. Our solution is very efficient for horizontally distributed data, where its cost mainly depends on the number of attributes.
SUMMARY

This thesis has proposed four solutions for four problems in PPDM. For each solution, we provided an analysis to prove privacy and correctness based on the semi-honest security model and secure multi-party computation methods. We also evaluated the communication cost and computational complexity. In addition, we provided experimental results to show how efficient and practical the solutions are.
In the first work, we proposed a solution that allows a miner to learn frequency-based data mining models in the 2PFD setting while preserving each user's privacy. The crucial step in the proposed solution is the privacy-preserving computation of the frequencies of tuples of values in the users' data. We illustrated the applicability of the solution by using it to build a privacy preserving protocol for naive Bayes classifier learning. Experimental results show that our protocol is efficient.
In the second work, we proposed two novel protocols for frequent itemset mining in vertically distributed data: one of them reveals only the support count, and the other reveals nothing. The security of our protocols is better than that of previous protocols in that we achieve full privacy protection for each party. This property does not require the existence of any trusted party. In addition, no collusion of parties can cause a privacy breach unless all parties collude together, which does not occur in practice.
In the third work, we presented an EM-based clustering method for horizontally distributed data that preserves the privacy of the participating parties' data. First, a protocol for the multi-party model was proposed. The proposed protocol is more secure than the previous protocol, and it allows the number of participating parties to be arbitrary. Second, we proposed a better protocol for the case in which the dataset is horizontally partitioned into only two parts. This protocol allows computing the final results without revealing private information or even the cluster centers.
In the fourth work, we provided a solution for privacy-preserving multivariate outlier detection on both vertically and horizontally distributed data models. The proposed solution is based on linear transformation, the PMPS protocol and the secure mean sharing protocol. The privacy of the protocols in the solution is validated based on both the semi-honest and extension models. In addition, we provided experiments to show the efficiency of the protocols.