
B GIÁO DC VÀ ÀO TO B QUC PHÒNG
VIN KHOA HC VÀ CÔNG NGH QUÂN S
e
̌f



LNG TH DNG




DISTRIBUTED SOLUTIONS IN PRIVACY
PRESERVING DATA MINING
(Nghiên cu xây dng mt s gii pháp đm bo an toàn
thông tin trong quá trình khai phá d liu)







LUN ÁN TIN S TOÁN HC










Hà N

i - 2011


B GIÁO DC VÀ ÀO TO B QUC PHÒNG
VIN KHOA HC VÀ CÔNG NGH QUÂN S
ěf




LNG TH DNG




DISTRIBUTED SOLUTIONS IN PRIVACY
PRESERVING DATA MINING
(Nghiên cu xây dng mt s gii pháp đm bo an toàn
thông tin trong quá trình khai phá d liu)

Chuyên ngành: Bo đm toán hc cho máy tính và h thng tính toán.
Mã s : 62 46 35 01




LUN ÁN TIN S TOÁN HC


Ngi hng dn khoa hc:
1. GIÁO S - TIN S KHOA HC H TÚ BO
2. PHÓ GIÁO S - TIN S BCH NHT HNG




Hà Ni - 2011
Pledge
I pledge that this thesis is a presentation of my original research work. Any
content based on the work of others is drawn from reliable references, such as
papers published in distinguished international conferences and journals, and
books published by widely known publishers. The results and discussions of this
thesis are new and have not previously been published by any other authors.
Contents
1 INTRODUCTION 1
1.1 Privacy-preserving data mining: An overview . . . . . . . . . 1
1.2 Objectives and contributions . . . . . . . . . . . . . . . . . . 5
1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . 12
2 METHODS FOR SECURE MULTI-PARTY COMPUTATION 13
2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Computational indistinguishability . . . . . . . . . . . 13
2.1.2 Secure multi-party computation . . . . . . . . . . . . . 14
2.2 Secure computation . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.1 Secret sharing . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Secure sum computation . . . . . . . . . . . . . . . . . 16
2.2.3 Probabilistic public key cryptosystems . . . . . . . . . 17
2.2.4 Variant ElGamal Cryptosystem . . . . . . . . . . . . . 18
2.2.5 Oblivious polynomial evaluation . . . . . . . . . . . . 20
2.2.6 Secure scalar product computation . . . . . . . . . . . 21
2.2.7 Privately computing ln x . . . . . . . . . . . . . . . . . 22
3 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Privacy preserving frequency mining in 2PFD setting . . . . . 27
3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . 27
3.2.2 Definition of privacy . . . . . . . . . . . . . . . . . . . 29
3.2.3 Frequency mining protocol . . . . . . . . . . . . . . . 30
3.2.4 Correctness Analysis . . . . . . . . . . . . . . . . . . . 32
3.2.5 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 34
3.2.6 Efficiency of frequency mining protocol . . . . . . . . 37
3.3 Privacy Preserving Frequency-based Learning in 2PFD Setting 38
3.3.1 Naive Bayes learning problem in 2PFD setting . . . . 38
3.3.2 Naive Bayes learning Protocol . . . . . . . . . . . . . . 40
3.3.3 Correctness and privacy analysis . . . . . . . . . . . . 42
3.3.4 Efficiency of naive Bayes learning protocol . . . . . . . 42
3.4 An improvement of frequency mining protocol . . . . . . . . . 44
3.4.1 Improved frequency mining protocol . . . . . . . . . . 44
3.4.2 Protocol Analysis . . . . . . . . . . . . . . . . . . . . . 45
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY DISTRIBUTED DATA 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Association rules and frequent itemset . . . . . . . . . 51

4.2.2 Frequent itemset identification in vertically distributed data 52
4.3 Computational and privacy model . . . . . . . . . . . . . . . 53
4.4 Support count preserving protocol . . . . . . . . . . . . . . . 54
4.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Protocol design . . . . . . . . . . . . . . . . . . . . . . 56
4.4.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 57
4.4.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 59
4.4.5 Performance analysis . . . . . . . . . . . . . . . . . . . 61
4.5 Support count computation-based protocol . . . . . . . . . . 64
4.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.2 Protocol Design . . . . . . . . . . . . . . . . . . . . . . 65
4.5.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 65
4.5.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 67
4.5.5 Performance analysis . . . . . . . . . . . . . . . . . . . 68
4.6 Using binary tree communication structure . . . . . . . . . . 69
4.7 Privacy-preserving distributed Apriori algorithm . . . . . . . 70
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 PRIVACY PRESERVING CLUSTERING 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Privacy preserving clustering for the multi-party distributed data 76
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Private multi-party mean computation . . . . . . . . . 78
5.3.3 Privacy preserving multi-party clustering protocol . . 80
5.4 Privacy preserving clustering without disclosing cluster centers 82
5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.2 Privacy preserving two-party clustering protocol . . . 85
5.4.3 Secure mean sharing . . . . . . . . . . . . . . . . . . . 87
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6 PRIVACY PRESERVING OUTLIER DETECTION 91
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Technical preliminaries . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . 92
6.2.2 Linear transformation . . . . . . . . . . . . . . . . . . 93
6.2.3 Privacy model . . . . . . . . . . . . . . . . . . . . . . 94
6.2.4 Private matrix product sharing . . . . . . . . . . . . . 95
6.3 Protocols for the horizontally distributed data . . . . . . . . . 95
6.3.1 Two-party protocol . . . . . . . . . . . . . . . . . . . . 97
6.3.2 Multi-party protocol . . . . . . . . . . . . . . . . . . . 100
6.4 Protocol for two-party vertically distributed data . . . . . . . 101
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
SUMMARY 107
Publication List 110
Bibliography 111
List of Phrases
Abbreviation Full name
PPDM Privacy Preserving Data Mining
k-NN k-nearest neighbor
EM Expectation-maximization
SMC Secure Multiparty Computation
DDH Decisional Diffie-Hellman
PMPS Private Matrix Product Sharing
SSP Secure Scalar Product
OPE Oblivious polynomial evaluation
ICA Independent Component Analysis
2PFD 2-part fully distributed setting
FD fully distributed setting

$\overset{c}{\equiv}$ computational indistinguishability
List of Tables
4.1 The communication cost . . . . . . . . . . . . . . . . . . . . 62
4.2 The complexity of the support count preserving protocol . . . 63
4.3 The parties' time for the support count preserving protocol 64
4.4 The communication cost . . . . . . . . . . . . . . . . . . . . 68
4.5 The complexity of the support count computation protocol . 69
4.6 The parties' time for the support count computation protocol 70
6.1 The parties' computational time for the horizontally distributed data 105
6.2 The parties' computational time for the vertically distributed data 105
List of Figures
3.1 Frequency mining protocol . . . . . . . . . . . . . . . . . . . . 33
3.2 The time used by the miner for computing the frequency f . 38
3.3 Privacy preserving protocol of naive Bayes learning . . . . . . 41
3.4 The computational time for the first phase and the third phase 43
3.5 The time for computing the key values in the first phase . . 43
3.6 The time for computing the frequency f in the third phase . . 44
3.7 Improved frequency mining protocol . . . . . . . . . . . . . . 47
4.1 Support count preserving protocol. . . . . . . . . . . . . . . . 58
4.2 The support count computation protocol. . . . . . . . . . . . 66
4.3 Privacy-preserving distributed Apriori protocol . . . . . . . . 72
5.1 Privacy preserving multi-party mean computation . . . . . . 79
5.2 Privacy preserving multi-party clustering protocol . . . . . . 81
5.3 Privacy preserving two-party clustering . . . . . . . . . . . . 86
5.4 Secure mean sharing . . . . . . . . . . . . . . . . . . . . . . . 89
6.1 Private matrix product sharing (PMPS). . . . . . . . . . . . . 96
6.2 Protocol for two-party horizontally distributed data. . . . . . 98

6.3 Protocol for multi-party horizontally distributed data. . . . . 101
6.4 Protocol for two-party vertically distributed data. . . . . . . . 103
Chapter 1
INTRODUCTION
1.1. Privacy-preserving data mining: An overview
Data mining plays an important role in the modern world and provides us with powerful tools to efficiently discover valuable information in large databases [25]. However, the process of mining data can result in violations of privacy, so issues of privacy preservation in data mining are receiving more and more attention from the research community [52]. As a result, a large number of studies have been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies deal with the problem of learning data mining models from databases while protecting data privacy at the level of individual records or at the level of organizations.

Basically, there are three major problems in PPDM [8]. First, organizations such as government agencies wish to publish their data for researchers and even the community; however, they want to preserve data privacy, for example for highly sensitive financial and health data. Second, a group of organizations (or parties) wishes to jointly obtain the mining result on their combined data without disclosing any party's private information. Third, a miner wishes to collect data, or obtain data mining models, from individual users while preserving the privacy of each user. Consequently, PPDM can be divided into the following three areas, depending on the model of information sharing.
Privacy-preserving data publishing: The model in this research area consists of a single organization, the trusted data holder, which wishes to publish its data to the miner or the research community so that the anonymized data remain useful for data mining applications. For example, hospitals collect records from their patients in the course of providing medical services. These hospitals can act as trusted data holders, but the patients may not trust the miner to whom the hospitals send their data.
There is ample evidence that publishing anonymized data can still lead to privacy breaches via various attacks. One of them, called re-identification, was demonstrated in [61]: 87% of the American population have characteristics that identify them uniquely from just a few public attributes, namely zip code, date of birth, and sex. Consequently, privacy-preserving data publishing has received much attention in recent years; it aims to prevent re-identification attacks while preserving the information in the released data that is useful for data mining applications.
The general technique, called k-anonymity [51], [82], [6], aims to establish a property of the released data that protects it against the possibility of re-identification. Consider a private data table from which the explicit identifiers (e.g., SSN and Name) have been removed. Values of other released attributes, such as ZIP, Date of birth, Marital status, and Sex, can also appear in external sources that link them to individual users' identities. If some combination of values of these attributes occurs uniquely or rarely, then by observing the data one can determine the identity of a user, or deduce a small candidate set that contains the user. The goal of k-anonymity is that every tuple in the released private table be indistinguishable from the tuples of at least k − 1 other users.
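
To make the property concrete, here is a minimal sketch (in Python, over a hypothetical toy table of our own; not from the thesis) of checking whether a released table is k-anonymous with respect to a chosen set of quasi-identifier attributes:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Every combination of quasi-identifier values must occur
    in at least k records of the released table."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical toy table: generalized ZIP and Sex are the quasi-identifiers.
released = [
    {"zip": "100**", "sex": "M", "diagnosis": "flu"},
    {"zip": "100**", "sex": "M", "diagnosis": "cold"},
    {"zip": "100**", "sex": "F", "diagnosis": "flu"},
    {"zip": "100**", "sex": "F", "diagnosis": "asthma"},
]
print(is_k_anonymous(released, ["zip", "sex"], k=2))  # True
```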
Privacy-preserving distributed data mining: This research area aims to develop distributed data mining algorithms that do not access the original data [33, 79, 35, 68, 80, 40]. Different from privacy-preserving data publishing, each study in privacy-preserving distributed data mining typically solves a specific data mining task. The model in this area usually consists of several parties, each holding one private data set. The general purpose is to enable the parties to mine cooperatively on their joint data sets without revealing private information to the other participating parties. Here, the way the data are distributed among the parties also plays an important role in the problem being solved. Generally, data can be distributed either vertically or horizontally.
In the horizontal distribution, a data set is distributed among several parties, and every party holds data records with the same set of attributes. An example is the union of the customer databases of different banks. Banks typically offer their clients different services, such as savings accounts, credit cards, and stock investments. Suppose the banks wish to predict which customers are safe, which may be risky, and which may be frauds. Gathering all kinds of financial data about their customers and transactions can help them make these predictions and thus prevent huge financial losses: suitable mining techniques applied to the gathered data can generalize over these datasets and identify possible risks for future cases or transactions. More specifically, suppose a customer, Nam, goes to Bank A to apply for a loan to buy a car and provides the necessary information to Bank A. An expert system of Bank A can use the k-NN algorithm to classify Nam as either a risky or a safe customer. If this system uses only the database of Bank A, it could happen that Bank A does not have enough customers similar to Nam, and the system may produce a wrong classification: Nam may be a safe customer, yet the system labels him as risky, and Bank A loses profit. Clearly, mining a larger database can yield a more accurate classification, so classification on the joint databases of Bank A and other banks might give a more accurate result, and Nam could have been classified as safe. The problem, however, is that privacy restrictions would not allow the banks to access each other's databases; privacy-preserving distributed data mining addresses exactly this problem. This scenario is a typical case of so-called horizontally partitioned data. In the context of privacy-preserving data mining, the banks do not need to reveal their databases to each other: they can still apply k-NN classification to their joint databases while preserving each bank's private information.
In the vertical distribution, a data set is distributed among several parties, and every party owns a vertical part of every record in the database (it holds the records for a subset of the attributes). For example, financial transaction information is collected by banks, while tax information is collected by the IRS. In [71], Jaideep Vaidya et al. give another illustrative example of two vertically distributed databases: one contains people's medical records, while the other contains cell phone information for the same set of people. Mining the joint global database might yield patterns such as "cell phones with Li/Ion batteries lead to brain tumors in diabetics."
Privacy-preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users to learn data mining results based on the user data, or collects the user data, while the sensitive attributes of these users need to be protected [74, 77, 19]. In this scenario, each user maintains only one data record. This can be thought of as a horizontally partitioned database in which each transaction is owned by a different user; it is also called the fully distributed (FD) setting. Unlike in privacy-preserving data publishing, the miner differs from the publisher in that he is untrusted: he could be an attacker who attempts to identify sensitive information from the user data. For example, Du et al. [19] studied building decision trees on private data. In this study, a miner wants to collect data from users to form a central database, and then to conduct data mining on this database. Thus, he issues a survey containing some questions; each user is required to answer those questions and send back the answers. However, the survey contains some sensitive questions, and a user may not feel comfortable disclosing his answers. The problem, then, is how the miner can obtain the mining model without learning sensitive information about the users. One requirement for methods in this area is that there be no interaction between users: each user communicates only with the data miner. Nevertheless, we must still ensure that nothing about the sensitive data beyond the desired results is revealed to the data miner.
1.2. Objectives and contributions
Up to now, many solutions have been proposed for the issues in PPDM. The quality of each solution is evaluated by three basic characteristics: privacy degree, accuracy, and efficiency. The problem is that each solution is typically usable only in a particular distributed scenario or with a concrete data mining algorithm. Some solutions can be applied in more than one scenario or algorithm, but their accuracy falls below the acceptable requirement; others reach the required accuracy, but their privacy is poor. In addition, PPDM solutions are clearly lacking both for various practical contexts and for well-known data mining techniques. In this thesis, we aim at solving the following issues in PPDM:

1. To introduce a new scenario for privacy-preserving user data mining and find a good privacy solution for a family of frequency-based learning algorithms in this scenario.

2. To develop novel privacy-preserving techniques for popular data mining algorithms such as association rule mining and clustering methods.

3. To present a technique for designing protocols for privacy-preserving multivariate outlier detection in both horizontally and vertically distributed data models.

The developed solutions will be evaluated in terms of the degree of privacy protection, correctness, usability in real-life applications, efficiency, and scalability.

The contribution of this thesis is to provide solutions for four problems in PPDM. Each problem has a statement independent of the others, but they share a common interpretation: given a data set distributed among several parties (or users), our task is to mine knowledge from all parties' joint data while preserving the privacy of each party. The problems differ in their distributed data models (scenarios) and in the functions proposed to keep the parties' information private. In summary, our contributions in this thesis are as follows.
• In the first work (Chapter 3), we propose a new scenario for privacy-preserving user data mining called the 2-part fully distributed setting (2PFD) and find a solution for a family of frequency-based learning algorithms in this setting. In 2PFD, the dataset is distributed across a large number of users, and each record is owned by two different users: one user knows the values of a subset of attributes, while the other knows the values of the remaining attributes. A miner aims to learn, for example, classification rules on their data while preserving each user's privacy. In this work, we develop a cryptographic solution for frequency-based learning methods in 2PFD. The crucial step in the proposed solution is the privacy-preserving computation of the frequencies of a tuple of values in the users' data, which ensures each user's privacy without loss of accuracy. We illustrate the applicability of the method by using it to build a privacy-preserving protocol for naive Bayes classifier learning, and briefly address the solution in other applications. Experimental results show that our protocol is efficient.
• The second contribution of this thesis (Chapter 4) is a pair of novel protocols for privacy-preserving frequent itemset mining in vertically distributed data. These protocols allow a group of parties to cooperatively mine frequent itemsets in a distributed setting without revealing each party's portion of the data to the others. The key security property of our protocols improves on previous protocols in that we achieve full privacy protection for each party: the property does not require the existence of any trusted party, and no collusion of parties can cause a privacy breach.
• In the third work (Chapter 5), we present expectation-maximization (EM) mixture-model clustering methods for distributed data that preserve the privacy of the participating parties' data. First, a privacy-preserving EM-based clustering method for multi-party distributed data is proposed. Unlike the existing method, our method does not reveal the sums of the numerator and denominator in the secure computation of the parameters of the EM algorithm; therefore, the proposed method is more secure, and it allows the number of participating parties to be arbitrary. Second, we propose a better method for the case in which the dataset is horizontally partitioned into only two parts; this method allows computing covariance matrices and final results without revealing the private information and the means. To this end, we present a protocol based on oblivious polynomial evaluation and the secure scalar product for addressing subproblems such as the computation of means, covariance matrices, and posterior probabilities. This approach allows two or more parties to cooperatively conduct clustering on their joint data sets without disclosing each party's private data to the others.
• In the fourth work (Chapter 6), we study the setting where several parties, each with a private data set, want to conduct outlier detection on their joint data set, but none of them wants to disclose its private data to the other parties. We propose a linear transformation technique to design protocols for secure multivariate outlier detection in both horizontally and vertically distributed data models. Different from most previous privacy-preserving techniques, which address distance-based outlier detection, our focus is on non-distance-based statistical techniques for detecting outliers.

1.3. Related works
Recently, many solutions have been proposed for PPDM. These solutions can be categorized into two main approaches: the secure multiparty computation (SMC) approach and the randomization approach.

The basic idea of the randomization approach is to perturb the original (private) dataset and release the result for data analysis. The perturbation has to ensure that the original individual data values cannot be recovered, while preserving the utility of the data for statistical purposes, so that patterns in the original data can still be mined. There are two main perturbation techniques: random transformation and randomization. The first transforms each data value (record) into a random value (record) from the same domain as the original data, in ways that preserve certain statistics but hide the real values [21, 4, 19, 3]. The second adds noise to the data to prevent discovery of the real values; it must ensure that, given the distribution of the added noise and the randomized data set, the distribution (but not the actual data values) of the original data set can be reconstructed [1, 36, 16].
A typical example is the algorithm of Agrawal and Srikant [2], in which the values of an attribute are discretized into intervals and each original value is assigned to an interval. The original data distribution is then reconstructed by a Bayesian approach, and decision trees can be induced based on the reconstructed distribution. Many other distribution reconstruction methods have also been introduced. In [1], Agrawal et al. developed an approach based on expectation maximization that also gave a better definition of privacy and an improved algorithm. Evfimievski et al. [21] used a similar technique for association rule mining. Polat et al. proposed a privacy-preserving collaborative filtering method using randomized techniques [53].
Although perturbation techniques are very efficient, their use generally involves a tradeoff between privacy and accuracy: the more privacy we require, the more accuracy the miner loses in the data mining results, and vice versa. Even the very techniques that allow us to reconstruct distributions can reveal information about the original data values. For example, consider randomizing an age attribute, and suppose it is known that no driver is under 18. Assume that randomization is implemented by adding noise chosen uniformly at random from the range [−10, 10]. The reconstructed distribution does not show any true age value; it only tells us, say, that a driver whose randomized age is 40 has a true age somewhere in the range [30, 50]. However, if an age value in the randomized set is 8, then the true age lies in [−2, 18]; since no driver is under 18, the driver whose age is given as 8 in the randomized data must be exactly 18 years old in the original data.
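
The following small simulation (our own sketch, not code from the thesis) illustrates both halves of this argument: the noisy values hide individual ages, yet boundary values combined with domain knowledge pin some records down exactly:

```python
import random

def randomize_ages(true_ages, noise_range=10):
    """Additive randomization: each reported age is true age + uniform noise."""
    return [age + random.randint(-noise_range, noise_range) for age in true_ages]

true_ages = [18, 25, 40, 63]
reported = randomize_ages(true_ages)

# The miner only learns an interval for each driver ...
for r in reported:
    low, high = r - 10, r + 10
    # ... but domain knowledge (no driver is under 18) can shrink it.
    low = max(low, 18)
    print(f"reported {r}: true age in [{low}, {high}]")
# If a reported value is 8, the interval collapses to [18, 18]:
# the randomization has fully revealed that driver's age.
```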
Thus, some work has been done on measuring the privacy of randomization techniques, with the message that they must be used carefully to obtain the desired privacy. Kargupta et al. [36] formally analyzed the privacy of randomization techniques and showed that in many cases they reveal private information. Evfimievski et al. [20] showed how to limit privacy breaches when using randomization for privacy-preserving data mining.
Many privacy-preserving data mining algorithms based on SMC have been proposed as well. They can be described as a computational process in which a group of parties computes a function of private inputs, where no party wants to disclose its own input to any other party. The secure multiparty computation framework was developed by Goldreich [24]. In this framework, multiparty protocols fall into either the semi-honest model or the malicious adversary model. In the semi-honest model, the parties are assumed to follow the protocol rules, but after the execution of the protocol has completed, they may still try to learn additional information by analyzing the messages they received during the execution. In the malicious adversary model, it is assumed that the parties can execute arbitrary operations to damage the other parties; protocol design in this model is therefore much more difficult than in the semi-honest model. At present, however, the semi-honest model is the one usually adopted in the context of privacy-preserving data mining. The formal definitions of SMC are stated in [24].

The secure multi-party computation problem was first proposed by Yao [78], who gave a method for Yao's millionaires problem: comparing the wealth of two millionaires without revealing any private information about either person. According to the theoretical studies of Goldreich, the general SMC problem can be solved by the circuit evaluation method; however, this solution is not practical in terms of efficiency. Therefore, finding efficient problem-specific solutions is seen as an important research direction. In recent years, many specific solutions have been introduced in different research areas such as information retrieval, computational geometry, and statistical analysis [17, 10, 70].
Randomization approaches [21, 4, 19, 3, 1, 36, 16] can be used in the fully distributed scenario, where a data miner wants to obtain classification models from the data of large sets of users. These users can simply randomize their data and then submit the randomized data to the miner, who can later reconstruct some useful information. To obtain strong privacy without loss of accuracy, however, SMC techniques have been proposed in [74, 77]. The key idea of these techniques is a private frequency computation method that allows a data miner to compute the frequencies of values or tuples in the FD setting while preserving the privacy of each user's data. In Chapter 3, we propose an SMC solution that allows the miner to learn frequency-based models in the 2PFD setting. Note that in this setting, each user may know only some values of the tuple but not all of them; therefore, the above-mentioned cryptographic approaches cannot be used in the 2PFD setting. In the FD setting, other solutions based on k-anonymization of users' data have been proposed in [83, 77]. The advantage of these solutions is that they do not depend on the underlying data mining task, because the anonymized data can be used for various data mining tasks without disclosing private information. However, these solutions are inapplicable in the 2PFD setting, because the miner cannot link the two anonymous parts of one object with each other.
SMC approaches are also commonly used for privacy-preserving distributed data mining, where data are distributed across several parties. The privacy property of privacy-preserving distributed data mining algorithms is quantified by the privacy definition of SMC, where each party involved in the protocol is allowed to learn only the desired data mining models and no other information. Generally, each protocol has to be designed for a specific task, for reasons of efficiency and privacy. Specific privacy-preserving distributed protocols have been proposed to address different data mining problems over distributed databases; e.g., [70, 63, 18] develop privacy-preserving classification protocols for vertically distributed data based on a secure scalar product method, including privacy-preserving protocols for learning naive Bayes classifiers, association rules, and decision trees. In [34], privacy-preserving naive Bayes classification is addressed for horizontally distributed data by computing the secure sum of the local frequencies of all participating parties. Our work in Chapter 4 presents frequent itemset mining protocols for vertically partitioned data. Distributed association rule/itemset mining has been addressed for both vertically and horizontally partitioned data [33, 79, 35, 68, 80]. To the best of our knowledge, however, these protocols preserve the privacy of each party but only resist the collusion of at most n − 2 corrupted parties. Our protocols for privacy-preserving frequent itemset mining involving multiple parties protect the privacy of each party against collusions of up to n − 1 corrupted parties.
Privacy-preserving distributed clustering problems have also been studied by many authors recently. In [49] and [47], the authors focus on transformation techniques that enable a data owner to share its data with another party who will cluster it. Clifton and Vaidya proposed a secure multi-party computation of the k-means algorithm on vertically partitioned data [66]. In [29], the authors propose a solution for privacy-preserving clustering on horizontally partitioned data, focusing primarily on hierarchical clustering methods that can both discover clusters of arbitrary shapes and deal with different data types. In [59], Kruger et al. proposed a privacy-preserving distributed k-means protocol on horizontally partitioned data whose key step is the privacy-preserving computation of cluster means: at each iteration of the algorithm, only the means are revealed to the parties, and nothing else. However, revealing the means might allow the parties to learn some extra information about each other. To our knowledge, there is so far only one secure method for expectation-maximization (EM) mixture models over horizontally distributed sources [40], based on secure sum computation. However, this method requires at least three participating parties: because the global model is a sum of local models, in the case of only two parties, which often happens in practice, each party could compute the other party's local model by subtracting its own local model from the global model. The aim of our work in Chapter 5 is, first, to develop a more general and more secure protocol that allows the number of participating parties to be arbitrary, and second, to propose a better method for the case in which the dataset is horizontally partitioned into only two parts.
For privacy-preserving outlier detection: while there are a number of different definitions of outliers, as well as techniques to find them, the methods developed so far in a privacy-preserving fashion address only distance-based outlier detection. There are other, non-distance-based techniques for detecting outliers in statistics [12], but there is still no work on finding them in a privacy-preserving fashion [3]. The Mahalanobis distance has been used for outlier detection in several works [10], [12]. In Chapter 6, we propose solutions for privacy-preserving outlier detection in both vertically and horizontally distributed data. Our work is related to the work on secure sound classification in [57], in which single Gaussian models are used for classification. However, that work solves the scenario of two-party secure classification, where the parties engage in a protocol that allows one party to classify her data using the other's classifier without revealing any of her private information, while learning nothing about the classifier. Therefore, our purpose and method differ from that work.
1.4. Organization of thesis
The thesis consists of six chapters over 109 A4 pages. Chapter 1 presents an overview of PPDM and related works. Chapter 2 presents the basic definitions of secure multi-party computation and the techniques we frequently use. Chapter 3 proposes privacy-preserving frequency-based learning algorithms in the 2PFD setting. Chapter 4 presents two privacy-preserving algorithms for distributed mining of frequent itemsets. Chapter 5 discusses privacy-preserving EM-based clustering protocols. Chapter 6 presents privacy-preserving outlier detection for both vertically and horizontally distributed data. A summary of the thesis is presented in the last section.
Chapter 2
METHODS FOR SECURE MULTI-PARTY COMPUTATION
In this thesis, we use secure multi-party computation (SMC) and cryptographic tools as the building blocks for designing privacy-preserving data mining protocols. Before going into details, this chapter first reviews some important definitions of SMC and then summarizes the techniques that will be used in the following chapters.
2.1. Definitions
2.1.1. Computational indistinguishability

In this section, we review basic definitions from computational complexity theory that will be used in this thesis [24]. The following is the standard definition of a negligible function.
Definition 2.1. Let $\mathbb{N}$ be the set of natural numbers. We say that a function $\epsilon(\cdot) : \mathbb{N} \rightarrow (0, 1]$ is negligible in $n$ if for every positive polynomial $poly(\cdot)$ there exists an integer $n_0 > 0$ such that for all $n > n_0$:
$$\epsilon(n) < \frac{1}{poly(n)}.$$
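
As a quick illustration (our own example, not from the text): $\epsilon(n) = 2^{-n}$ is negligible, whereas an inverse polynomial is not:

```latex
% 2^{-n} shrinks faster than every inverse polynomial:
\forall\, poly(\cdot)\ \exists\, n_0\ \forall n > n_0:\quad 2^{-n} < \frac{1}{poly(n)},
% but \epsilon(n) = 1/n^2 is not negligible: taking poly(n) = n^3,
\frac{1}{n^2} > \frac{1}{n^3} \quad \text{for all } n > 1.
```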
Computational indistinguishability is another important concept when discussing the security properties of distributed protocols [24]. Let $X = \{X_n\}_{n \in \mathbb{N}}$ be an ensemble indexed by a security parameter $n$ (which usually refers to the length of the input), where the $X_n$'s are random variables.
Definition 2.2. Two ensembles, $X = \{X_n\}_{n \in \mathbb{N}}$ and $Y = \{Y_n\}_{n \in \mathbb{N}}$, are computationally indistinguishable in polynomial time if for every probabilistic polynomial-time algorithm $A$,
$$|Pr(A(X_n) = 1) - Pr(A(Y_n) = 1)|$$
is a negligible function in $n$. In this case we write $X \overset{c}{\equiv} Y$, where $\overset{c}{\equiv}$ denotes computational indistinguishability.
2.1.2. Secure multi-party computation
This section reviews the secure multiparty computation framework developed by Goldreich [24].

Secure multiparty computation function

Consider a distributed network with $n$ participating parties. A secure $n$-party computation problem can generally be considered as the computation of a function
$$f(x_1, x_2, \ldots, x_n) \rightarrow (f_1(x_1, x_2, \ldots, x_n), \ldots, f_n(x_1, x_2, \ldots, x_n))$$
where each party $i$ knows only its private input $x_i$. For security, it is required that the privacy of any honest party's input be protected, in the sense that each dishonest party $i$ learns nothing except its own output $y_i = f_i(x_1, x_2, \ldots, x_n)$. If any malicious party may deviate from the protocol, it is also required that each honest party obtain a correct result whenever possible.
Privacy in the semi-honest model

In the distributed setting, let $\pi$ be an $n$-party protocol for computing $f$, and let $\overline{x}$ denote $(x_1, \ldots, x_n)$. The view of the $i$-th party ($i \in [1, n]$) during an execution of $\pi$ on $\overline{x}$ is denoted by $view^{\pi}_{i}(\overline{x})$; it includes $x_i$, all received messages, and all internal coin flips. For every subset $I = \{i_1, \ldots, i_t\}$ of $[1, n]$, let $f_I(\overline{x})$ denote $(y_{i_1}, \ldots, y_{i_t})$ and $view^{\pi}_{I}(\overline{x}) = (I, view^{\pi}_{i_1}(\overline{x}), \ldots, view^{\pi}_{i_t}(\overline{x}))$. Let $OUTPUT(\overline{x})$ denote the output of all parties during the execution of $\pi$.
Definition 2.3. An $n$-party computation protocol $\pi$ for computing $f(\cdot, \ldots, \cdot)$ is secure with respect to semi-honest parties if there exists a probabilistic polynomial-time algorithm $S$ such that for every $I \subset [1, n]$ we have
$$\{S(x_{i_1}, \ldots, x_{i_t}, f_I(\overline{x})), f(\overline{x})\} \overset{c}{\equiv} \{view^{\pi}_{I}(\overline{x}), OUTPUT(\overline{x})\}.$$
This definition states that the view of the parties in $I$ can be simulated from those parties' inputs and outputs alone. If the function is privately computed by the protocol, then the privacy of each party's input data is protected. In this thesis, we focus on designing privacy-preserving protocols in the semi-honest model; the formal definition of a secure protocol in the malicious model can be found in [24].

In this thesis, we also use the composition theorem for the semi-honest model, whose discussion and proof can be found in [24]. The composition theorem lets a protocol be decomposed into several sub-protocols: the security of the protocol is proved if we can show that its sub-protocols are secure.

Theorem 2.1 (Composition theorem). Suppose that g is privately reducible to f, and that there exists a protocol for privately computing f. Then there exists a protocol for privately computing g.
2.2. Secure computation
2.2.1. Secret sharing
Secret sharing refers to any method by which a secret can be shared among multiple parties in such a way that no single party knows the secret, but the secret is easy to reconstruct by combining the shares of a sufficient number of parties.
In a two-party case, Alice and Bob share a value $z$ in such a way that Alice holds $(x, n)$, Bob holds $(y, m)$, and $z$ is equal to $(x + y)/(m + n)$. This is called secret mean sharing. The result of the sharing allows Alice and Bob to obtain random values $r_A$ and $r_B$, respectively, where $r_A + r_B = z$. The protocol for this problem will be described in Chapter 5.
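
As an illustration of the interface only — simulated here with a trusted dealer, which the actual Chapter 5 protocol dispenses with — the sharing can be sketched as follows:

```python
import random

def secret_mean_share(x, n, y, m, bound=10**6):
    """Illustrative dealer-based sketch of secret mean sharing:
    compute z = (x + y) / (n + m) and split it into additive
    shares r_A + r_B = z. The real two-party protocol (Chapter 5)
    produces such shares without any dealer."""
    z = (x + y) / (n + m)
    r_a = random.uniform(-bound, bound)  # Alice's random share
    r_b = z - r_a                        # Bob's share completes the sum
    return r_a, r_b

r_a, r_b = secret_mean_share(x=120.0, n=3, y=80.0, m=2)
print(round(r_a + r_b, 6))  # reconstructs z = 200 / 5 = 40.0
```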
Shamir secret sharing is a threshold scheme [56]. In Shamir secret sharing, there are $n$ parties and a polynomial $P$ of degree $k - 1$ such that $P(0) = S$, where $S$ is the secret. Each of the $n$ parties holds a point on the polynomial $P$. Because $k$ points $(x_i, y_i)$ ($i = 1, \ldots, k$) uniquely define a polynomial $P$ of degree $k - 1$, any subset of at least $k$ parties can reconstruct the secret $S$ by polynomial interpolation, but fewer than $k$ parties cannot. This scheme is also called $(n, k)$ Shamir secret sharing.
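
A minimal sketch of $(n, k)$ Shamir sharing over a prime field (toy parameters of our own choosing, not code from the thesis):

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; all arithmetic is mod PRIME

def shamir_share(secret, n, k):
    """Split `secret` into n shares; any k of them reconstruct it."""
    # Random polynomial P of degree k-1 with P(0) = secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def P(x):
        return sum(c * pow(x, j, PRIME) for j, c in enumerate(coeffs)) % PRIME
    return [(x, P(x)) for x in range(1, n + 1)]

def shamir_reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers P(0) = secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse (Fermat).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = shamir_share(secret=123456789, n=5, k=3)
print(shamir_reconstruct(shares[:3]))  # any 3 shares suffice: 123456789
```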
2.2.2. Secure sum computation
A simple example of efficient SMC that illustrates the idea of privacy-preserving computation is the secure sum protocol [10].
Assume that there are $n$ parties $P_1, P_2, \ldots, P_n$ such that each $P_i$ has a private data item $d_i$. The parties wish to compute $\sum_{i=1}^{n} d_i$ without revealing their private data $d_i$ to each other; we assume that $\sum_{i=1}^{n} d_i$ lies in the range $[0, p]$. In the secure sum protocol, one party is designated as the master party and is given the identity $P_1$. At the beginning, $P_1$ chooses a uniform random number $r$ from $[0, p]$ and sends the sum $D = d_1 + r \bmod p$ to the party $P_2$. Since the value of $r$ is chosen uniformly from $[0, p]$, the number $D$ is also distributed uniformly across this region, so $P_2$ learns nothing about the actual value of $d_1$.

Each remaining party $P_i$ ($i = 2, \ldots, n$) does the following: it receives
$$D = r + \sum_{l=1}^{i-1} d_l \bmod p.$$
Since this value is uniformly distributed across $[0, p]$, party $P_i$ learns nothing. Party $P_i$ then computes
$$(d_i + D) \bmod p = r + \sum_{l=1}^{i} d_l \bmod p$$
and passes it to the party $(i + 1) \bmod n$.
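
The following is a minimal simulation of this protocol (our own sketch; the final unmasking step by $P_1$, which the description above leads to, is included for completeness):

```python
import random

def secure_sum(private_values, p):
    """Simulate the ring-based secure sum protocol: every message a party
    sees is masked by r, so it learns nothing about the partial sums."""
    assert 0 <= sum(private_values) < p
    r = random.randrange(p)                  # P_1's uniform random mask
    D = (private_values[0] + r) % p          # P_1 sends d_1 + r mod p to P_2
    for d_i in private_values[1:]:           # each P_i adds its item, forwards
        D = (D + d_i) % p
    return (D - r) % p                       # back at P_1, removing r gives the sum

print(secure_sum([10, 20, 30, 40], p=1000))  # 100
```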