
MINISTRY OF EDUCATION AND TRAINING          MINISTRY OF NATIONAL DEFENSE

MILITARY INSTITUTE OF SCIENCE AND TECHNOLOGY
---------------

LUONG THE DUNG

DISTRIBUTED SOLUTIONS IN PRIVACY
PRESERVING DATA MINING
(Research on building several solutions for ensuring information
security in the data mining process)

Specialization: Mathematical assurance for computers and computing systems
Code: 62 46 35 01

DOCTORAL THESIS IN MATHEMATICS

Scientific supervisors:
1. Prof. Dr.Sc. Ho Tu Bao
2. Assoc. Prof. Dr. Bach Nhat Hong

Hanoi - 2011


Pledge
I declare that this thesis is a presentation of my original research work. All content is based on reliable references, such as papers published in distinguished international conferences and journals and books from widely known publishers. The results and discussions of this thesis are new and have not previously been published by any other author.




Contents

1  INTRODUCTION
   1.1  Privacy-preserving data mining: An overview
   1.2  Objectives and contributions
   1.3  Related works
   1.4  Organization of thesis

2  METHODS FOR SECURE MULTI-PARTY COMPUTATION
   2.1  Definitions
        2.1.1  Computational indistinguishability
        2.1.2  Secure multi-party computation
   2.2  Secure computation
        2.2.1  Secret sharing
        2.2.2  Secure sum computation
        2.2.3  Probabilistic public key cryptosystems
        2.2.4  Variant ElGamal cryptosystem
        2.2.5  Oblivious polynomial evaluation
        2.2.6  Secure scalar product computation
        2.2.7  Privately computing ln x

3  PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING
   3.1  Introduction
   3.2  Privacy preserving frequency mining in 2PFD setting
        3.2.1  Problem formulation
        3.2.2  Definition of privacy
        3.2.3  Frequency mining protocol
        3.2.4  Correctness analysis
        3.2.5  Privacy analysis
        3.2.6  Efficiency of frequency mining protocol
   3.3  Privacy preserving frequency-based learning in 2PFD setting
        3.3.1  Naive Bayes learning problem in 2PFD setting
        3.3.2  Naive Bayes learning protocol
        3.3.3  Correctness and privacy analysis
        3.3.4  Efficiency of naive Bayes learning protocol
   3.4  An improvement of frequency mining protocol
        3.4.1  Improved frequency mining protocol
        3.4.2  Protocol analysis
   3.5  Conclusion

4  ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY DISTRIBUTED DATA
   4.1  Introduction
   4.2  Problem formulation
        4.2.1  Association rules and frequent itemset
        4.2.2  Frequent itemset identifying in vertically distributed data
   4.3  Computational and privacy model
   4.4  Support count preserving protocol
        4.4.1  Overview
        4.4.2  Protocol design
        4.4.3  Correctness analysis
        4.4.4  Privacy analysis
        4.4.5  Performance analysis
   4.5  Support count computation-based protocol
        4.5.1  Overview
        4.5.2  Protocol design
        4.5.3  Correctness analysis
        4.5.4  Privacy analysis
        4.5.5  Performance analysis
   4.6  Using binary tree communication structure
   4.7  Privacy-preserving distributed Apriori algorithm
   4.8  Conclusion

5  PRIVACY PRESERVING CLUSTERING
   5.1  Introduction
   5.2  Problem statement
   5.3  Privacy preserving clustering for the multi-party distributed data
        5.3.1  Overview
        5.3.2  Private multi-party mean computation
        5.3.3  Privacy preserving multi-party clustering protocol
   5.4  Privacy preserving clustering without disclosing cluster centers
        5.4.1  Overview
        5.4.2  Privacy preserving two-party clustering protocol
        5.4.3  Secure mean sharing
   5.5  Conclusion

6  PRIVACY PRESERVING OUTLIER DETECTION
   6.1  Introduction
   6.2  Technical preliminaries
        6.2.1  Problem statement
        6.2.2  Linear transformation
        6.2.3  Privacy model
        6.2.4  Private matrix product sharing
   6.3  Protocols for the horizontally distributed data
        6.3.1  Two-party protocol
        6.3.2  Multi-party protocol
   6.4  Protocol for two-party vertically distributed data
   6.5  Experiments
   6.6  Conclusions

SUMMARY

Publication List

Bibliography


List of Phrases

Abbreviation   Full name
PPDM           Privacy Preserving Data Mining
k-NN           k-nearest neighbor
EM             Expectation-maximization
SMC            Secure Multiparty Computation
DDH            Decisional Diffie-Hellman
PMPS           Private Matrix Product Sharing
SSP            Secure Scalar Product
OPE            Oblivious Polynomial Evaluation
ICA            Independent Component Analysis
2PFD           2-part fully distributed setting
FD             fully distributed setting
≡c             computational indistinguishability


List of Tables

4.1  The communication cost
4.2  The complexity of the support count preserving protocol
4.3  The parties' time for the support count preserving protocol
4.4  The communication cost
4.5  The complexity of the support count computation protocol
4.6  The parties' time for the support count computation protocol
6.1  The parties' computational time for the horizontally distributed data
6.2  The parties' computational time for the vertically distributed data


List of Figures

3.1  Frequency mining protocol
3.2  The time used by the miner for computing the frequency f
3.3  Privacy preserving protocol of naive Bayes learning
3.4  The computational time for the first phase and the third phase
3.5  The time for computing the key values in the first phase
3.6  The time for computing the frequency f in the third phase
3.7  Improved frequency mining protocol
4.1  Support count preserving protocol
4.2  The support count computation protocol
4.3  Privacy-preserving distributed Apriori protocol
5.1  Privacy preserving multi-party mean computation
5.2  Privacy preserving multi-party clustering protocol
5.3  Privacy preserving two-party clustering
5.4  Secure mean sharing
6.1  Private matrix product sharing (PMPS)
6.2  Protocol for two-party horizontally distributed data
6.3  Protocol for multi-party horizontally distributed data
6.4  Protocol for two-party vertically distributed data


Chapter 1
INTRODUCTION

1.1. Privacy-preserving data mining: An overview

Data mining plays an important role in the modern world and provides a powerful tool for efficiently discovering valuable information from large databases [25]. However, the mining process can result in violations of privacy, so issues of privacy preservation in data mining are receiving more and more attention from the community [52]. As a result, a large number of studies have been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies deal with the problem of learning data mining models from databases while protecting data privacy at the level of individual records or at the level of organizations.

Basically, there are three major problems in PPDM [8]. First, organizations such as government agencies wish to publish their data for researchers and even the community, yet want to preserve data privacy, for example for highly sensitive financial and health data. Second, a group of organizations (or parties) wishes to jointly obtain the mining result on their combined data without disclosing any party's private information. Third, a miner wishes to collect data or obtain data mining models from individual users while preserving the privacy of each user. Consequently, PPDM can be divided into the three following areas, depending on the model of information sharing.

Privacy-preserving data publishing: The model in this research area consists of only one organization, the trusted data holder. This organization wishes to publish its data to the miner or the research community in such a way that the anonymized data remain useful for data mining applications. For example, some hospitals collect records from their patients for required medical services. These hospitals can act as trusted data holders; however, the patients may no longer trust them once the data are sent to the miner.
When anonymized data are published, there is much evidence that the published data may still lead to privacy breaches via certain attacks. One of them, called re-identification, was demonstrated in [61]: 87% of the American population have characteristics that allow them to be identified uniquely from several public attributes, namely zip code, date of birth, and sex. Consequently, privacy-preserving data publishing, which aims to prevent re-identification attacks while preserving the information in the released data that is useful for data mining applications, has received much attention in recent years.
The general technique is called k-anonymity [51], [82], [6]; its goal is to protect the released data against the possibility of re-identification. Consider a private data table from which the explicit identifiers (e.g., SSN and Name) have been removed. The values of other released attributes, such as ZIP, Date of birth, Marital status, and Sex, can also appear in external sources and may still be joined with individual users' identities. If some combination of values of these attributes occurs uniquely or rarely, then by observing the data one can determine the identity of a user, or deduce a limited candidate set containing the user. The goal of k-anonymity is that every tuple in the released private table be indistinguishable from at least k − 1 other tuples.
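As an illustration, the k-anonymity property above can be checked mechanically by grouping records on their quasi-identifier values. The following is a minimal sketch; the table, attribute names, and generalized ZIP codes are invented for illustration:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check that every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

# Toy released table: explicit identifiers already removed,
# ZIP codes generalized to their first three digits.
released = [
    {"zip": "100**", "sex": "M", "diagnosis": "flu"},
    {"zip": "100**", "sex": "M", "diagnosis": "cold"},
    {"zip": "113**", "sex": "F", "diagnosis": "flu"},
    {"zip": "113**", "sex": "F", "diagnosis": "asthma"},
]

print(is_k_anonymous(released, ["zip", "sex"], 2))  # True
print(is_k_anonymous(released, ["zip", "sex"], 3))  # False
```

Each record here shares its (ZIP, Sex) combination with one other record, so the table is 2-anonymous but not 3-anonymous.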
Privacy-preserving distributed data mining: This research area aims to develop distributed data mining algorithms that do not access the original data [33, 79, 35, 68, 80, 40]. Unlike privacy-preserving data publishing, each study in privacy-preserving distributed data mining typically solves a specific data mining task. The model in this area usually consists of several parties, each holding one private data set. The general purpose is to enable the parties to mine cooperatively on their joint data sets without revealing private information to the other participating parties. Here, the way the data are distributed among the parties also plays an important role in the problem being solved. Generally, data can be distributed either vertically or horizontally.
In the horizontal distribution, a data set is distributed across several parties, and every party holds data records with the same set of attributes. An example is the union of the customer databases of different banks. Typically, banks offer different services to their clients, such as savings accounts, credit cards, and stock investments. Suppose the banks wish to predict which customers are safe, which may be risky, and which may be fraudulent. Gathering all kinds of financial data about their customers and transactions can help them make these predictions and thereby prevent huge financial losses; reasonable mining techniques applied to the gathered data can generalize over these datasets and identify possible risks for future cases or transactions. More specifically, suppose a customer Nam goes to Bank A to apply for a loan to buy a car. He provides the necessary information to Bank A, and an expert system of Bank A can use the k-NN algorithm to classify Nam as either a risky or a safe customer. If this system uses only Bank A's own database, it could happen that Bank A does not have enough customers close to Nam, and the system may produce a wrong classification: for example, Nam is a safe customer, but the system labels him as risky, and Bank A loses profit. Clearly, mining a larger database can yield a more accurate classification. Thus, classification on the joint databases of Bank A and other banks might give a more accurate result, and Nam could have been classified as safe. The problem, however, is that privacy restrictions do not allow the banks to access each other's databases, and privacy-preserving distributed data mining can solve exactly this problem. This scenario is a typical case of so-called horizontally partitioned data: the banks do not need to reveal their databases to each other, yet they can still apply k-NN classification to their joint databases while preserving each bank's private information.
In the vertical distribution, a data set is distributed across several parties, and every party owns a vertical part of every record in the database (it holds the records for a subset of the attributes). For example, financial transaction information is collected by banks, while tax information is collected by the IRS. In [71], Jaideep Vaidya et al. give another illustrative example of two vertically distributed databases: one contains people's medical records, while the other contains cell phone information for the same set of people. Mining the joint global database might obtain information like "cell phones with Li/Ion batteries lead to brain tumors in diabetics."
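The two ways of partitioning can be sketched on a toy table; the records and attribute names below are invented for illustration:

```python
# A toy joint table: each row describes one individual.
records = [
    {"id": 1, "balance": 1200, "tax": 300},
    {"id": 2, "balance": 800,  "tax": 150},
    {"id": 3, "balance": 2500, "tax": 700},
    {"id": 4, "balance": 400,  "tax": 90},
]

# Horizontal partitioning: each party holds complete records
# for a disjoint subset of the individuals (e.g. two banks).
bank_a = records[:2]
bank_b = records[2:]

# Vertical partitioning: each party holds all individuals but only
# a subset of the attributes, linked by a common identifier
# (e.g. a bank holds balances, the tax agency holds tax data).
bank_view = [{"id": r["id"], "balance": r["balance"]} for r in records]
tax_view = [{"id": r["id"], "tax": r["tax"]} for r in records]
```

In the horizontal case the parties' schemas agree and their record sets are disjoint; in the vertical case the record sets agree and the attribute sets are disjoint apart from the join key.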
Privacy-preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users to learn data mining results based on their data, or simply collects the user data, while the sensitive attributes of these users need to be protected [74, 77, 19]. In this scenario, each user maintains only one data record. This can be thought of as a horizontally partitioned database in which each transaction is owned by a different user; it is also called the fully distributed setting (FD). Unlike privacy-preserving data publishing, the miner differs from the publisher in that he is untrusted, and thus could be an attacker who attempts to identify sensitive information from the user data. For example, Du et al. [19] studied building decision trees on private data. In this study, a miner wants to collect data from users, form a central database, and then conduct data mining on this database. He distributes a survey containing some questions; each user is required to answer the questions and send back the answers. However, the survey contains some sensitive questions, and users may not feel comfortable disclosing their answers. Thus, the problem is how the miner can obtain the mining model without learning sensitive information about the users. One requirement for methods in this area is that there be no interaction between users: each user communicates only with the data miner. Nevertheless, we must still be able to ensure that nothing about the sensitive data beyond the desired results is revealed to the data miner.


1.2. Objectives and contributions

Up to now, many solutions have become available for the issues in PPDM. The quality of each solution is evaluated on three basic characteristics: privacy degree, accuracy, and efficiency. The problem is that each solution applies only to a particular distributed scenario or to a concrete data mining algorithm. Although some of them can be applied to more than one scenario or algorithm, their accuracy falls below acceptable requirements; other solutions reach the required accuracy, but their privacy is poor. In addition, there is a clear lack of PPDM solutions for various practical contexts as well as for well-known data mining techniques. In this thesis, we aim at solving the following issues in PPDM:

1. To introduce a new scenario for privacy-preserving user data mining and find a good privacy solution for a family of frequency-based learning algorithms in this scenario.

2. To develop novel privacy-preserving techniques for popular data mining algorithms such as association rule mining and clustering methods.

3. To present a technique for designing protocols for privacy-preserving multivariate outlier detection in both the horizontally and the vertically distributed data models.

The developed solutions are evaluated in terms of degree of privacy protection, correctness, usability in real-life applications, efficiency, and scalability.

The contributions of this thesis are solutions to four problems in PPDM. Each problem has a statement independent of the others, but they share a common interpretation: given a data set distributed among several parties (or users), our task is to mine knowledge from all parties' joint data while preserving the privacy of each party. The differences among these problems lie in the various distributed data models (scenarios) and the various proposed functions that keep the parties' information private. In summary, our contributions in this thesis are as follows.


• In the first work (Chapter 3), we propose a new scenario for privacy-preserving user data mining, called the 2-part fully distributed setting (2PFD), and find a solution for a family of frequency-based learning algorithms in this setting. In 2PFD, the dataset is distributed across a large number of users, and each record is owned by two different users: one user knows the values of a subset of the attributes, and the other knows the values of the remaining attributes. A miner aims to learn, for example, classification rules on their data while preserving each user's privacy. In this work, we develop a cryptographic solution for frequency-based learning methods in 2PFD. The crucial step in the proposed solution is the privacy-preserving computation of the frequencies of a tuple of values in the users' data, which ensures each user's privacy without loss of accuracy. We illustrate the applicability of the method by using it to build a privacy-preserving protocol for naive Bayes classifier learning, and briefly address the solution in other applications. Experimental results show that our protocol is efficient.
• The second contribution of this thesis (Chapter 4) is a set of novel protocols for privacy-preserving frequent itemset mining in vertically distributed data. These protocols allow a group of parties to cooperatively mine frequent itemsets in a distributed setting without revealing any party's portion of the data to the others. The important security property of our protocols improves on that of previous protocols in that we achieve full privacy protection for each party: this property does not require the existence of any trusted party, and no collusion of parties can cause privacy breaches.
• In the third work (Chapter 5), we present an expectation-maximization (EM) mixture model clustering method for distributed data that preserves the privacy of the participating parties' data. First, a privacy-preserving EM-based clustering method for multi-party distributed data is proposed. Unlike the existing method, our method does not reveal the sums of the numerators and denominators in the secure computation of the EM algorithm's parameters; therefore, the proposed method is more secure, and it allows the number of participating parties to be arbitrary. Second, we propose a better method for the case in which the dataset is horizontally partitioned into only two parts; this method allows computing the covariance matrices and the final results without revealing the private information or the means. To this end, we present a protocol based on oblivious polynomial evaluation and the secure scalar product that addresses subproblems such as the computation of the means, the covariance matrices, and the posterior probabilities. This approach allows two or more parties to cooperatively conduct clustering on their joint data sets without disclosing any party's private data to the others.
• In the fourth work (Chapter 6), we study the setting where several parties, each with a private data set, want to conduct outlier detection on their joint data set, but none of them wants to disclose its private data to the other parties. We propose a linear transformation technique to design protocols for secure multivariate outlier detection in both the horizontally and the vertically distributed data models. Unlike most previous privacy-preserving techniques, which target distance-based outlier detection, our focus is on non-distance-based techniques for detecting outliers in statistics.

1.3. Related works

Recently, many solutions have been proposed for PPDM. These solutions can be categorized into two main approaches: the secure multiparty computation (SMC) approach and the randomization approach.

The basic idea of the randomization approach is to perturb the original (private) dataset and release the result for data analysis. The perturbation has to ensure that the original individual data values cannot be recovered, while preserving the utility of the data for statistical properties, thus allowing patterns in the original data to be mined. There are two main perturbation techniques: random transformation and randomization. The first transforms each data value (record) into a random value (record) from the same domain as the original data, in a way that preserves certain statistics but hides the real values [21, 4, 19, 3]. The second adds noise to the data to prevent discovery of the real values. Randomization has to ensure that, given the distribution of the noise added to the data and the randomized data set, the distribution (but not the actual data values) of the original data set can be reconstructed [1, 36, 16].
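A minimal sketch of the noise-addition variant described above; the value ranges and sample size are illustrative, not taken from the cited works:

```python
import random

random.seed(1)

# True ages of 10,000 drivers (all at least 18), held privately by the users.
true_ages = [random.randint(18, 70) for _ in range(10_000)]

# Each user adds noise drawn uniformly from [-10, 10] before submitting;
# the miner sees only the randomized values, never the true ones.
randomized = [age + random.randint(-10, 10) for age in true_ages]

# Individual values are hidden, but since the noise distribution is known
# (uniform with mean 0), aggregate statistics can still be estimated.
true_mean = sum(true_ages) / len(true_ages)
estimated_mean = sum(randomized) / len(randomized)
print(abs(true_mean - estimated_mean))  # small: the noise averages out
```

Full distribution reconstruction (as in the Bayesian and EM methods cited below) goes further than this mean estimate, but rests on the same principle: the noise distribution is public while the individual noise values are not.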
A typical example is the algorithm of Agrawal and Srikant [2], in which the values of an attribute are discretized into intervals and each original value is assigned to an interval. The original data distribution is then reconstructed by a Bayesian approach, and decision trees can be induced from the reconstructed distribution. Many other distribution reconstruction methods have also been introduced. In [1], Agrawal et al. developed an approach based on expectation maximization that also gave a better definition of privacy and an improved algorithm. Evfimievski et al. [21] used a similar technique for association rule mining, and Polat et al. proposed a privacy-preserving collaborative filtering method using randomized techniques [53].
Although perturbation techniques are very efficient, their use generally involves a tradeoff between privacy and accuracy: if we require more privacy, the miner loses more accuracy in the data mining results, and vice versa. Even the techniques that allow us to reconstruct distributions can reveal information about the original data values. For example, consider randomizing an age attribute. In principle, there are no drivers under 18 in the general distribution of ages. Assume that randomization is implemented by adding noise chosen uniformly from the range [-10, 10]. The reconstructed distribution does not show any true age value; it tells us only, for instance, that a driver whose randomized age is 40 has a true age in the range [30, 50]. However, if an age value in the randomized set is 8, the true age must lie in [-2, 18]; since no driver is under 18, the driver whose age is given as 8 in the randomized data must be exactly 18 in the original data. Thus, some work has been done on measuring the privacy of randomization techniques, so that they can be applied carefully enough to obtain the desired privacy. Kargupta et al. [36] formally analyzed the privacy of randomization techniques and showed that in many cases they reveal private information. Evfimievski et al. [20] showed how to limit privacy breaches when using the randomization technique for privacy-preserving data mining.
Many privacy-preserving data mining algorithms based on SMC have been proposed as well. They can be described as a computational process in which a group of parties computes a function of their private inputs, where no party wants to disclose its own input to any other party. The secure multiparty computation framework was developed by Goldreich [24]. In this framework, multiparty protocols fall into either the semi-honest model or the malicious adversary model. In the semi-honest model, the parties are assumed to follow the protocol rules, but after the execution of the protocol has completed, they may still try to learn additional information by analyzing the messages received during the execution. In the malicious adversary model, it is assumed that the parties can execute arbitrary operations to damage the other parties; thus, protocol design in this model is much more difficult than in the semi-honest model. Currently, the semi-honest model is the one usually used in the context of privacy-preserving data mining. The formal definitions of SMC are stated in [24].
The secure multi-party computation problem was first proposed by Yao [78], who gave a method to solve Yao's Millionaires' problem: comparing the wealth of two millionaires without revealing any private information about either person. According to the theoretical studies of Goldreich, the general SMC problem can be solved by the circuit evaluation method; however, this solution is not practical in terms of efficiency. Therefore, finding efficient problem-specific solutions is seen as an important research direction. In recent years, many specific solutions have been introduced in different research areas such as information retrieval, computational geometry, and statistical analysis [17, 10, 70].
Randomization approaches [21, 4, 19, 3, 1, 36, 16] can be used in the fully distributed scenario, where a data miner wants to obtain classification models from the data of large sets of users. These users can simply randomize their data and then submit the randomized data to the miner, who can later reconstruct some useful information. However, to obtain strong privacy without loss of accuracy, SMC techniques have been proposed in [74, 77]. The key idea of these techniques is a private frequency computation method that allows a data miner to compute the frequencies of values or tuples in the FD setting while preserving the privacy of each user's data. In Chapter 3, we propose an SMC solution that allows the miner to learn frequency-based models in the 2PFD setting. Note that in this setting each user may know only some values of a tuple, but not all of them; therefore, the cryptographic approaches mentioned above cannot be used in the 2PFD setting. For the FD setting, other solutions based on k-anonymization of the users' data have been proposed in [83, 77]. The advantage of these solutions is that they do not depend on the underlying data mining task, because the anonymized data can be used for various data mining tasks without disclosing privacy. However, these solutions are inapplicable in the 2PFD setting, because the miner cannot link the two anonymous parts of one object with each other.
The SMC approach is also usually used for privacy-preserving distributed data mining, where data are distributed across several parties. The privacy property of privacy-preserving distributed data mining algorithms is thus quantified by the privacy definition of SMC, in which each party involved in the distributed protocols is allowed to learn only the desired data mining models and no other information. Generally, each protocol has to be designed for a specific task, for reasons of efficiency and privacy. Specific privacy-preserving distributed protocols have been proposed to address different data mining problems across distributed databases; e.g., in [70, 63, 18], privacy-preserving classification protocols for vertically distributed data were developed based on a secure scalar product method, including privacy-preserving protocols for learning naive Bayes classifiers, association rules, and decision trees. In [34], privacy-preserving naive Bayes classification was addressed for horizontally distributed data by computing the secure sum of the local frequencies of all participating parties. Our work in Chapter 4 presents frequent itemset mining protocols for vertically partitioned data. Distributed association rule/itemset mining has been addressed for both vertically partitioned data and horizontally partitioned data [33, 79, 35, 68, 80]. However, to the best of our knowledge, these protocols preserve the privacy of each party while resisting the collusion of at most n − 2 corrupted parties. Our protocols for privacy-preserving frequent itemset mining among multiple parties protect the privacy of each party against the collusion of up to n − 1 corrupted parties.
We now turn to related work on privacy-preserving distributed clustering. Privacy-preserving clustering problems have recently been studied by many authors. In [49] and [47], the authors focused on transformation techniques that enable the data owner to share the data with another party who will cluster it. Clifton and Vaidya proposed a secure multi-party computation of the k-means algorithm on vertically partitioned data [66]. In [29], the authors proposed a solution for privacy-preserving clustering on horizontally partitioned data, focusing primarily on hierarchical clustering methods that can both discover clusters of arbitrary shapes and handle different data types. In [59], Kruger et al. proposed a privacy-preserving, distributed k-means protocol on horizontally partitioned data whose key step is the privacy-preserving computation of cluster means. At each iteration of the algorithm, only the means are revealed to the parties; however, revealing the means might still allow the parties to learn some extra information about each other. To our knowledge, there is so far only one secure method for the expectation maximization (EM) mixture model on horizontally distributed sources [40], based on secure sum computation. However, this method requires at least three participating parties: because the global model is a sum of the local models, in the case of only two parties, which often happens in practice, each party could compute the other party's local model by subtracting its own local model from the global model. The aim of our work in Chapter 5 is
firstly to develop a more general and more secure protocol that allows an arbitrary number of participating parties. Secondly, we propose a better method for the case in which the dataset is horizontally partitioned into only two parts.
For privacy-preserving outlier detection, while there are many different definitions of outliers as well as techniques to find them, only a few methods have been developed in a privacy-preserving fashion, and these address distance-based outlier detection. There are other, non-distance-based techniques for detecting outliers in statistics [12], but there is still no work on finding such outliers in a privacy-preserving fashion [3]. The Mahalanobis distance has been used in several works for outlier detection [10], [12]. In Chapter 6, we propose solutions for privacy-preserving outlier detection in both vertically and horizontally distributed data. Our work is related to the secure sound classification in [57], in which a single Gaussian model is used for classification. However, that work addresses a two-party secure classification scenario, in which the parties engage in a protocol that allows one party to classify her data using the other party's classifier without revealing any of her private information, while learning nothing about the classifier. Therefore, our purpose and method are different from that work.

1.4. Organization of thesis

The thesis consists of six chapters on 109 A4 pages. Chapter 1 presents an overview of PPDM and related works. Chapter 2 presents the basic definitions of secure multi-party computation and the techniques we frequently use. Chapter 3 proposes privacy-preserving frequency-based learning algorithms in 2PFD. Chapter 4 presents two privacy-preserving algorithms for distributed mining of frequent itemsets. Chapter 5 discusses privacy-preserving EM-based clustering protocols. Chapter 6 presents privacy-preserving outlier detection for both vertically distributed data and horizontally distributed data. A summary of the thesis is presented in the last section.



Chapter 2

METHODS FOR SECURE MULTI-PARTY COMPUTATION
In this thesis, we use secure multi-party computation (SMC) and cryptographic tools as building blocks to design privacy-preserving data mining protocols. Before going into detail, this chapter first reviews some important definitions of SMC, and then summarizes the techniques that will be used in the following chapters.

2.1. Definitions

2.1.1. Computational indistinguishability
In this section, we review basic definitions from computational complexity
theory that will be used in this thesis [24].
The following is the standard definition of a negligible function.
Definition 2.1. Let N be the set of natural numbers. We say a function ϵ(·) : N → (0, 1] is negligible in n if for every positive polynomial poly(·) there exists an integer n0 > 0 such that for all n > n0,

ϵ(n) < 1/poly(n).
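For instance, ϵ(n) = 2^(−n) is negligible: it eventually drops below 1/poly(n) for every fixed polynomial. The following numeric sanity check, which we add here purely as an illustration for the particular choice poly(n) = n^10, finds the threshold n0:

```python
# Find the threshold n0 beyond which eps(n) = 2^-n stays below 1/n^10.
# (The ratio n^10 / 2^n is decreasing once n > 10/ln 2, so a single
# crossing point exists and the search below finds it.)
def eps(n):
    return 2.0 ** (-n)

def poly(n):
    return n ** 10

n0 = next(n for n in range(1, 200)
          if all(eps(m) < 1.0 / poly(m) for m in range(n, 200)))
print(n0)  # prints 59: for all n >= 59, 2^-n < 1/n^10
```

A larger-degree polynomial would only push the threshold n0 further out; the inequality still holds for all sufficiently large n, which is exactly what the definition requires.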

Computational indistinguishability is another important concept when discussing the security properties of distributed protocols [24]. Let X = {Xn}n∈N be an ensemble indexed by a security parameter n (which usually refers to the length of the input), where the Xn are random variables.
Definition 2.2. Two ensembles, X = {Xn}n∈N and Y = {Yn}n∈N, are computationally indistinguishable in polynomial time if for every probabilistic polynomial-time algorithm A,

|Pr(A(Xn) = 1) − Pr(A(Yn) = 1)|

is a negligible function in n. In such a case, we write X ≡c Y, where ≡c denotes computational indistinguishability.

2.1.2. Secure multi-party computation

This section reviews the secure multi-party computation framework developed by Goldreich [24].
Secure multiparty computation function
Consider a distributed network with n participating parties. A secure n-party computation problem can generally be considered as the computation of a function

f(x1, x2, ..., xn) → (f1(x1, x2, ..., xn), ..., fn(x1, x2, ..., xn)),

where each party i knows only its private input xi. For security, it is required that the privacy of every honest party's input is protected, in the sense that each dishonest party i learns nothing except its own output yi = fi(x1, x2, ..., xn). If any malicious party may deviate from the protocol, it is also required that each honest party gets a correct result whenever possible.
Privacy in the semi-honest model
In the distributed setting, let π be an n-party protocol for computing f, and let x denote (x1, ..., xn). The view of the i-th party (i ∈ [1, n]) during an execution of π on x is denoted by view_i^π(x); it includes xi, all received messages, and all internal coin flips. For every subset I of [1, n], namely I = {i1, ..., it}, let f_I(x) denote (y_{i1}, ..., y_{it}) and view_I^π(x) = (I, view_{i1}^π(x), ..., view_{it}^π(x)). Let OUTPUT(x) denote the output of all parties during the execution of π.


Definition 2.3. An n-party computation protocol π for computing f(·, ..., ·) is secure with respect to semi-honest parties if there exists a probabilistic polynomial-time algorithm S, such that for every I ⊂ [1, n] we have

{S(x_{i1}, ..., x_{it}, f_I(x)), f(x)} ≡c {view_I^π(x), OUTPUT(x)}.
This definition states that the view of the parties in I can be simulated from only those parties' inputs and outputs. If the function is privately computed by the protocol, then the privacy of each party's input data is protected. In this thesis, we focus on designing privacy-preserving protocols in the semi-honest model; the formal definition of a secure protocol in the malicious model can be found in [24].
We also use the composition theorem for the semi-honest model, whose discussion and proof can be found in [24]. The composition theorem states that a protocol can be decomposed into several sub-protocols, and the security of the protocol is then proved by showing that each of its sub-protocols is secure.
Theorem 2.1 (Composition theorem). Suppose that g is privately reducible
to f , and that there exists a protocol for privately computing f . Then there
exists a protocol for privately computing g.

2.2. Secure computation

2.2.1. Secret sharing


Secret sharing refers to any method by which a secret is shared among multiple parties in such a way that no single party knows the secret, but the secret is easily reconstructed by combining the shares of some of the parties.
In a two-party case, Alice and Bob share a value z in such a way that Alice holds (x, n), Bob holds (y, m), and z is equal to (x + y)/(m + n). This is called secret mean sharing. The result of sharing allows Alice and Bob to obtain random values rA and rB, respectively, where rA + rB = z. The protocol for this problem will be described in Chapter 5.
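As a minimal illustration of the intended outcome (a trusted single-process simulation of the target functionality only, not the two-party protocol itself, which is deferred to Chapter 5; the share range below is an arbitrary choice):

```python
import random

def share_mean(x, n, y, m):
    # Alice holds (x, n), Bob holds (y, m); the shared value is the mean
    # z = (x + y) / (m + n). Alice ends up with a random r_a and Bob with
    # the complementary r_b, so that r_a + r_b = z.
    z = (x + y) / (m + n)
    r_a = random.uniform(-100.0, 100.0)  # Alice's random share
    r_b = z - r_a                        # Bob's complementary share
    return r_a, r_b

# Neither share alone reveals z, but together they sum to it:
r_a, r_b = share_mean(x=30.0, n=4, y=10.0, m=6)
print(r_a + r_b)  # (30 + 10) / (4 + 6) = 4.0
```

In the actual protocol no single party may compute z directly as done here; the simulation only shows the relationship the shares must satisfy at the end.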
Shamir secret sharing is a threshold scheme [56]. In Shamir secret sharing, there are n parties and a polynomial P of degree k − 1 such that P(0) = S, where S is the secret. Each of the n parties holds a point on the polynomial P. Because any k points (xi, yi) (i = 1, ..., k) uniquely define a polynomial P of degree k − 1, a subset of at least k parties can reconstruct the secret S by polynomial interpolation, while fewer than k parties cannot. This scheme is also called (n, k) Shamir secret sharing.
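The scheme can be sketched compactly over a prime field (the prime and the parameters below are illustrative choices, assuming the secret is smaller than the prime):

```python
import random

P_PRIME = 2**31 - 1  # a Mersenne prime, chosen here purely for illustration

def share(secret, n, k, p=P_PRIME):
    # Random polynomial of degree k-1 with constant term P(0) = secret;
    # each party i receives the point (i, P(i)).
    coeffs = [secret] + [random.randrange(p) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(points, p=P_PRIME):
    # Lagrange interpolation evaluated at x = 0 recovers P(0) = secret.
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % p
                den = (den * (xi - xj)) % p
        secret = (secret + yi * num * pow(den, -1, p)) % p
    return secret

shares = share(12345, n=5, k=3)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares recover 12345
```

Any subset of k = 3 points yields the same constant term, while any 2 points are consistent with every possible secret, which is exactly the threshold property described above.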

2.2.2. Secure sum computation

A simple example of efficient SMC that illustrates the idea of privacy-preserving computation is the secure sum protocol [10].
Assume that there are n parties P1, P2, ..., Pn such that each Pi has a private data item di. The parties wish to compute ∑_{i=1}^{n} d_i without revealing their private data d_i to each other. We assume that ∑_{i=1}^{n} d_i is in the range [0, p]. In the secure sum protocol, one party is designated as the master party and is given the identity P1. At the beginning, P1 chooses a uniform random number r from [0, p] and then sends the sum D = (d1 + r) mod p to party P2. Since the value of r is chosen uniformly from [0, p], the number D is also distributed uniformly across this range, so P2 learns nothing about the actual value of d1.
Each remaining party Pi (i = 2, ..., n) does the following: it receives

D = (r + ∑_{l=1}^{i−1} d_l) mod p.

Since this value is uniformly distributed across [0, p], party Pi learns nothing. Party Pi then computes

(d_i + D) mod p = (r + ∑_{l=1}^{i} d_l) mod p

and passes it to party (i + 1) mod n.
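The round-trip can be simulated end to end in a single process (a sketch only: in the real protocol each addition is performed by a different party, and only the masked running total travels around the ring):

```python
import random

def secure_sum(values, p):
    # The master party P1 masks its item with a uniform random r
    # and sends (d1 + r) mod p to the next party.
    r = random.randrange(p)
    running = (values[0] + r) % p
    # Each remaining party Pi adds its own d_i mod p and passes it on;
    # every intermediate value it sees is uniformly distributed in [0, p).
    for d in values[1:]:
        running = (running + d) % p
    # The masked total returns to P1, which subtracts r to recover the sum.
    return (running - r) % p

print(secure_sum([13, 7, 22, 5], p=1000))  # 47
```

The correctness requirement is that the true sum lies in [0, p), as assumed in the protocol description; the random mask r then cancels exactly and each intermediate message reveals nothing on its own.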