35.2.1 Computation-Driven Protection Procedures: the Cryptographic Approach
As stated above, cryptographic protocols are often applied in applications where the analysis
(or function) to be computed from the data is known in advance. In fact, they are usually applied
in scenarios with multiple data sources. We illustrate this scenario below with an example.
Example 1. Parties P_1, ..., P_n own databases DB_1, ..., DB_n. The parties want to compute a
function, say f, of these databases (i.e., f(DB_1, ..., DB_n)) without revealing unnecessary
information. In other words, after computing f(DB_1, ..., DB_n) and delivering this result to all
P_i, what P_i knows is nothing more than what can be deduced from his DB_i and the function f.
So, the computation of f has not given P_i any extra knowledge.
Distributed privacy-preserving data mining is based on secure multiparty computation,
which was introduced by A. C. Yao in 1982 (Yao, 1982). For example, (Lindell and Pinkas,
2000) and (Lindell and Pinkas, 2002) defined a method based on cryptographic tools for
computing a decision tree from two data sets owned by two different parties, and (Bunn and
Ostrovsky, 2007) discusses clustering data from different parties.
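To make the flavor of such protocols concrete, the following is a minimal sketch of a
secure-sum protocol based on additive secret sharing. It only illustrates the multiparty idea,
not the decision-tree or clustering protocols cited above; the modulus, the function names,
and the simulation of all parties inside one process are assumptions of the sketch.

```python
import random

# Toy additive secret sharing modulo a large prime: each party splits
# its private value into n random shares that sum to the value; only
# the total over all shares of all parties is ever reconstructed.
PRIME = 2**61 - 1  # assumed modulus; all arithmetic is mod PRIME

def make_shares(value, n_parties):
    """Split a private value into n_parties additive shares."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values):
    """Every party learns the sum, but no individual input value."""
    n = len(private_values)
    # Party i sends the j-th share of its value to party j.
    received = [[] for _ in range(n)]
    for v in private_values:
        for j, s in enumerate(make_shares(v, n)):
            received[j].append(s)
    # Each party publishes only the sum of the shares it received.
    partial = [sum(r) % PRIME for r in received]
    return sum(partial) % PRIME

print(secure_sum([12, 30, 7]))  # 49: the total, with no input revealed
```

Each party sees only uniformly random shares of the other parties' values, so the protocol
output (the sum) is the only new information it obtains.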
When data is represented in terms of records and attributes, two typical scenarios are
considered in the literature: vertical partitioning of the data and horizontal partitioning. They
are as follows.
• Vertically partitioned data. All data owners share the same records, but different data
owners have information about different attributes (i.e., different data owners have different
views of the same records or individuals).
• Horizontally partitioned data. All data owners have information about the same attributes,
but the records or individuals included in their databases are different.
As stated above, for both centralized and distributed PPDM the only information that a
data owner should learn is what can be inferred from his original data and the final computed
analysis. In this setting, the centralized approach is considered as a reference result when
analyzing the privacy of the distributed approach. Privacy leakage for the distributed
approach is usually analyzed considering two types of adversaries.
• Semi-honest adversaries. Data owners follow the cryptographic protocol but they analyse
all the information they get during its execution to discover as much information as
they can.
• Malicious adversaries. Data owners try to fool the protocol (e.g. aborting it or sending
incorrect messages on purpose) so that they can infer confidential information.
Computation-driven protection procedures using cryptographic approaches present some
clear advantages with respect to general purpose ones. The first one is the good quality of the
computed function (analysis). That is, the function we compute is exactly the one the users
want to compute. This is not so, as we will see later, when other general purpose protection
methods are used. In this latter case, the resulting function is just an approximation of the
function we would compute from the original data. At the same time, cryptographic tools
ensure an optimal level of privacy.
Nevertheless, this approach has some limitations. The first one is that we need to know
beforehand the function (or analysis) to be computed. As different functions lead to different
cryptographic protocols, any change to the function to be computed (even a small one)
requires a redefinition of the protocol. A second disadvantage is that the computational costs
of the protocols are very high, and they are even higher when malicious adversaries are
considered. (Kantarcioglu, 2008) discusses other limitations. One is that most literature only
considers the types of adversaries described above (honest, semi-honest and malicious); no
other types are studied. Another one is the fact that in these methods no trade-off can be found
between privacy and information loss (they use the term accuracy). As we will see later in
Sections 35.5 and 35.6, most general purpose protection procedures permit the user to select
an appropriate trade-off between these two contradictory issues. When using cryptographic
protocols, the only trade-off that can be implemented easily is the one between privacy and
efficiency.
35.2.2 Data-driven Protection Procedures
Given a data set, data-driven protection procedures construct a new data set that does not
permit a third party to infer confidential information present in the original data.
Different methods have been developed for this purpose. We will focus on the case where the
data set is a standard file defined in terms of records and attributes (microdata in the
jargon of statistical disclosure control). As stated above, we can also consider other types of
data sets, e.g. aggregate data (tabular data in the jargon of SDC).
All data-driven methods are similar in the sense that they construct the new data set by
reducing the quality of the original one. As this quality reduction might make the data
unsuitable for a particular analysis, measures have been developed to evaluate to what extent
the protected data set is still valid. These measures are known as information loss measures
or utility measures.
Data-driven procedures are much more efficient, with respect to computational cost, than
the ones using cryptographic protocols. Nevertheless, this greater efficiency comes at the cost
of not ensuring complete privacy. Due to this, measures of risk have been developed to
determine to what extent a protected data set ensures privacy. These measures are known as
disclosure risk measures.
These two families of measures, information loss and disclosure risk, are in contradiction
and, thus, methods should look for an appropriate trade-off between risk and utility. Tools
have been developed to visualize and quantify this trade-off, so that protection methods can
be compared.
We will present some of the data protection procedures in Section 35.4, information loss
measures in Section 35.5, and visualization methods in Section 35.6. Section 35.3 includes a
description of the standard scenario for evaluating risk before reviewing disclosure risk
measures.
35.3 Disclosure Risk Measures
Disclosure risk is defined in terms of the additional confidential information (in general,
additional knowledge) that an intruder can acquire from the protected data set. According to
(Lambert, 1993, Paass, 1985), disclosure risk can be studied from two perspectives:
• Identity disclosure. This disclosure takes place when a respondent is linked to a particular
record in the protected data set. This process of linking is known as re-identification (of
the respondent).
• Attribute disclosure. In this case, defining disclosure as the disclosure of the identity
of the individual is considered too strong. Disclosure takes place when the intruder can
learn something new about an attribute of a respondent, even when no relationship can
be established between the individual and the data. That is, disclosure takes place when
the published data set permits the intruder to increase his accuracy on an attribute of the
respondent. This approach was first formulated in (Dalenius, 1977) (see also (Duncan and
Lambert, 1986) and (Duncan and Lambert, 1989)).
Interval disclosure, proposed in (Domingo-Ferrer et al., 2001) and (Domingo-Ferrer and
Torra, 2001b), is a measure for attribute disclosure. It is defined according to the following
procedure. Each attribute is independently ranked and a rank interval is defined around the
value the attribute takes on each record. The ranks of values within the interval for an
attribute around record r should differ less than p percent of the total number of records,
and the rank in the center of the interval should correspond to the value of the attribute in
record r. Then, the proportion of original values that fall into the interval centered around
their corresponding protected value is a measure of disclosure risk. A 100 percent proportion
means that an attacker is completely sure that the original value lies in the interval around
the protected value (interval disclosure).
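The following is a minimal sketch of this rank-based measure for a single numeric attribute,
under the assumption (ours, for illustration) that the original and protected files are aligned
record by record and that p is given as a fraction of the number of records.

```python
import numpy as np

def interval_disclosure(original, protected, p=0.05):
    """Sketch of rank-based interval disclosure for one numeric attribute.

    original, protected: aligned 1-D arrays (same record order).
    p: rank half-width of the interval, as a fraction of the records.
    Returns the proportion of original values falling inside the rank
    interval centred on their protected value (higher = more risk).
    """
    n = len(protected)
    w = int(np.ceil(p * n))             # rank half-width
    order = np.argsort(protected)       # sort protected values once
    sorted_prot = protected[order]
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)          # rank of each protected value
    hits = 0
    for i in range(n):
        lo = max(rank[i] - w, 0)
        hi = min(rank[i] + w, n - 1)
        # value interval spanned by the ranks around record i
        if sorted_prot[lo] <= original[i] <= sorted_prot[hi]:
            hits += 1
    return hits / n

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 1000)              # invented original attribute
x_masked = x + rng.normal(0, 2, 1000)     # e.g. masked by additive noise
print(interval_disclosure(x, x_masked, p=0.05))
```

Higher proportions indicate higher attribute disclosure risk; p controls the width of the interval.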
Identity disclosure has received much attention in recent years and has been used to
evaluate different protection methods. Its formulation needs a concrete scenario, which we
present in the next section. Some identity disclosure risk measures will then be reviewed
using this scenario.
35.3.1 A Scenario for Identity Disclosure
The typical scenario considers the protected data set and an intruder having some partial
information about the individuals in the published data set. The protected data set is assumed
to be a data file, and it is usual to consider that the intruder's information can be represented
in the same way. See e.g. (Sweeney, 2002, Torra et al., 2006). Formally, we consider data sets
X with the usual structure of r rows (records) and k columns (attributes). Naturally, each row
contains the values of the attributes for an individual.
Then, the attributes in X can be classified (Dalenius, 1986, Samarati, 2001, Torra et al.,
2006) into three non-disjoint categories.
• Identifiers. These are attributes that unambiguously identify the respondent. Examples
are passport number, social security number, full name, etc.
• Quasi-identifiers. These are attributes that, in combination, can be linked with external
information to re-identify some of the respondents. Examples are age, birth date, gender,
job, zipcode, etc. Although a single attribute cannot identify an individual, a subset of
them can.
• Confidential. These are attributes which contain sensitive information on the respondent.
For example, salary, religion, political affiliation, health condition, etc.
Using these three categories, an original data set X is defined as X = id || X_nc || X_c, where
id are the identifiers, X_nc are the non-confidential quasi-identifier attributes, and X_c are
the confidential attributes. Let us consider the protected data set X'. X' is obtained from the
application of a protection procedure to X. This process takes into account the type of the
attributes. It is usual to proceed as follows.
• Identifiers. To avoid disclosure, identifiers are usually removed or encrypted in a
preprocessing step. In this way, information cannot be linked to specific respondents.
• Confidential. These attributes X_c are usually not modified. So, we have X'_c = X_c.
• Quasi-identifiers. They cannot be removed, as almost all attributes can be quasi-identifiers.
The usual approach to preserve the privacy of the individuals is to apply protection
procedures to these attributes. We will use ρ to denote the protection procedure. Therefore,
we have X'_nc = ρ(X_nc).
Thus, X' = ρ(X_nc) || X_c. Proceeding in this way, we allow third parties to have precise
information on confidential data without revealing to whom the confidential data belongs.
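A small sketch of this construction follows. The column names and the coarsening-based ρ are
illustrative assumptions of ours, not a fixed schema; any masking method from Section 35.4
could play the role of ρ.

```python
import pandas as pd

# Illustrative attribute groups (assumed, not a fixed schema).
identifiers = ["name", "passport"]
quasi_identifiers = ["age", "zipcode", "job"]
confidential = ["salary"]

def protect(X, rho):
    """Build X' = rho(X_nc) || X_c: drop the identifiers, mask the
    quasi-identifiers, and keep the confidential attributes intact."""
    return pd.concat([rho(X[quasi_identifiers]), X[confidential]], axis=1)

# A placeholder rho for illustration: coarsen age to 10-year bands.
def rho(X_nc):
    return X_nc.assign(age=(X_nc["age"] // 10) * 10)

X = pd.DataFrame({
    "name": ["Ann", "Bob"], "passport": ["P1", "P2"],
    "age": [34, 61], "zipcode": ["08001", "08002"],
    "job": ["nurse", "clerk"], "salary": [32000, 41000],
})
print(protect(X, rho))
```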
In this scenario we have identity disclosure when an intruder, having some information
described in terms of a set of records and some quasi-identifiers, can link his information with
the published data set. That is, he is able to link his records with the ones in the protected data
set. Then, if the links between records are correct, he will be able to obtain the right values for
the confidential attributes.
Figure 35.1 represents this situation. A represents the file with data from the protected data
set (i.e., containing records from X') and B represents the file with the records of the intruder.
B is usually defined in terms of the original data set X, because it is assumed that the intruder
has a subset of X. In general, the number of records owned by the intruder and the number of
records in the protected data file will differ.
Re-identification is achieved using some quasi-identifiers common to both X and X'. They
permit linking pairs of records (using record linkage algorithms) from both files, and then the
confidential attributes are linked to the identifiers. At this point re-identification is achieved.
Formally, following (Torra et al., 2006, Nin et al., 2007, Sweeney, 2002) and the notation
in Figure 35.1, the intruder is assumed to know the non-confidential quasi-identifiers
X_nc = {a_1, ..., a_n} together with the identifiers Id = {i_1, i_2, ...}. Then, the linkage is
between the quasi-identifiers (a_1, ..., a_n) from the protected data (X'_nc) and the same
attributes from the intruder (X_nc).
35.3.2 Measures for Identity Disclosure
Two main approaches exist for measuring identity disclosure risk, known as uniqueness and
re-identification. We describe them below.
• Re-identification. Risk is defined as an estimation of the number of re-identifications
that might be obtained by an intruder. This estimation is obtained empirically through
record linkage algorithms. This approach for measuring disclosure risk goes back, at
least, to (Spruill, 1983) and (Paass, 1985) (using e.g. the algorithm described in (Paass
and Wauschkuhn, 1985)). (Torra et al., 2006, Nin et al., 2007, Sweeney, 2002) are more
recent papers using this approach. This approach is general enough to be applied in
different contexts. It can be applied under different assumptions of intruder's knowledge,
and under different assumptions on protection procedures. It can even be applied when
protected data has been generated using a synthetic data generator (i.e., data is constructed
using a particular data model; see Section 35.4.3 for details). For example, (Torra et al.,
2006) describes empirical results on the use of record linkage algorithms on synthetic data
and discusses the performance of different algorithms. (Winkler, 2004) considers a similar
problem.
• Uniqueness. Informally, the risk of identity disclosure is measured as the probability that
rare combinations of attribute values in the protected data set are indeed rare in the original
population.
This approach is typically used when data is protected using sampling (Willenborg, 2001)
(i.e., X' is just a subset of X). Note that with perturbative methods it makes no sense to
investigate the probability that a rare combination of protected values is rare in the original
data set, because that combination is most probably not found in the original data set.

Fig. 35.1. Disclosure Risk Scenario. [Figure: the protected/public file A (records described by
quasi-identifiers a_1, ..., a_n and confidential attributes) and the intruder's file B (records
described by the same quasi-identifiers plus identifiers i_1, i_2, ...) are connected through
record linkage on the shared quasi-identifiers, which yields the re-identification.]

In the next sections we describe these two approaches in more detail.
Uniqueness
Two types of disclosure risk measures based on uniqueness can be distinguished: file-level and
record-level. We describe them below.
• File-level uniqueness. Disclosure risk is defined as the probability that a sample unique
(SU) is a population unique (PU) (Elliot et al., 1998). According to (Elamir, 2004), this
probability can be computed as

P(PU | SU) = P(PU, SU) / P(SU) = Σ_j I(F_j = 1, f_j = 1) / Σ_j I(f_j = 1)

where j = 1, ..., J denotes the possible key values in the sample, F_j is the number of
individuals in the population with key value j (the frequency of j in the population), f_j is
the same frequency for the sample, and I stands for the indicator of the selection (so each
sum counts the values satisfying its condition). Unless the sample size is much smaller than
the population size, P(PU | SU) can be dangerously high; in that case, an intruder who
locates a unique value in the released sample can be almost certain that there is a single
individual in the population with that value, which is very likely to lead to that individual's
identification. (Code sketches of both uniqueness measures are given after this list.)
• Record-level uniqueness. These measures are also known as individual risk measures.
Disclosure risk is defined as the probability that a particular sample record is re-identified,
i.e. recognized as corresponding to a particular individual in the population. As (Elliot,
2002) points out, the main rationale behind this approach is that risk is not homogeneous
within a data file. We summarize next the description given in (Franconi and Polettini,
2004) of record-level risk estimation.
Assume that there are K possible combinations of key attributes. These combinations
induce a partition both in the population and in the released sample. If the frequency of
the k-th combination in the population were known to be F_k, then the individual disclosure
risk of a record in the sample with the k-th combination of key attributes would be 1/F_k.
Since the population frequencies F_k are generally unknown but the sample frequencies f_k
of the combinations are known, the distribution of the frequencies F_k given f_k is considered.
Under reasonable assumptions, the distribution of F_k | f_k can be modeled as a negative
binomial. The per-record risk of disclosure is then measured as the posterior mean of
1/F_k with respect to the distribution of F_k | f_k.
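As announced above, both uniqueness measures admit compact sketches. The first function
computes the empirical P(PU | SU), assuming (unrealistically, for illustration) that the
population key values are available; the second estimates the per-record risk by Monte Carlo
under our own simplified assumption that the excess F_k − f_k follows a negative binomial
NB(f_k, π), with π the sampling fraction — the cited works derive the exact model and its
parameters from the survey design.

```python
import numpy as np
from collections import Counter

def p_pu_given_su(population_keys, sample_keys):
    """File-level uniqueness: empirical P(PU | SU), i.e. the fraction
    of sample uniques that are also population uniques."""
    F = Counter(population_keys)   # population frequencies F_j
    f = Counter(sample_keys)       # sample frequencies f_j
    sample_uniques = [k for k, c in f.items() if c == 1]
    if not sample_uniques:
        return 0.0
    return sum(1 for k in sample_uniques if F[k] == 1) / len(sample_uniques)

def individual_risk(f_k, sampling_fraction, n_sim=100_000, seed=0):
    """Record-level risk: Monte Carlo estimate of E[1 / F_k | f_k].

    Assumption (ours, for illustration): the unseen population excess
    F_k - f_k follows a negative binomial NB(f_k, sampling_fraction)."""
    rng = np.random.default_rng(seed)
    F_k = f_k + rng.negative_binomial(f_k, sampling_fraction, size=n_sim)
    return float(np.mean(1.0 / F_k))

# Keys are tuples of key-attribute values, e.g. (age, zipcode).
population = [(34, "08001"), (34, "08001"), (61, "08002"), (45, "08003")]
sample = [(34, "08001"), (61, "08002")]
print(p_pu_given_su(population, sample))               # 0.5: one of two SUs is a PU
print(individual_risk(f_k=1, sampling_fraction=0.01))  # low risk: small sample
print(individual_risk(f_k=1, sampling_fraction=0.90))  # high risk: near-census
```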
Record Linkage
This approach for measuring disclosure risk directly follows the scenario in Figure 35.1. That
is, record linkage consists of linking each record b of the intruder (file B) to a record a in
the protected file A. The pair (a, b) is a match if b turns out to be the original record
corresponding to a. For applying record linkage, the common approach is to use the shared
attributes (some quasi-identifiers). As the number of matches is an estimation of the number of
re-identifications that an intruder can achieve, disclosure risk is defined as the proportion of
matches among the total number of records in B.
Two main types of record linkage algorithms are described in the literature: distance-based
and probabilistic. They are outlined below. For details on these methods see (Torra and
Domingo-Ferrer, 2003).
• Distance-based record linkage. Each record b in B is linked to its nearest record a in A.
An appropriate definition of a record-level distance has to be supplied to the algorithm
to express nearness. This distance is usually constructed from distance functions defined
at the level of attributes. In addition, we need to standardize attributes as well as assign
weights to them. (A minimal sketch of this approach is given after this list.)
(Pagliuca et al., 1999) proposed distance-based record linkage to assess the disclosure risk
of microaggregation. They used the Euclidean distance and equal weights for all attributes.
Later, in (Domingo-Ferrer and Torra, 2001b), distance-based record linkage (also with
Euclidean distance and equal weights) was used for evaluating other masking methods
as well. In their empirical work, distance-based record linkage outperforms probabilistic
record linkage (see Section 35.3.2 below).
The main advantages of using distances for record linkage are simplicity for the implementer
and intuitiveness for the user. Another strong point is that subjective information
(about individuals or attributes) can be included in the re-identification process by means
of appropriate distances.
The main difficulties for distance-based record linkage are (i) the selection of the
appropriate distance function, and (ii) the determination of the weights. In relation to the
distance function, for numerical data the Euclidean distance is the most used one.
Nevertheless, other distances have also been used, e.g. the Mahalanobis distance (Torra et
al., 2006) and some kernel-based ones (Torra et al., 2006). The difficulty of choosing a
distance is especially thorny in the cases of categorical attributes and of masking methods
such as local recoding, where the masked file contains new labels with respect to the original
data set. The determination of the weights is also a relevant problem that is difficult to solve.
In the case of the Euclidean distance, it is common to assign equal weights to all attributes;
in the case of the Mahalanobis distance, this problem is avoided because weights are
extracted from the covariance matrix.
• Probabilistic record linkage. Probabilistic record linkage also links pairs of records (a, b)
in data sets A and B, respectively. For each pair, an index is computed. Then, two
thresholds LT and NLT in the index range are used to label the pair as a linked, clerical or
non-linked pair: if the index is above LT, the pair is linked; if it is below NLT, the pair is
non-linked; a clerical pair is one that cannot be automatically classified as linked or
non-linked. When independence between attributes is assumed, the index can be computed
from the following two conditional probabilities for each attribute: the probability P(1|M)
of coincidence between the values of the attribute in two records a and b given that these
records are a real match, and the probability P(0|U) of non-coincidence between the values
of the attribute given that a and b are a real non-match.

To use probabilistic record linkage in an effective way, we need to set the thresholds LT
and NLT and estimate the conditional probabilities P(1|M) and P(0|U) used in the
computation of the indices. In plain words, thresholds are computed from: (i) the probability
P(LP|U) of linking a pair that is an unmatched pair (a false positive or false linkage) and
(ii) the probability P(NP|M) of not linking a pair that is a match (a false negative or false
unlinkage). Conditional probabilities P(1|M) and P(0|U) are usually estimated using the
EM algorithm (Dempster et al., 1977).
The original description of probabilistic record linkage can be found in (Fellegi and
Sunter, 1969) and (Jaro, 1989). (Torra and Domingo-Ferrer, 2003) describe the method
in detail (with examples) and (Winkler, 1993) presents a review of the state of the art on
probabilistic record linkage. In particular, this latter paper includes a discussion concerning
non-independent attributes. A (hierarchical) graphical model has recently been proposed
(Ravikumar and Cohen, 2004) that compares favorably with previous approaches.
Probabilistic record linkage methods are less simple than distance-based ones. However,
they do not require rescaling or weighting of attributes. The user only needs to provide the
two probabilities P(LP|U) (false positives) and P(NP|M) (false negatives).
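As promised above, here is a minimal sketch of distance-based record linkage with
standardized attributes and equal weights, in the spirit of (Pagliuca et al., 1999). The
synthetic data, the function names, and the noise-based masking used for the demonstration
are our assumptions.

```python
import numpy as np

def distance_based_linkage(A, B):
    """Link every record of B to its nearest record in A.

    A: protected records, B: intruder records (2-D numeric arrays over
    the shared quasi-identifiers). Attributes are standardized and
    equally weighted; Euclidean distance expresses nearness.
    """
    mu, sigma = A.mean(axis=0), A.std(axis=0) + 1e-12
    A_std, B_std = (A - mu) / sigma, (B - mu) / sigma
    # Distance from every record of B to every record of A.
    d = np.linalg.norm(B_std[:, None, :] - A_std[None, :, :], axis=2)
    return d.argmin(axis=1)

def reidentification_risk(links, true_links):
    """Proportion of correct matches among the intruder's records."""
    return float(np.mean(links == true_links))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))               # intruder's records (file B)
X_prot = X + rng.normal(0, 0.1, X.shape)    # masked, published file A
links = distance_based_linkage(X_prot, X)
print(reidentification_risk(links, np.arange(200)))  # close to 1 here
```

The returned proportion of correct links is exactly the re-identification measure of disclosure
risk described at the start of this section.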
The literature presents some other record linkage algorithms, some of which are variations
of the ones presented here. For example, (Bacher et al., 2002) presents a method based on
cluster analysis. Its results are similar to those of distance-based record linkage, as cluster
analysis assigns objects (in this case, records) that are similar (in this case, near) to each
other to the same cluster. The algorithms presented here permit two records b_1 and b_2 of
the intruder to be assigned to the same record a. There are algorithms that force different
records in B to be linked to different records in A.
The approaches described so far for record linkage do not use any information about the
data protection process. That is, they use files A and B and try to re-identify as many records
as possible. In this sense, they are general purpose record linkage algorithms.

In recent years, specific record linkage algorithms have been developed. They take
advantage of any information available about the data protection procedure. That is,
protection procedures are analyzed in detail to find flaws that can be used to build more
efficient record linkage algorithms, with larger matching rates. Attacks tailored to two
protection procedures are reported in the literature. (Torra and Miyamoto, 2004) was the
first specific record linkage approach for microaggregation; more effective algorithms have
been proposed in (Nin and Torra, 2009, Nin et al., 2008b) (for both univariate and
multivariate microaggregation). (Nin et al., 2007) describes an algorithm for data protected
using rank swapping.
The scenario described above can be relaxed so that the published file and the one of the
intruder do not share the set of variables, i.e., there are no common quasi-identifiers in the
two files. A few record linkage algorithms have been developed under this premise. In this
case, some structural information is assumed to be common to both files. (Torra, 2004)
follows this approach, and its use for disclosure risk assessment is described in
(Domingo-Ferrer and Torra, 2003).
35.4 Data Protection Procedures
Protection methods can be classified into three different categories depending on how they
manipulate the original data to define the protected data set.
• Perturbative. The original data set is distorted in some way, and the new data set might
contain some erroneous information. E.g., noise following an N(0,a) distribution, for a
given a, is added to an attribute. In this way, some combinations of values disappear and
new combinations appear in the protected data set. At the same time, combinations in the
protected data set no longer correspond to the ones in the original data set. This
obfuscation makes disclosure difficult for intruders. (A sketch of noise addition is given
after this list.)
• Non-perturbative. Protection is achieved by replacing an original value by another
one that is not incorrect but less specific. For example, we replace a real number by an
interval. In general, non-perturbative methods reduce the level of detail of the data set.
This detail reduction causes different records to have the same combinations of values,
which makes disclosure difficult for intruders.
• Synthetic Data Generators. In this case, instead of distorting the original data, new
artificial data is generated and used to substitute the original values. Formally, synthetic
data generators build a data model from the original data set and, subsequently, a new
(protected) data set is randomly generated constrained by the model computed.
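As referenced in the perturbative item above, the simplest instance is additive noise. A
minimal sketch follows; interpreting the parameter a of N(0,a) as the standard deviation of
the noise is our reading of the notation, and the data is invented for illustration.

```python
import numpy as np

def add_noise(x, a, seed=None):
    """Perturb a numeric attribute with Gaussian noise N(0, a).

    a is taken here as the standard deviation of the noise (an assumed
    convention); larger a means more protection and more information loss.
    """
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, a, size=len(x))

income = np.array([21000.0, 35000.0, 35000.0, 58000.0])
print(add_noise(income, a=1000.0, seed=42))
# Note how the two identical original values become distinct: original
# combinations disappear and new ones appear, as described above.
```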
An alternative dimension to classify protection methods is based on the type of data. The
basic distinction is between numerical and categorical data, although other types of data
(e.g. time series (Nin and Torra, 2006), sequences of events for location privacy, logs, etc.)
have also been considered in the literature.
• Numerical. As usual, an attribute is numerical when arithmetic operations, e.g.
subtraction, can be performed with it. Income and age are typical examples of such attributes.
With respect to disclosure risk, numerical values are likely to be unique in a database and
may therefore lead to disclosure if no action is taken.
• Categorical. In this case, the attribute takes values over a finite set and standard
numerical operations do not make sense. Ordinal and nominal scales are typically
distinguished among categorical attributes. In ordinal scales the order between values is
relevant (e.g. academic degree), whereas in nominal scales it is not (e.g. hair color).
Therefore, max and min operations are meaningful on ordinal scales but not on nominal scales.
Structured attributes are a subclass of categorical attributes. In this case, different
categories are related in terms of subclass or membership relationships. In some cases, a
hierarchy between categories can be inferred from these relationships. Cities, counties,
and provinces are typical examples of such hierarchical attributes. For some attributes the
hierarchy is given; for others it is not given but is constructed by the protection procedure.
In the next sections we review some of the existing protection methods following the
classification above. Some good reviews on data protection procedures are (Adam and
Wortmann, 1989, Domingo-Ferrer and Torra, 2001a, Willenborg, 2001). In addition, we have a
section about k-anonymity. As we will see later, k-anonymity is not a protection method but a
general approach for avoiding disclosure up to a certain extent. Different instantiations
exist, some using perturbative and some using non-perturbative procedures.
In this section we will use X to denote the original data set, X' to denote the protected data
set, and x_{i,V} to represent the value of the attribute V in the ith record.
35.4.1 Perturbative Methods
In this section we review some of the perturbative methods. Among them, the ones most
used by the statistical agencies are rank swapping and microaggregation (Felso et al.,
2001), but the literature on privacy preserving data mining, more oriented to business-related
applications, largely focuses on additive noise and microaggregation. Microaggregation and
rank swapping are simple and have a low computational cost. Most of the methods described
in this section, with some of their variants, are implemented in the sdcMicro package in R
(Templ, 2008) and in the μ-Argus software (Hundepool et al., 2003).
Rank Swapping
Rank swapping was originally proposed for ordinal attributes in (Moore, 1996) and later
applied to numerical data in (Domingo-Ferrer and Torra, 2001b). It was classified in
(Domingo-Ferrer and Torra, 2001b) among the best microdata protection methods for numerical
attributes and in (Torra, 2004) among the best for categorical attributes.
Rank swapping is defined for a single attribute V as described below. The application of
this method to a data file with several attributes is done attribute-wise, in a sequential way.
The algorithm depends on a parameter p that permits the user to control the amount of
disclosure risk. Normally, p corresponds to a percent of the total number of records in X.
• The records of X (for the considered attribute V) are sorted in increasing order.
• Let us assume, for simplicity, that the records are already sorted and that (a_1, ..., a_n)
are the sorted values in X. That is, a_i ≤ a_j for all 1 ≤ i < j ≤ n.
• Each value a_i is swapped with another value a_j, randomly and uniformly chosen from the
limited range i < j ≤ i + p.
• The sorting step is undone.
The algorithm shows that the smaller the p, the larger the risk. Note that when p increases
the difference between an original value x_i and its swapped value x'_i may increase
accordingly; therefore, the risk decreases. Nevertheless, in this case the differences between
the original and the protected data set are higher, so information loss increases.
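A minimal sketch of this algorithm for one numeric attribute follows. Expressing p directly
as a number of rank positions, and allowing a value to take part in more than one swap, are
simplifications of ours; implementations such as the ones in sdcMicro handle these details
more carefully.

```python
import numpy as np

def rank_swap(values, p, seed=None):
    """Simplified rank swapping for one numeric attribute.

    p: maximum swap range, in rank positions (the text's percent
    parameter multiplied by the number of records).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    order = np.argsort(x)              # step 1: sort the values
    sorted_x = x[order].copy()
    for i in range(len(sorted_x)):
        # step 2: swap a_i with a_j, with j uniform in (i, i + p]
        j = min(i + rng.integers(1, p + 1), len(sorted_x) - 1)
        sorted_x[i], sorted_x[j] = sorted_x[j], sorted_x[i]
    out = np.empty_like(x)
    out[order] = sorted_x              # step 3: undo the sorting
    return out

print(rank_swap([5, 1, 9, 3, 7, 2, 8], p=2, seed=0))
```

A small p keeps each protected value close in rank to the original (low information loss,
higher risk); a large p does the opposite, matching the trade-off described above.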
(Nin et al., 2007) proves that specific attacks can be designed against this kind of rank
swapping and proposes two alternative algorithms, p-buckets and p-distribution rank swapping,
where the swapping is not constrained to a specific interval. In this way, the range for
swapping includes the whole set (a_1, ..., a_n), although farther values have a small
probability of being swapped, and the intruder cannot take advantage of the closed intervals
in the attack. Other variants of rank swapping include (Carlson and Salabasis, 2002) and
(Takemura, 2002).

Microaggregation
Microaggregation was originally defined for numerical attributes (Defays and Nanopoulos,
1993) (see also (Domingo and Mateo, 2002)) and later extended to categorical data (Torra,
2004) (see also (Domingo-Ferrer and Torra, 2005)) and to time series (Nin and Torra, 2006).
(Felso et al., 2001) shows that microaggregation is a method used by many statistical
agencies, and (Domingo-Ferrer and Torra, 2001b) shows that, for numerical data, it is one of
the methods with the best trade-off between information loss and disclosure risk. (Torra, 2004)
describes its good performance in comparison with other methods for categorical data.
Microaggregation is operationally defined in terms of two steps: partition and aggregation.
• Partition. Records are partitioned into several clusters, each of them consisting of at least
k records.
• Aggregation. For each of the clusters a representative (the centroid) is computed, and
then the original records are replaced by the representative of the cluster to which they
belong.
This approach permits protected data to satisfy privacy constraints, as all k records in the
cluster are replaced by the same value. In this way, k controls the privacy in the protected
data.
We can formalize microaggregation using u_{ij} to describe the partition of the records in
X: u_{ij} = 1 if record j is assigned to the ith cluster. Let v_i be the representative of the
ith cluster; then a general formulation of microaggregation with g clusters and a given k is
as follows:

Minimize SSE = Σ_{i=1}^{g} Σ_{j=1}^{n} u_{ij} (d(x_j, v_i))^2
Subject to Σ_{i=1}^{g} u_{ij} = 1 for all j = 1, ..., n
           2k ≥ Σ_{j=1}^{n} u_{ij} ≥ k for all i = 1, ..., g
           u_{ij} ∈ {0, 1}
For numerical data it is usual to require that d(x, v) is the Euclidean distance. In the
general case, when attributes V = (V_1, ..., V_s) are considered, x and v are vectors, and d
becomes d^2(x, v) = Σ_{V_i ∈ V} (x_{V_i} − v_{V_i})^2. In addition, it is also common to
require for numerical data that v_i is defined as the arithmetic mean of the records in the
cluster, i.e., v_i = Σ_{j=1}^{n} u_{ij} x_j / Σ_{j=1}^{n} u_{ij}.
In the case of univariate microaggregation (for the Euclidean distance and the arithmetic
mean), there exist algorithms that find an optimal solution in polynomial time (Hansen and
Mukherjee, 2003) (Algorithm 1 describes such a method). In contrast, for multivariate data
sets the problem becomes NP-hard (Oganian and Domingo-Ferrer, 2000). For this reason,
heuristic methods have been proposed in the literature.
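A minimal sketch of one such heuristic follows, in the spirit of MDAV-style methods. The
simplification is ours: real MDAV builds clusters around the two most extreme records per
iteration, whereas this sketch uses one.

```python
import numpy as np

def mdav_like(X, k):
    """Heuristic multivariate microaggregation (simplified MDAV-like
    sketch, not the optimal univariate Hansen-Mukherjee algorithm).

    Repeatedly takes the record farthest from the centroid of the
    remaining data, groups it with its k-1 nearest neighbours, and
    replaces each group by its arithmetic mean (the aggregation step).
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    remaining = list(range(len(X)))
    while len(remaining) >= 2 * k:
        R = X[remaining]
        # record farthest from the centroid of the remaining records
        far = remaining[int(np.argmax(np.linalg.norm(R - R.mean(axis=0), axis=1)))]
        d = np.linalg.norm(X[remaining] - X[far], axis=1)
        group = [remaining[i] for i in np.argsort(d)[:k]]  # far + k-1 nearest
        out[group] = X[group].mean(axis=0)
        remaining = [i for i in remaining if i not in group]
    if remaining:                      # final cluster of size k .. 2k-1
        out[remaining] = X[remaining].mean(axis=0)
    return out

rng = np.random.default_rng(2)
print(mdav_like(rng.normal(size=(10, 2)), k=3))
```

Every record in the output is shared by at least k records, which is the privacy guarantee the
formulation above encodes through the constraint k ≤ Σ_j u_{ij} ≤ 2k.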
The general formulation given above permits us to apply microaggregation to
multidimensional data. Nevertheless, when the number of attributes is large, it is usual to
apply microaggregation to subsets of attributes; otherwise, the information loss is very high
(Aggarwal, 2005). Individual ranking is a multivariate approach that consists of applying
microaggregation to each of the attributes independently. Alternatively, a partition of the
attributes is constructed and microaggregation is applied to each subset.
Applying microaggregation to subsets of attributes decreases information loss, but at the
cost of increasing disclosure risk. See (Nin et al., 2008a) for an analysis of how to build
these partitions (i.e., whether it is preferable to select correlated or uncorrelated attributes
when defining the partition) and of their effect on information loss and disclosure risk. (Nin
et al., 2008a) shows that the selection of uncorrelated attributes decreases disclosure risk
and can lead to a better trade-off between disclosure risk and information loss.
