680 Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas
$$\delta(\lambda) = \begin{cases} 1 & \text{if } \lambda \notin Y_i \\ 0 & \text{otherwise} \end{cases}$$
Coverage evaluates how far we need, on average, to go down the ranked list of labels in
order to cover all the relevant labels of the example.
$$\mathrm{Cov} = \frac{1}{m}\sum_{i=1}^{m} \max_{\lambda \in Y_i} r_i(\lambda) - 1$$
Ranking loss expresses the number of times that irrelevant labels are ranked higher than
relevant labels:
$$\text{R-Loss} = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{|Y_i|\,|\overline{Y_i}|} \left|\left\{(\lambda_a,\lambda_b) : r_i(\lambda_a) > r_i(\lambda_b),\ (\lambda_a,\lambda_b) \in Y_i \times \overline{Y_i}\right\}\right|$$
where $\overline{Y_i}$ is the complementary set of $Y_i$ with respect to $L$.
Average precision evaluates the average fraction of labels ranked above a particular label $\lambda \in Y_i$ which actually are in $Y_i$:
$$\mathrm{AvgPrec} = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{\lambda \in Y_i} \frac{\left|\{\lambda' \in Y_i : r_i(\lambda') \le r_i(\lambda)\}\right|}{r_i(\lambda)}$$
34.7.3 Hierarchical
The hierarchical loss (Cesa-Bianchi et al., 2006b) is a modified version of the Hamming loss
that takes into account an existing hierarchical structure of the labels. It examines the predicted
labels in a top-down manner according to the hierarchy and whenever the prediction for a label
is wrong, the subtree rooted at that node is not considered further in the calculation of the loss.

Let anc($\lambda$) be the set of all the ancestor nodes of $\lambda$, and let $Z_i$ denote the predicted set of labels for example $i$. The hierarchical loss is defined as follows:
$$\text{H-Loss} = \frac{1}{m}\sum_{i=1}^{m} \left|\left\{\lambda : \lambda \in Y_i \,\Delta\, Z_i,\ \mathrm{anc}(\lambda) \cap (Y_i \,\Delta\, Z_i) = \emptyset\right\}\right|$$
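A direct computation of this loss can be sketched as follows. The code is our illustration, not an implementation from the chapter; `parent` is an assumed encoding of the label hierarchy, mapping each label to its parent (or `None` for a root).

```python
def h_loss(true_sets, pred_sets, parent):
    """Hierarchical loss: count labels in the symmetric difference Y_i Δ Z_i
    none of whose ancestors is also in the symmetric difference, so errors
    below a first wrong prediction are not counted again."""
    def ancestors(label):
        anc, p = set(), parent[label]
        while p is not None:
            anc.add(p)
            p = parent[p]
        return anc

    total = 0
    for Y, Z in zip(true_sets, pred_sets):
        diff = Y ^ Z                                  # symmetric difference
        total += sum(1 for l in diff if not (ancestors(l) & diff))
    return total / len(true_sets)
```

For a chain A → B → C, predicting only {A} when the truth is {A, B, C} costs 1 (the error at B), since the error at C is below it in the hierarchy.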
Several other measures for hierarchical (multi-label) classification are examined in (Moskovitch et al., 2006, Sun & Lim, 2001).
34.8 Related Tasks
One of the most popular supervised learning tasks is multi-class classification, which involves a set of labels L, where |L| > 2. The critical difference with respect to multi-label classification is that each instance is associated with only one element of L, instead of a subset of L.
Jin and Ghahramani (Jin & Ghahramani, 2002) use the term multiple-label problems for the semi-supervised classification problems where each example is associated with more than one class, but only one of those classes is the true class of the example. This task is not as common in real-world applications as the one we are studying.
Multiple-instance or multi-instance learning is a variation of supervised learning, where labels are assigned to bags of instances (Maron & Lozano-Pérez, 1998). In certain applications, the training data can be considered as both multi-instance and multi-label (Zhou, 2007). In image classification, for example, the different regions of an image can be considered as multiple instances, each of which can be labeled with a different concept, such as sunset and sea.
Several methods have been recently proposed for addressing such data (Zhou & Zhang, 2006,
Zha et al., 2008).
In multitask learning (Caruana, 1997) we try to solve many similar tasks in parallel, usually using a shared representation. By taking advantage of the common characteristics of these tasks, better generalization can be achieved. A typical example is learning to identify handwritten text for different writers in parallel: training data from one writer can aid the construction of better predictive models for the other writers.
34.9 Multi-Label Data Mining Software
There exist a number of implementations of specific algorithms for mining multi-label data, most of which have been discussed in Section 34.2.2. The BoosTexter system implements the boosting-based approaches proposed in (Schapire & Singer, 2000). There also exist Matlab implementations of MLkNN and BPMLL.

There is also more general-purpose software that handles multi-label data as part of its functionality. LibSVM (Chang & Lin, 2001) is a library for support vector machines that can learn from multi-label data using the binary relevance transformation. Clus is a predictive clustering system based on decision tree learning. Its capabilities include (hierarchical) multi-label classification.
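The binary relevance transformation mentioned above is simple enough to sketch. The helpers below are our illustration, not LibSVM code: the transformation builds one binary dataset per label, each of which can then be handed to any binary learner.

```python
def binary_relevance_datasets(X, Y, labels):
    """Transform a multi-label training set into one binary dataset per label.

    X is a list of feature vectors; Y[i] is the set of labels of X[i].
    Returns {label: (X, binary_targets)}; a binary classifier (e.g. an SVM)
    can then be trained independently on each returned dataset.
    """
    return {l: (X, [1 if l in y else 0 for y in Y]) for l in labels}

def binary_relevance_predict(models, x):
    """Predict the set of labels whose trained binary model fires on x."""
    return {l for l, model in models.items() if model(x) == 1}
```

At prediction time, the union of the positive binary decisions yields the predicted label set.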
Finally, Mulan is an open-source software library devoted to multi-label data mining. It includes implementations of a large number of learning algorithms, basic capabilities for dimensionality reduction and hierarchical multi-label classification, and an extensive evaluation framework.
References
Barutcuoglu, Z., Schapire, R. E. & Troyanskaya, O. G. (2006). Bioinformatics 22, 830–836.
Blockeel, H., Schietgat, L., Struyf, J., Džeroski, S. & Clare, A. (2006). Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4213 LNAI, 18–29.
Boleda, G., im Walde, S. S. & Badia, T. (2007). In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning pp. 171–180, Prague.
Boutell, M., Luo, J., Shen, X. & Brown, C. (2004). Pattern Recognition 37, 1757–1771.
Brinker, K., Fürnkranz, J. & Hüllermeier, E. (2006). In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI ’06) pp. 489–493, Riva del Garda, Italy.
Brinker, K. & Hüllermeier, E. (2007). In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI ’07) pp. 702–707, Hyderabad, India.
Caruana, R. (1997). Machine Learning 28, 41–75.
Cesa-Bianchi, N., Gentile, C. & Zaniboni, L. (2006a). In ICML ’06: Proceedings of the 23rd International Conference on Machine Learning pp. 177–184.
Cesa-Bianchi, N., Gentile, C. & Zaniboni, L. (2006b). Journal of Machine Learning Research
7, 31–54.
Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at ~cjlin/libsvm.
Chawla, N. V., Japkowicz, N. & Kotcz, A. (2004). SIGKDD Explorations 6, 1–6.
Chen, W., Yan, J., Zhang, B., Chen, Z. & Yang, Q. (2007). In Proc. 7th IEEE International
Conference on Data Mining pp. 451–456, IEEE Computer Society, Los Alamitos, CA,
USA.
Clare, A. & King, R. (2001). In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001) pp. 42–53, Freiburg, Germany.
Crammer, K. & Singer, Y. (2003). Journal of Machine Learning Research 3, 1025–1058.
de Comite, F., Gilleron, R. & Tommasi, M. (2003). In Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2003) pp. 35–49, Leipzig, Germany.
Diplaris, S., Tsoumakas, G., Mitkas, P. & Vlahavas, I. (2005). In Proceedings of the 10th Panhellenic Conference on Informatics (PCI 2005) pp. 448–456, Volos, Greece.

Elisseeff, A. & Weston, J. (2002). In Advances in Neural Information Processing Systems
14.
Esuli, A., Fagni, T. & Sebastiani, F. (2008). Information Retrieval 11, 287–313.
Fürnkranz, J., Hüllermeier, E., Mencia, E. L. & Brinker, K. (2008). Machine Learning.
Gao, S., Wu, W., Lee, C.-H. & Chua, T.-S. (2004). In Proceedings of the 21st International Conference on Machine Learning (ICML ’04) p. 42, Banff, Alberta, Canada.
Ghamrawi, N. & McCallum, A. (2005). In Proceedings of the 2005 ACM Conference on Information and Knowledge Management (CIKM ’05) pp. 195–200, Bremen, Germany.
Godbole, S. & Sarawagi, S. (2004). In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004) pp. 22–30.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., Richter, J., Rubin, G. M., Blake, J. A., Bult, C., Dolan, M., Drabkin, H., Eppig, J. T., Hill, D. P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J. M., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S., Fisk, D. G., Hirschman, J. E., Hong, E. L., Nash, R. S., Sethuraman, A., Theesfeld, C. L., Botstein, D., Dolinski, K., Feierbach, B., Berardini, T., Mundodi, S., Rhee, S. Y., Apweiler, R., Barrell, D., Camon, E., Dimmer, E., Lee, V., Chisholm, R., Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E. M., Sternberg, P., Gwinn, M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de La, Tonellato, P., Jaiswal, P., Seigfried, T. & White, R. (2004). Nucleic Acids Res 32.
Hüllermeier, E., Fürnkranz, J., Cheng, W. & Brinker, K. (2008). Artificial Intelligence 172, 1897–1916.

Ji, S., Tang, L., Yu, S. & Ye, J. (2008). In Proceedings of the 14th SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, USA.
Jin, R. & Ghahramani, Z. (2002). In Proceedings of Neural Information Processing Systems
2002 (NIPS 2002), Vancouver, Canada.
Katakis, I., Tsoumakas, G. & Vlahavas, I. (2008). In Proceedings of the ECML/PKDD 2008
Discovery Challenge, Antwerp, Belgium.
Kohavi, R. & John, G. H. (1997). Artificial Intelligence 97, 273–324.
Lewis, D. D., Yang, Y., Rose, T. G. & Li, F. (2004). Journal of Machine Learning Research 5, 361–397.
Li, T. & Ogihara, M. (2003). In Proceedings of the International Symposium on Music Information Retrieval pp. 239–240, Washington D.C., USA.
Li, T. & Ogihara, M. (2006). IEEE Transactions on Multimedia 8, 564–574.
Loza Mencia, E. & Fürnkranz, J. (2008a). In 2008 IEEE International Joint Conference on Neural Networks (IJCNN-08) pp. 2900–2907, Hong Kong.
Loza Mencia, E. & Fürnkranz, J. (2008b). In 12th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2008 pp. 50–65, Antwerp, Belgium.
Luo, X. & Zincir-Heywood, A. (2005). In Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems pp. 161–169.
Maron, O. & Lozano-Pérez, T. (1998). In Advances in Neural Information Processing Systems 10 pp. 570–576, MIT Press.
McCallum, A. (1999). In Proceedings of the AAAI’ 99 Workshop on Text Learning.
Mencia, E. L. & Fürnkranz, J. (2008). In 12th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2008, Antwerp, Belgium.

Moskovitch, R., Cohenkashi, S., Dror, U., Levy, I., Maimon, A. & Shahar, Y. (2006). Artifi-
cial Intelligence in Medicine 37, 177–190.
Park, C. H. & Lee, M. (2008). Pattern Recognition Letters 29, 878–887.
Pestian, J. P., Brew, C., Matykiewicz, P., Hovermale, D. J., Johnson, N., Cohen, K. B. &
Duch, W. (2007). In BioNLP ’07: Proceedings of the Workshop on BioNLP 2007 pp.
97–104, Association for Computational Linguistics, Morristown, NJ, USA.
Qi, G J., Hua, X S., Rui, Y., Tang, J., Mei, T. & Zhang, H J. (2007). In MULTIMEDIA
’07: Proceedings of the 15th international conference on Multimedia pp. 17–26, ACM,
New York, NY, USA.
Read, J. (2008). In Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008) pp. 143–150.
Rokach, L. (2008). Pattern Recognition 41, 1676–1700.
Rokach, L. (2008). International Journal of Intelligent Systems Technologies and Applications 4, 57–78.
Rokach, L., Maimon, O. & Lavi, I. (2003). In Proceedings of the 14th International Symposium on Methodologies for Intelligent Systems pp. 24–31, Springer-Verlag, Maebashi, Japan.
Rousu, J., Saunders, C., Szedmak, S. & Shawe-Taylor, J. (2006). Journal of Machine Learning Research 7, 1601–1626.
Ruepp, A., Zollner, A., Maier, D., Albermann, K., Hani, J., Mokrejs, M., Tetko, I., Güldener, U., Mannhaupt, G., Münsterkötter, M. & Mewes, H. W. (2004). Nucleic Acids Res 32, 5539–5545.

Schapire, R. E. & Singer, Y. (2000). Machine Learning 39, 135–168.
Snoek, C. G. M., Worring, M., van Gemert, J. C., Geusebroek, J.-M. & Smeulders, A. W. M. (2006). In MULTIMEDIA ’06: Proceedings of the 14th annual ACM international conference on Multimedia pp. 421–430, ACM, New York, NY, USA.
Spyromitros, E., Tsoumakas, G. & Vlahavas, I. (2008). In Proc. 5th Hellenic Conference on
Artificial Intelligence (SETN 2008).
Srivastava, A. & Zane-Ulman, B. (2005). In IEEE Aerospace Conference.
Streich, A. P. & Buhmann, J. M. (2008). In 12th European Conference on Principles and
Practice of Knowledge Discovery in Databases, PKDD 2008, Antwerp, Belgium.
Sun, A. & Lim, E P. (2001). In ICDM ’01: Proceedings of the 2001 IEEE International
Conference on Data Mining pp. 521–528, IEEE Computer Society, Washington, DC,
USA.
Sun, L., Ji, S. & Ye, J. (2008). In Proceedings of the 14th SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, USA.
Thabtah, F., Cowling, P. & Peng, Y. (2004). In Proceedings of the 4th IEEE International Conference on Data Mining, ICDM ’04 pp. 217–224.
Trohidis, K., Tsoumakas, G., Kalliris, G. & Vlahavas, I. (2008). In Proc. 9th International Conference on Music Information Retrieval (ISMIR 2008), Philadelphia, PA, USA.
Tsoumakas, G. & Katakis, I. (2007). International Journal of Data Warehousing and Mining
3, 1–13.
Tsoumakas, G., Katakis, I. & Vlahavas, I. (2008). In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08) pp. 30–44.
Tsoumakas, G. & Vlahavas, I. (2007). In Proceedings of the 18th European Conference on Machine Learning (ECML 2007) pp. 406–417, Warsaw, Poland.
Ueda, N. & Saito, K. (2003). Advances in Neural Information Processing Systems 15, 721–728.
Veloso, A., Wagner, M. J., Goncalves, M. & Zaki, M. (2007). In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007) vol. LNAI 4702, pp. 605–612, Springer, Warsaw, Poland.
Vembu, S. & Gärtner, T. (2009). In Preference Learning, (Fürnkranz, J. & Hüllermeier, E., eds). Springer.
Vens, C., Struyf, J., Schietgat, L., Džeroski, S. & Blockeel, H. (2008). Machine Learning 73, 185–214.
Wieczorkowska, A., Synak, P. & Ras, Z. (2006). In Proceedings of the 2006 International Conference on Intelligent Information Processing and Web Mining (IIPWM’06) pp. 307–315.
Wolpert, D. (1992). Neural Networks 5, 241–259.
Yang, S., Kim, S.-K. & Ro, Y. M. (2007). IEEE Transactions on Circuits and Systems for Video Technology 17, 324–335.
Yang, Y. (1999). Journal of Information Retrieval 1, 67–88.
Yang, Y. & Pedersen, J. O. (1997). In Proceedings of ICML-97, 14th International Conference on Machine Learning, (Fisher, D. H., ed.), pp. 412–420, Morgan Kaufmann Publishers, San Francisco, US, Nashville, US.
Yu, K., Yu, S. & Tresp, V. (2005). In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval pp. 258–265, ACM Press, Salvador, Brazil.
Zha, Z.-J., Hua, X.-S., Mei, T., Wang, J., Qi, G.-J. & Wang, Z. (2008). In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008) pp. 1–8.
Zhang, M L. & Zhou, Z H. (2006). IEEE Transactions on Knowledge and Data Engineering
18, 1338–1351.

Zhang, M L. & Zhou, Z H. (2007a). Pattern Recognition 40, 2038–2048.
Zhang, M.-L. & Zhou, Z.-H. (2007b). In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence pp. 669–674, AAAI Press, Vancouver, British Columbia, Canada.
Zhang, Y., Burer, S. & Street, W. N. (2006). Journal of Machine Learning Research 7,
1315–1338.
Zhang, Y. & Zhou, Z H. (2008). In Proceedings of the Twenty-Third AAAI Conference on
Artificial Intelligence, AAAI 2008 pp. 1503–1505, AAAI Press, Chicago, Illinois, USA.
Zhou, Z H. (2007). In Proceedings of the 3rd International Conference on Advanced Data
Mining and Applications (ADMA’07) p. 1. Springer.
Zhou, Z. H. & Zhang, M. L. (2006). In NIPS, (Schölkopf, B., Platt, J. C. & Hoffman, T., eds), pp. 1609–1616, MIT Press.
Zhu, S., Ji, X., Xu, W. & Gong, Y. (2005). In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in Information Retrieval pp. 274–281.

35
Privacy in Data Mining
Vicenç Torra
IIIA - CSIC, Campus UAB s/n, 08193 Bellaterra, Catalonia, Spain

Summary. In this chapter we describe the main tools for privacy in data mining. We present
an overview of the tools for protecting data, and then we focus on protection procedures.
Information loss and disclosure risk measures are also described.
35.1 Introduction
Data is nowadays gathered in large amounts by companies and national offices. This data is often analyzed using either statistical methods or data mining ones. When such methods are applied within the walls of the company that has gathered the data, the danger of disclosure of sensitive information might be limited. In contrast, when the analysis has to be performed by third parties, privacy becomes a much more relevant issue.
To make matters worse, the scenario where an analysis requires data not from a single data source, but from several data sources, is not uncommon. This is the case of banks looking for fraud detection and hospitals analyzing diseases and treatments. In the first case, data from several banks might help in fraud detection. Similarly, data from different hospitals might help in the process of finding the causes of a bad response to a given treatment, or the causes of a given disease.
Privacy-Preserving Data Mining (Aggarwal and Yu, 2008) (PPDM) and Statistical Disclosure Control (Willenborg, 2001, Domingo-Ferrer and Torra, 2001a) (SDC) are two related fields with a similar interest in ensuring data privacy. Their goal is to avoid the disclosure of sensitive or proprietary information to third parties.
Within these fields, several methods have been proposed for processing and analysing data without compromising privacy, and for releasing data while ensuring some level of data privacy; measures and indices have been defined for evaluating disclosure risk (that is, to what extent the data satisfy the privacy constraints), and data utility or information loss (that is, to what extent the protected data is still useful for applications). In addition, tools have been proposed to visualize and compare different approaches for data protection.
In this chapter we will review some of the existing methods and give an overview of the measures. The structure of the chapter is as follows. In Section 35.2, we present a classification of protection procedures. In Section 35.3, we review different interpretations for risk and give an overview of disclosure risk measures. In Section 35.4, we present major protection procedures. Also in this section we review k-anonymity. Then, Section 35.5 is focused on how to measure data utility and information loss. A few information loss measures are reviewed there. The chapter finishes in Section 35.6, presenting different approaches for visualizing the trade-off between risk and utility, or risk and information loss. Some conclusions close the chapter.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_35, © Springer Science+Business Media, LLC 2010
35.2 On the Classification of Protection Procedures
The literature on Privacy Preserving Data Mining (PPDM) and on Statistical Disclosure Control (SDC) is vast, and a large number of procedures for ensuring privacy have been proposed. We classify them into three categories according to the prior knowledge the data owner has about the usage of the data.
Data-driven or general purpose protection procedures. In this case, no specific analysis or usage is foreseen for the data. The data owner does not know what kind of analysis will be performed by the third party.
This is the case when data is released for public use, as there is no way to know what kind of study a potential user will perform. This situation is common in National Statistical Offices, where data obtained from censuses and questionnaires can be, e.g., downloaded from the internet (census.gov). A similar case can occur for other public offices that regularly publish data obtained from questionnaires. Another case is when data are transferred to, e.g., researchers so that they can analyse them. Hospitals and other healthcare institutions can also be the target of such protection procedures, as they can be interested in protection procedures that permit different researchers to apply different data analysis tools (e.g., regression, clustering, association rules).
Within data-driven procedures, subcategories can be distinguished according to the type of data used. The main distinction about data types is between original data files (e.g., individuals described in terms of attributes) and aggregates of the data (e.g., contingency tables). In the statistical disclosure control community, the former type corresponds to microdata and the latter to tabular data.
With respect to the type or structure of the original files, most of the research has been focused on standard files with numerical or categorical data (ordinal or nominal categorical data). Nevertheless, other more complex types of data have also been considered in the literature, e.g., multirelational databases, logs, and social networks. Another aspect to be considered in relation to the structure of the files is the constraints that the protected data needs to satisfy (e.g., when there is a linear combination of some variables). Data protection methods need to consider such constraints so that the protected data also satisfies them (see e.g. (Torra, 2008) for details on a classification of the constraints and a study of microaggregation under this light).
Computation-driven or specific purpose protection procedures. In this case it is known beforehand which type of analysis has to be applied to the data. As the data uses are known, protection procedures are defined according to the intended subsequent computation. Thus, protection procedures are tailored to a specific purpose.
This will be the case of a retailer with a commercial database with information on customers holding a fidelity card, when such data has to be transferred to a third party for market basket analysis. For example, there exist tailored procedures for data protection for association rules, which can be applied in this context of market basket analysis.
Results-driven protection procedures. In this case, privacy concerns the result of applying a particular data mining method to some particular data (Atallah et al., 1999, Atzori et al., 2008). For example, the association rules obtained from a commercial database should not permit the disclosure of sensitive information about particular customers.
Although this class of procedures can be seen as computation-driven, they are important enough to deserve their own class. This class of methods is also known as anonymity-preserving pattern discovery (Atzori et al., 2008), result privacy (Bertino et al., 2008), and output secrecy (Haritsa, 2008).
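The data-driven category above mentions microaggregation (Torra, 2008) as a masking method studied under such constraints. As a loose illustration of the idea, and not code from the chapter, a univariate sketch: records are sorted, grouped into clusters of at least k, and each value is replaced by its group mean, so every released value is shared by at least k records.

```python
def microaggregate(values, k=3):
    """Univariate microaggregation sketch: replace each value by the mean of
    its group of >= k nearest (in sorted order) records."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Partition the sorted indices into consecutive groups of size >= k.
    groups, start = [], 0
    while start < len(order):
        end = start + k
        if len(order) - end < k:   # remainder too small for its own group
            end = len(order)
        groups.append(order[start:end])
        start = end
    out = [0.0] * len(values)
    for g in groups:
        mean = sum(values[i] for i in g) / len(g)
        for i in g:
            out[i] = mean
    return out
```

Real microaggregation methods optimize the grouping (e.g., minimizing within-group variance) and handle multivariate records; this sketch only shows the masking principle.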
Other dimensions have been considered in the literature for classifying protection proce-
dures. One of them concerns the number of data sources.
Single data source. The data analysis only requires data from a single source.
Multiple data sources. Data from different sources have to be combined in order to compute
a certain analysis.
The analysis of data protection procedures for multiple data sources usually falls within the computation-driven approach. A typical scenario in this setting is when a few companies collaborate in a certain analysis, each one providing its own database. In the typical scenario within data privacy, data owners want to compute such an analysis without disclosing their own data to the other data owners. So, the goal is that at the end of the analysis the only additional information obtained by each of the data owners is the result of the analysis itself. That is, no extra knowledge should be acquired while computing the analysis.
A trivial approach for solving this problem is to consider a trusted third party (TTP) that computes the analysis. This is the centralized approach. In this case, data is just transferred using a completely secure channel (i.e., using cryptographic protocols). In contrast, in distributed privacy-preserving data mining, data owners compute the analysis in a collaborative manner. In this way, the trusted third party is not needed. For such computation, cryptographic tools are also used.
Multiple data sources for data-driven protection procedures have limited interest. Each data owner can publish its own data protected using general purpose protection procedures, and then the data can be linked (using e.g. record linkage algorithms) and finally analysed. So, this roughly corresponds to multidatabase mining.
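The TTP-free, distributed computation described above can be illustrated with the classic secure-sum toy protocol based on additive secret shares. This is our illustrative sketch, not a protocol from the chapter, and it simulates all parties in one process; a deployment would exchange the shares over secure channels.

```python
import random

def secure_sum(private_values, modulus=2**61 - 1):
    """Toy secure sum: each party splits its value into n random additive
    shares (one per party), so no single party sees another's input; the
    shares are summed per party and recombined to reveal only the total."""
    n = len(private_values)
    shares = []
    for v in private_values:
        parts = [random.randrange(modulus) for _ in range(n - 1)]
        parts.append((v - sum(parts)) % modulus)  # shares sum to v mod modulus
        shares.append(parts)
    # Party j locally sums the j-th share received from every party...
    partial = [sum(shares[i][j] for i in range(n)) % modulus for j in range(n)]
    # ...and the partial sums combine into the total, and nothing more.
    return sum(partial) % modulus
```

Here each party learns only the final sum; full privacy-preserving data mining protocols build analyses such as classification or clustering out of primitives of this kind.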
The literature often classifies protection procedures using another dimension, concerning the type of tools used. That is, methods are classified as following either the perturbative or the cryptographic approach. Our classification given above encompasses these two approaches. General purpose protection procedures follow the so-called perturbative approach, while computation-driven protection procedures mainly follow the cryptographic approach. Note, however, that there are some papers on perturbative approaches, e.g. noise addition, for specific uses such as association rules (see (Atallah et al., 1999)). Nevertheless, such methods are general enough to be used in other applications. So, they are general purpose protection procedures.
In addition, it is important to underline that, in this chapter, we will not use the term perturbative approach with the interpretation above. Instead, we will use the term perturbative methods/approaches in a more restricted way (see Section 35.4), as is usual in the statistical disclosure control community.
In the rest of this section we further discuss both computation-driven and data-driven procedures.
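As a small illustration of the perturbative approach mentioned above, the following sketch masks a numerical attribute by additive Gaussian noise whose variance is a fraction of the attribute's variance. The parameterization is our own assumption, not taken from the chapter; it only shows the risk/information-loss trade-off knob.

```python
import random

def add_noise(values, variance_fraction=0.1, seed=None):
    """Perturbative masking by noise addition: release x + e with
    e ~ N(0, variance_fraction * Var(x)). Larger fractions lower
    disclosure risk but increase information loss."""
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    sd = (variance_fraction * var) ** 0.5
    return [v + rng.gauss(0, sd) for v in values]
```

A constant attribute (zero variance) is released unchanged, since there is nothing for the noise variance to scale against.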