Privacy in statistical databases UNESCO chair in data privacy international conference, PSD 2016

LNCS 9867

Josep Domingo-Ferrer
Mirjana Pejić-Bach (Eds.)

Privacy in
Statistical Databases
UNESCO Chair in Data Privacy
International Conference, PSD 2016
Dubrovnik, Croatia, September 14–16, 2016, Proceedings

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany

Editors
Josep Domingo-Ferrer
Universitat Rovira i Virgili
Tarragona
Spain

Mirjana Pejić-Bach
University of Zagreb
Zagreb
Croatia

ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-45380-4
ISBN 978-3-319-45381-1 (eBook)
DOI 10.1007/978-3-319-45381-1
Library of Congress Control Number: 2016948609
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, speciﬁcally the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microﬁlms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a speciﬁc statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

Preface

Privacy in statistical databases is a discipline whose purpose is to provide solutions to
the tension between the social, political, economic, and corporate demand for accurate
information, and the legal and ethical obligation to protect the privacy of the various
parties involved. Those parties are the subjects, sometimes also known as respondents
(the individuals and enterprises to which the data refer), the data controllers (those
organizations collecting, curating, and to some extent sharing or releasing the data),
and the users (the ones querying the database or the search engine, who would like their
queries to stay conﬁdential). Beyond law and ethics, there are also practical reasons for
data controllers to invest in subject privacy: if individual subjects feel their privacy is
guaranteed, they are likely to provide more accurate responses. Data controller privacy
is primarily motivated by practical considerations: if an enterprise collects data at its
own expense and responsibility, it may wish to minimize leakage of those data to other
enterprises (even to those with whom joint data exploitation is planned). Finally, user
privacy results in increased user satisfaction, even if it may curtail the ability of the data
controller to proﬁle users.
There are at least two traditions in statistical database privacy, both of which started
in the 1970s: the ﬁrst one stems from ofﬁcial statistics, where the discipline is also
known as statistical disclosure control (SDC) or statistical disclosure limitation (SDL),
and the second one originates from computer science and database technology.
In official statistics, the basic concern is subject privacy. In computer science, the initial
motivation was also subject privacy but, from 2000 onwards, growing attention has
been devoted to controller privacy (privacy-preserving data mining) and user privacy
(private information retrieval). In the last few years, the interest and the achievements
of computer scientists in the topic have substantially increased, as reﬂected in the
contents of this volume. At the same time, the generalization of big data is challenging
privacy technologies in many ways: this volume also contains recent research aimed at
tackling some of these challenges.
“Privacy in Statistical Databases 2016” (PSD 2016) was held under the sponsorship
of the UNESCO Chair in Data Privacy, which has provided a stable umbrella for the
PSD biennial conference series since 2008. Previous PSD conferences were PSD 2014,
held in Eivissa; PSD 2012, held in Palermo; PSD 2010, held in Corfu; PSD 2008, held
in Istanbul; PSD 2006, the ﬁnal conference of the Eurostat-funded CENEX-SDC
project, held in Rome; and PSD 2004, the ﬁnal conference of the European FP5 CASC
project, held in Barcelona.
Proceedings of PSD 2014, PSD 2012, PSD 2010, PSD 2008, PSD 2006, and PSD
2004 were published by Springer in LNCS 8744, LNCS 7556, LNCS 6344, LNCS
5262, LNCS 4302, and LNCS 3050, respectively.
The seven PSD conferences held so far are a follow-up of a series of high-quality
technical conferences on SDC that started eighteen years ago with “Statistical Data
Protection-SDP’98”, held in Lisbon in 1998 and with proceedings published by
OPOCE, and continued with the AMRADS project SDC Workshop, held in Luxemburg in 2001 and with proceedings published by Springer in LNCS 2316.
The PSD 2016 Program Committee accepted for publication in this volume
19 papers out of 35 submissions. Furthermore, 5 of the above submissions were
reviewed for short presentation at the conference and inclusion in the companion CD
proceedings. Papers came from 14 different countries and four different continents.
Each submitted paper received at least two reviews. The revised versions of the 19
accepted papers in this volume are a ﬁne blend of contributions from ofﬁcial statistics
and computer science.
Covered topics include tabular data protection, microdata and big data masking,
protection using privacy models, synthetic data, disclosure risk assessment, remote and
cloud access, and co-utile anonymization.
We are indebted to many people. First, to the Organization Committee for making
the conference possible and especially to Jesús A. Manjón, who helped prepare these
proceedings, and Goran Lesaja, who helped in the local arrangements. In evaluating the
papers we were assisted by the Program Committee and by Yu-Xiang Wang as an
external reviewer.
We also wish to thank all the authors of submitted papers and we apologize for
possible omissions.
Finally, we dedicate this volume to the memory of Dr Lawrence Cox, who was a
Program Committee member of all past editions of the PSD conference.
July 2016

Josep Domingo-Ferrer
Mirjana Pejić-Bach

Organization

Program Committee
Jane Bambauer, University of Arizona, USA
Bettina Berendt, Katholieke Universiteit Leuven, Belgium
Elisa Bertino, CERIAS, Purdue University, USA
Aleksandra Bujnowska, EUROSTAT, European Union
Jordi Castro, Polytechnical University of Catalonia, Catalonia
Lawrence Cox, National Institute of Statistical Sciences, USA
Josep Domingo-Ferrer, Universitat Rovira i Virgili, Catalonia
Jörg Drechsler, IAB, Germany
Mark Elliot, Manchester University, UK
Stephen Fienberg, Carnegie Mellon University, USA
Sarah Giessing, Destatis, Germany
Sara Hajian, Eurecat Technology Center, Catalonia
Julia Lane, New York University, USA
Bradley Malin, Vanderbilt University, USA
Oliver Mason, National University of Ireland-Maynooth, Ireland
Laura McKenna, Census Bureau, USA
Gerome Miklau, University of Massachusetts-Amherst, USA
Krishnamurty Muralidhar, The University of Oklahoma, USA
Anna Oganian, National Center for Health Statistics, USA
Christine O’Keefe, CSIRO, Australia
Jerry Reiter, Duke University, USA
Yosef Rinott, Hebrew University, Israel
Juan José Salazar, University of La Laguna, Spain
Pierangela Samarati, University of Milan, Italy
David Sánchez, Universitat Rovira i Virgili, Catalonia
Eric Schulte-Nordholt, Statistics Netherlands
Natalie Shlomo, University of Manchester, UK
Aleksandra Slavković, Penn State University, USA
Jordi Soria-Comas, Universitat Rovira i Virgili, Catalonia
Tamir Tassa, The Open University, Israel
Vicenç Torra, Skövde University, Sweden
Vassilios Verykios, Hellenic Open University, Greece
William E. Winkler, Census Bureau, USA
Peter-Paul de Wolf, Statistics Netherlands

Program Chair
Josep Domingo-Ferrer, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Catalonia

General Chair
Mirjana Pejić-Bach, Faculty of Business & Economics, University of Zagreb, Croatia

Organization Committee
Vlasta Brunsko, Centre for Advanced Academic Studies, University of Zagreb, Croatia
Ksenija Dumicic, Faculty of Business & Economics, University of Zagreb, Croatia
Joaquín García-Alfaro, Télécom SudParis, France
Goran Lesaja, Georgia Southern University, USA
Jesús A. Manjón, Universitat Rovira i Virgili, Catalonia
Tamar Molina, Universitat Rovira i Virgili, Catalonia
Sara Ricci, Universitat Rovira i Virgili, Catalonia

Contents

Tabular Data Protection

Revisiting Interval Protection, a.k.a. Partial Cell Suppression, for Tabular Data
Jordi Castro and Anna Via

Precision Threshold and Noise: An Alternative Framework of Sensitivity Measures
Darren Gray

Empirical Analysis of Sensitivity Rules: Cells with Frequency Exceeding 10 that Should Be Suppressed Based on Descriptive Statistics
Kiyomi Shirakawa, Yutaka Abe, and Shinsuke Ito

A Second Order Cone Formulation of Continuous CTA Model
Goran Lesaja, Jordi Castro, and Anna Oganian

Microdata and Big Data Masking

Anonymization in the Time of Big Data
Josep Domingo-Ferrer and Jordi Soria-Comas

Propensity Score Based Conditional Group Swapping for Disclosure Limitation of Strata-Defining Variables
Anna Oganian and Goran Lesaja

A Rule-Based Approach to Local Anonymization for Exclusivity Handling in Statistical Databases
Jens Albrecht, Marc Fiedler, and Tim Kiefer

Perturbative Data Protection of Multivariate Nominal Datasets
Mercedes Rodriguez-Garcia, David Sánchez, and Montserrat Batet

Spatial Smoothing and Statistical Disclosure Control
Edwin de Jonge and Peter-Paul de Wolf

Protection Using Privacy Models

On-Average KL-Privacy and Its Equivalence to Generalization for Max-Entropy Mechanisms
Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg

Correcting Finite Sampling Issues in Entropy l-diversity
Sebastian Stammler, Stefan Katzenbeisser, and Kay Hamacher

Synthetic Data

Creating an ‘Academic Use File’ Based on Descriptive Statistics: Synthetic Microdata from the Perspective of Distribution Type
Kiyomi Shirakawa, Yutaka Abe, and Shinsuke Ito

COCOA: A Synthetic Data Generator for Testing Anonymization Techniques
Vanessa Ayala-Rivera, A. Omar Portillo-Dominguez, Liam Murphy, and Christina Thorpe

Remote and Cloud Access

Towards a National Remote Access System for Register-Based Research
Annu Cabrera

Accurate Estimation of Structural Equation Models with Remote Partitioned Data
Joshua Snoke, Timothy Brick, and Aleksandra Slavković

A New Algorithm for Protecting Aggregate Business Microdata via a Remote System
Yue Ma, Yan-Xia Lin, James Chipperfield, John Newman, and Victoria Leaver

Disclosure Risk Assessment

Rank-Based Record Linkage for Re-Identification Risk Assessment
Krishnamurty Muralidhar and Josep Domingo-Ferrer

Computational Issues in the Design of Transition Probabilities and Disclosure Risk Estimation for Additive Noise
Sarah Giessing

Co-utile Anonymization

Enabling Collaborative Privacy in User-Generated Emergency Reports
Amna Qureshi, Helena Rifà-Pous, and David Megías

Author Index

Tabular Data Protection

Revisiting Interval Protection, a.k.a. Partial Cell Suppression, for Tabular Data

Jordi Castro (1) and Anna Via (2)

(1) Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Jordi Girona 1–3, 08034 Barcelona, Catalonia, Spain
(2) School of Mathematics and Statistics, Universitat Politècnica de Catalunya, Pau Gargallo 5, 08028 Barcelona, Catalonia, Spain

Abstract. Interval protection or partial cell suppression was introduced in “M. Fischetti, J.-J. Salazar, Partial cell suppression: A new methodology for statistical disclosure control, Statistics and Computing, 13, 13–21, 2003” as a “linearization” of the difficult cell suppression problem. Interval protection replaces some cells by intervals containing the original cell value, unlike in cell suppression where the values are suppressed. Although the resulting optimization problem is still huge, as in cell suppression, it is linear, thus allowing the application of efficient procedures. In this work we present preliminary results with a prototype implementation of Benders decomposition for interval protection. Although the above seminal publication about partial cell suppression applied a similar methodology, our approach differs in two aspects: (i) the boundaries of the intervals are completely independent in our implementation, whereas the one of 2003 solved a simpler variant where boundaries must satisfy a certain ratio; (ii) our prototype is applied to a set of seven general and hierarchical tables, whereas only three two-dimensional tables were solved with the implementation of 2003.
Keywords: Statistical disclosure control · Tabular data · Interval protection · Cell suppression · Linear optimization · Large-scale optimization

Supported by grant MTM2015-65362-R of the Spanish Ministry of Economy and Competitiveness.
© Springer International Publishing Switzerland 2016
J. Domingo-Ferrer and M. Pejić-Bach (Eds.): PSD 2016, LNCS 9867, pp. 3–14, 2016.
DOI: 10.1007/978-3-319-45381-1_1

1 Introduction

Post-tabular data protection methods are based on modifying or suppressing some of the table cells, while satisfying the table additivity (that is, the sum of the “inner” cells has to be equal to the marginal cell) and preserving the original value of a subset of cells (e.g., some subtotal or total cells). This is the main difference compared to pre-tabular methods, which cannot guarantee at the same time table additivity and the original value of a subset of cells. Among post-tabular data protection methods we find cell suppression [4,9] and controlled tabular adjustment [1,3], both of which are formulated as difficult mixed integer linear optimization problems. More details can be found in the monograph [12] and the survey [5].
Interval protection or partial cell suppression was introduced in [10] as a linearization of the difficult cell suppression problem. Unlike in cell suppression, interval protection replaces some cell values by intervals containing the true value. From those intervals, no attacker should be able to recompute the true value within some predefined lower and upper protection levels. One of the great advantages of interval protection over alternative approaches is that the resulting optimization problem is convex and continuous, which means that theoretically it can be efficiently solved in polynomial time by, for instance, interior-point methods [13]. Therefore, theoretically, this approach is valid for big tables from the big-data era.
However, attempting to solve the resulting “monolithic” linear optimization model with a state-of-the-art solver is almost impossible for huge tables: we will either exhaust the RAM of the computer or require a large CPU time. Alternative approaches to be tried include a Benders decomposition of this huge linear optimization problem. In this work we present preliminary results with a prototype implementation of Benders decomposition. A similar approach was used in the seminal publication [10] about partial cell suppression. However, this work differs in two substantial aspects: (i) our implementation considers two independent boundaries for each cell interval, whereas those two boundaries were forced to satisfy a ratio in the code of [10] (that is, actually only one boundary was considered in the 2003 code, thus solving a simpler variant of the problem); (ii) we applied our prototype to a set of seven general and hierarchical tables, whereas results for only three two-dimensional tables were reported in [10]. As we will see, our “not-too efficient and tuned” classical Benders decomposition prototype still outperforms state-of-the-art solvers in these complex tables.
The paper is organized as follows. Section 2 describes the general interval protection method. Section 3 outlines the Benders solution approach. The particular form of Benders for interval protection is shown in Sect. 4, which is illustrated by a small example in Subsect. 4.1. Finally, Sect. 5 reports computational results with some general and hierarchical tables.

2 The General Interval Protection Problem Formulation

We are given a table (i.e., a set of cells $a_i$, $i \in N = \{1, \dots, n\}$), satisfying $m$ linear relations $Aa = b$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$. Any set of values $x$ satisfying $Ax = b$, $l \le x \le u$, is a valid table, where $l \in \mathbb{R}^n$ and $u \in \mathbb{R}^n$ are lower and upper bounds for the cell values that are known a priori. For positive tables we have $l_i = 0$, $u_i = +\infty$, $i = 1, \dots, n$, but the procedure outlined here is also valid for general tables. For instance, we may consider that the cells provide information about some attribute for several individual states (e.g., member states of the European Union), as well as the highest level of aggregated information (e.g., at European Union level). The set of multi-state cells, or cells providing this highest level of aggregated information, could be the ones to be replaced by intervals, and they will be denoted as $H \subseteq N$.
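The definition of a valid table can be checked mechanically. The following minimal Python sketch is only an illustration: the function name `is_valid_table` and the dense list-of-lists encoding of $A$ are assumptions made here, and the numbers are those of the small two-row table used later in Subsect. 4.1 (row totals only, positive table).

```python
from math import inf

def is_valid_table(A, b, x, l, u, tol=1e-9):
    """Return True if x satisfies A x = b and l <= x <= u componentwise."""
    additive = all(
        abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i) <= tol
        for row, b_i in zip(A, b)
    )
    bounded = all(l_i - tol <= x_i <= u_i + tol for l_i, x_i, u_i in zip(l, x, u))
    return additive and bounded

# Two relations: a1 + a2 - a3 = 0 and a4 + a5 - a6 = 0 (illustrative encoding).
A = [[1, 1, -1, 0, 0, 0],
     [0, 0, 0, 1, 1, -1]]
b = [0, 0]
a = [10, 15, 25, 20, 17, 37]
l = [0] * 6      # positive table: l_i = 0
u = [inf] * 6    # and u_i = +infinity

print(is_valid_table(A, b, a, l, u))                          # True
print(is_valid_table(A, b, [10, 15, 26, 20, 17, 37], l, u))   # False: first row is not additive
```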
Let $F, S, M$ be a partition of $N$, i.e., $N = F \cup S \cup M$, and $F \cap S = F \cap M = S \cap M = \emptyset$. $S$ is the set of sensitive cells to be protected, with upper and lower protection levels $upl_s$ and $lpl_s$ for each cell $s \in S$. $F$ is the set of cells whose values are known (e.g., they have been previously published by individual states). $M$ is the set of non-sensitive cells that have not been previously published. To simplify the formulation of the forthcoming optimization problems, we can assume that for $f \in F$ we have $l_f = u_f = a_f$, and then cells from $F$ can be considered elements of $M$, that is, $M \leftarrow M \cup F$ and $F \leftarrow \emptyset$. Following our example, we have that, in general, cells in $S$ provide information at state level, but in some cases multi-state cells may also be sensitive; thus we may have $S \cap H \neq \emptyset$. In a similar way, since multi-state cells may not have been previously published we may also have $M \cap H \neq \emptyset$. To make the formulation more general our only assumption will be that $H \subseteq N$. When $H = N$ we just have the standard “interval protection” or “partial cell suppression” introduced in [10].
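Our purpose is to publish the set of smallest intervals $[lb_h, ub_h]$ (where $l_h \le lb_h$ and $ub_h \le u_h$) for each cell $h \in H$ instead of the real value $a_h \in [lb_h, ub_h]$, such that, from these intervals, no attacker can determine that $a_s \in (a_s - lpl_s, a_s + upl_s)$ for all sensitive cells $s \in S$. This means that

\[
\underline{a}_s \le a_s - lpl_s \quad \text{and} \quad \overline{a}_s \ge a_s + upl_s, \tag{1}
\]

$\underline{a}_s$ and $\overline{a}_s$ being defined as

\[
\begin{array}{rll}
\underline{a}_s = \min & x_s \\
\text{s.to} & Ax = b \\
& l_i \le x_i \le u_i & i \in N \setminus H \\
& lb_i \le x_i \le ub_i & i \in H
\end{array}
\qquad \text{and} \qquad
\begin{array}{rll}
\overline{a}_s = \max & x_s \\
\text{s.to} & Ax = b \\
& l_i \le x_i \le u_i & i \in N \setminus H \\
& lb_i \le x_i \le ub_i & i \in H
\end{array}
\tag{2}
\]

Clearly, for cells $i \in H \cap S$, (1) and (2) imply that $lb_i \le a_i - lpl_i$ and $ub_i \ge a_i + upl_i$.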
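As an illustration of (2), and only under the simplifying assumption (not made in the text) that the table consists of a single relation $x_1 + x_2 - x_3 = 0$ with $H = N$, the two attacker problems for cell 1 admit a closed form, because $x_1 = x_3 - x_2$ and the true table lies inside the published intervals:

```latex
\[
\underline{a}_1 = \max\{\, lb_1,\; lb_3 - ub_2 \,\},
\qquad
\overline{a}_1 = \min\{\, ub_1,\; ub_3 - lb_2 \,\}.
\]
```

In this special case condition (1) reads $\max\{lb_1, lb_3 - ub_2\} \le a_1 - lpl_1$ and $\min\{ub_1, ub_3 - lb_2\} \ge a_1 + upl_1$; for general tables no such closed form is available and the two linear problems in (2) have to be solved.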
The previous problem can be formulated as a large-scale linear optimization problem. For each primary cell $s \in S$, two auxiliary vectors $x^{l,s} \in \mathbb{R}^n$ and $x^{u,s} \in \mathbb{R}^n$ are introduced to impose, respectively, the lower and upper protection requirement of (1). The problem formulation is as follows:

\[
\begin{array}{ll}
\min & \displaystyle\sum_{i \in H} w_i (ub_i - lb_i) \\[2ex]
\text{s.to} & \left.
\begin{array}{ll}
A x^{l,s} = b \\
l_i \le x_i^{l,s} \le u_i & i \in N \setminus H \\
lb_i \le x_i^{l,s} \le ub_i & i \in H \\
x_s^{l,s} \le a_s - lpl_s \\[1ex]
A x^{u,s} = b \\
l_i \le x_i^{u,s} \le u_i & i \in N \setminus H \\
lb_i \le x_i^{u,s} \le ub_i & i \in H \\
x_s^{u,s} \ge a_s + upl_s
\end{array}
\right\} \; \forall s \in S \\[2ex]
& l_i \le lb_i \le a_i \quad i \in H \\
& a_i \le ub_i \le u_i \quad i \in H
\end{array}
\tag{3}
\]

where $w_i$ is a weight for the information loss associated with cell $a_i$.
Problem (3) is very large (easily in the order of millions of variables and constraints), but it is linear (no binary, no integer variables), and thus theoretically it can be efficiently solved in polynomial time by general or by specialized interior-point algorithms [7,13].
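To make the size of (3) concrete, the following Python sketch counts its variables and equality constraints directly from the formulation: two copies of the table per sensitive cell, plus one `lb` and one `ub` variable per cell in $H$. The function name `problem3_size` and the instance dimensions are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def problem3_size(n, m, n_sensitive, n_interval_cells):
    """Count variables and equality constraints of formulation (3).

    n: number of cells, m: number of table relations,
    n_sensitive: |S|, n_interval_cells: |H|.
    """
    # Two copies of the table (x^{l,s} and x^{u,s}) per sensitive cell,
    # plus one lb and one ub variable per cell in H.
    n_vars = 2 * n * n_sensitive + 2 * n_interval_cells
    # Each of the 2|S| copies must satisfy the m relations A x = b.
    n_eq = 2 * m * n_sensitive
    return n_vars, n_eq

# Hypothetical instance: 10,000 cells, 1,500 relations, 1,000 sensitive cells, H = N.
print(problem3_size(10_000, 1_500, 1_000, 10_000))  # (20020000, 3000000)
```

The count grows with the product $n \cdot |S|$, which is why even moderately sized tables lead to very large monolithic models.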

3 Outline of Benders Decomposition

Benders decomposition [2] was suggested for problems with two types of variables, one of them considered as “complicating variables”. In MILP models the complicating variables are the binary/integer ones; in continuous problems, the complicating variables are usually associated with linking variables between groups of constraints (i.e., variables $lb$ and $ub$ in (3)). Consider the following primal problem (P) with two groups of variables $(x, y)$

\[
(P) \qquad
\begin{array}{ll}
\min & c^\top x + d^\top y \\
\text{s.to} & A_1 x + A_2 y = b \\
& x \ge 0 \\
& y \in Y,
\end{array}
\]

where $y$ are the complicating variables, $c, x \in \mathbb{R}^{n_1}$, $d, y \in \mathbb{R}^{n_2}$, $A_1 \in \mathbb{R}^{m \times n_1}$ and $A_2 \in \mathbb{R}^{m \times n_2}$. Fixing some $y \in Y$, we obtain:

\[
(Q) \qquad
\begin{array}{ll}
\min & c^\top x \\
\text{s.to} & A_1 x = b - A_2 y \\
& x \ge 0.
\end{array}
\]

The dual of (Q) is:

\[
(Q_D) \qquad
\begin{array}{ll}
\max & u^\top (b - A_2 y) \\
\text{s.to} & A_1^\top u \le c \\
& u \in \mathbb{R}^m.
\end{array}
\]

It is known that if $(Q_D)$ has a solution then (Q) has a solution too, and both objective functions coincide; if $(Q_D)$ is unbounded, then (Q) is infeasible. Let us assume that $(Q_D)$ is never infeasible (in the interval protection problem this is always the case). If, as a notational convention, we consider that the objective of (Q) is $+\infty$ when it is infeasible, then (P) can be written as

\[
(P) \qquad
\begin{array}{ll}
\min & d^\top y + \max \{\, u^\top (b - A_2 y) \mid A_1^\top u \le c,\; u \in \mathbb{R}^m \,\} \\
\text{s.to} & y \in Y.
\end{array}
\]

Let $U = \{ u \mid A_1^\top u \le c,\; u \in \mathbb{R}^m \}$ be the convex feasible set of $(Q_D)$. By the Minkowski representation we know that every point $u \in U$ may be represented as a convex combination of the vertices $u^1, \dots, u^s$ plus a nonnegative combination of the extreme rays $v^1, \dots, v^t$ of the convex polyhedron $U$. Therefore any $u \in U$ may be written as

\[
u = \sum_{i=1}^{s} \lambda_i u^i + \sum_{j=1}^{t} \mu_j v^j,
\qquad \sum_{i=1}^{s} \lambda_i = 1,
\qquad \lambda_i \ge 0 \;\; i = 1, \dots, s,
\qquad \mu_j \ge 0 \;\; j = 1, \dots, t.
\]

If $(v^j)^\top (b - A_2 y) > 0$ for some $j \in \{1, \dots, t\}$ then $(Q_D)$ is unbounded, and thus (Q) is infeasible. We then impose

\[
(v^j)^\top (b - A_2 y) \le 0 \qquad j = 1, \dots, t.
\]

The optimal solution of $(Q_D)$ is then known to be at a vertex of $U$, and (P) may be rewritten as

\[
(P) \qquad
\begin{array}{ll}
\min & d^\top y + \displaystyle\max_{i = 1, \dots, s} \, (u^i)^\top (b - A_2 y) \\
\text{s.to} & (v^j)^\top (b - A_2 y) \le 0 \quad j = 1, \dots, t \\
& y \in Y.
\end{array}
\]

Introducing variable $\theta$, (P) is equivalent to the Benders problem (BP):

\[
(BP) \qquad
\begin{array}{lll}
\min & \theta \\
\text{s.to} & \theta \ge d^\top y + (u^i)^\top (b - A_2 y) & i = 1, \dots, s \\
& (v^j)^\top (b - A_2 y) \le 0 & j = 1, \dots, t \\
& y \in Y.
\end{array}
\]

Problem (BP) is impractical since $s$ and $t$ can be very large, and in addition the vertices and extreme rays are unknown. Instead, the method considers a relaxation $(BP_r)$ with a subset of the vertices and extreme rays. The relaxed Benders problem (or master problem) is thus:

\[
(BP_r) \qquad
\begin{array}{lll}
\min & \theta \\
\text{s.to} & \theta \ge d^\top y + (u^i)^\top (b - A_2 y) & i \in I \subseteq \{1, \dots, s\} \\
& (v^j)^\top (b - A_2 y) \le 0 & j \in J \subseteq \{1, \dots, t\} \\
& y \in Y.
\end{array}
\]

Initially $I = J = \emptyset$, and new vertices and extreme rays provided by the subproblem $(Q_D)$ are added to the master problem, until the optimal solution is found.
In summary, the steps of the Benders algorithm are:
Benders algorithm
0. Initially $I = \emptyset$ and $J = \emptyset$. Let $(\theta_r^*, y_r^*)$ be the solution of the current master problem $(BP_r)$, and $(\theta^*, y^*)$ the optimal solution of (BP).
1. Solve the master problem $(BP_r)$, obtaining $\theta_r^*$ and $y_r^*$. At the first iteration, $\theta_r^* = -\infty$ and $y_r^*$ is any feasible point in $Y$.
2. Solve the subproblem $(Q_D)$ using $y = y_r^*$. There are two cases:
(a) $(Q_D)$ has a finite optimal solution at vertex $u^{i_0}$.
– If $\theta_r^* = d^\top y_r^* + (u^{i_0})^\top (b - A_2 y_r^*)$ then STOP. The optimal solution is $y^* = y_r^*$ with cost $\theta^* = \theta_r^*$.
– If $\theta_r^* < d^\top y_r^* + (u^{i_0})^\top (b - A_2 y_r^*)$ then this solution violates the constraint $\theta \ge d^\top y + (u^{i_0})^\top (b - A_2 y)$ of (BP). Add this new constraint to $(BP_r)$: $I \leftarrow I \cup \{i_0\}$.
(b) $(Q_D)$ is unbounded along the half-line $u^{i_0} + \lambda v^{j_0}$, $\lambda \ge 0$ ($u^{i_0}$ is the current vertex, $v^{j_0}$ is an extreme ray). Then this solution violates the constraint $(v^{j_0})^\top (b - A_2 y) \le 0$ of (BP). Add this new constraint to $(BP_r)$: $J \leftarrow J \cup \{j_0\}$; the vertex may also be added: $I \leftarrow I \cup \{i_0\}$.
3. Go to step 1 above.
Convergence is guaranteed since at each iteration one or two constraints are added to $(BP_r)$, no constraints are repeated, and the maximum number of constraints is $s + t$.
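The steps above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the prototype described in the paper: the set $Y$ is a finite list so the master problem can be solved by enumeration, and the vertices and extreme rays of $U$ are given explicitly so the subproblem $(Q_D)$ is also solved by enumeration instead of by an LP solver. The function `benders`, the helper `dot` and the numerical instance are all hypothetical.

```python
from math import inf

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def benders(Y, d, b, A2, vertices, rays, tol=1e-9):
    """Toy Benders loop for min c'x + d'y, A1 x + A2 y = b, x >= 0, y in Y.

    Y is a finite list of candidate y vectors, and the dual feasible set U is
    given explicitly by its vertices and extreme rays, so both the master
    problem and the subproblem are solved by plain enumeration.
    """
    I, J = [], []  # indices of the vertices / extreme rays already in (BP_r)

    def residual(y):  # b - A2 y
        return [bi - dot(row, y) for bi, row in zip(b, A2)]

    while True:
        # Step 1: relaxed master problem (BP_r).
        best = None
        for y in Y:
            r = residual(y)
            if any(dot(rays[j], r) > tol for j in J):
                continue  # y violates a feasibility cut
            theta = max((dot(d, y) + dot(vertices[i], r) for i in I), default=-inf)
            if best is None or theta < best[0]:
                best = (theta, y)
        if best is None:
            raise ValueError("master problem is infeasible")
        theta, y = best
        r = residual(y)

        # Step 2(b): (Q_D) unbounded along some extreme ray -> feasibility cut.
        j0 = next((j for j, v in enumerate(rays) if dot(v, r) > tol), None)
        if j0 is not None:
            J.append(j0)
            continue

        # Step 2(a): (Q_D) attains its optimum at a vertex.
        i0 = max(range(len(vertices)), key=lambda i: dot(vertices[i], r))
        value = dot(d, y) + dot(vertices[i0], r)
        if theta >= value - tol:
            return y, theta, I, J  # master value matches the subproblem: optimal
        I.append(i0)  # optimality cut

# Hypothetical instance: min 2*x + y  s.t.  x + 2*y = 4, x >= 0, y in {0,...,4}.
# Here U = {u : u <= 2}, with the single vertex u = 2 and extreme ray v = -1.
Y = [[k] for k in range(5)]
y_opt, theta_opt, I, J = benders(Y, d=[1], b=[4], A2=[[2]], vertices=[[2]], rays=[[-1]])
print(y_opt, theta_opt)  # [2] 2
```

On this instance the loop adds one optimality cut (after the first iteration) and one feasibility cut ($y \le 2$, after the master proposes $y = 4$), and stops at the third iteration.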

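4 Benders Decomposition for the Interval Protection Problem

Problem (3) has two groups of variables: $x^{l,s} \in \mathbb{R}^n$, $x^{u,s} \in \mathbb{R}^n$; and $lb \in \mathbb{R}^{|H|}$, $ub \in \mathbb{R}^{|H|}$, which can be seen as the complicating variables, since if they are fixed, the resulting problem in variables $x^{l,s}$ and $x^{u,s}$ is separable, as shown below. Indeed, projecting out the $x^{l,s}$, $x^{u,s}$ variables, (3) can be written as

\[
\begin{array}{lll}
\min & \displaystyle\sum_{i \in H} w_i (ub_i - lb_i) + Q(ub, lb) \\
\text{s.to} & l_i \le lb_i \le a_i & i \in H \\
& a_i \le ub_i \le u_i & i \in H
\end{array}
\tag{4}
\]

where

\[
\begin{array}{ll}
Q(ub, lb) = \min & \displaystyle\sum_{s \in S} \left( 0_n^\top x^{l,s} + 0_n^\top x^{u,s} \right) = 0 \\[2ex]
\text{s.to} & \left.
\begin{array}{ll}
A x^{l,s} = b \\
l_i \le x_i^{l,s} \le u_i & i \in N \setminus H \\
lb_i \le x_i^{l,s} \le ub_i & i \in H \\
x_s^{l,s} \le a_s - lpl_s \\[1ex]
A x^{u,s} = b \\
l_i \le x_i^{u,s} \le u_i & i \in N \setminus H \\
lb_i \le x_i^{u,s} \le ub_i & i \in H \\
x_s^{u,s} \ge a_s + upl_s
\end{array}
\right\} \; \forall s \in S,
\end{array}
\tag{5}
\]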

$0_n \in \mathbb{R}^n$ denoting the zero vector. Problem (5) is separable in the $x^{l,s}$, $x^{u,s}$ variables for each $s \in S$, so it can be replaced by the solution of $2|S|$ smaller problems of the form

\[
\begin{array}{lll}
Q_{l,s}(ub, lb) = \min & 0_n^\top x^{l,s} = 0 \\
\text{s.to} & A x^{l,s} = b \\
& l_i \le x_i^{l,s} \le u_i & i \in N \setminus H \\
& lb_i \le x_i^{l,s} \le ub_i & i \in H \\
& x_s^{l,s} \le a_s - lpl_s,
\end{array}
\tag{6}
\]

for the lower protection of sensitive cell $s \in S$, and

\[
\begin{array}{lll}
Q_{u,s}(ub, lb) = \min & 0_n^\top x^{u,s} = 0 \\
\text{s.to} & A x^{u,s} = b \\
& l_i \le x_i^{u,s} \le u_i & i \in N \setminus H \\
& lb_i \le x_i^{u,s} \le ub_i & i \in H \\
& x_s^{u,s} \ge a_s + upl_s,
\end{array}
\tag{7}
\]

for the upper protection of sensitive cell $s \in S$. Note that (5)–(7) are just feasibility problems with a constant (dummy) objective function.
Problems (6) and (7) are our Benders subproblems. Since they are feasibility problems, the Benders algorithm will only include extreme rays of the dual formulations of (6) and (7), to guarantee the feasibility of the values of $lb$ and $ub$ provided by the master problem.
Denoting the $j$-th extreme ray of the dual formulation of (6) as $v^j_{l,s} = (v_j^{\lambda^{l,s}}, v_j^{\mu_l^{l,s}}, v_j^{\mu_u^{l,s}}, v_j^{\nu^{l,s}})$, where $\lambda^{l,s}$, $\mu_l^{l,s}$, $\mu_u^{l,s}$ and $\nu^{l,s}$ refer to the indices of the Lagrange multipliers of the constraints of (6), it can be shown that the feasibility cut to be added to the master problem would be

\[
0 \ge \sum_{i=1}^{m} v_{j,i}^{\lambda^{l,s}} b_i
+ \sum_{i \in N \setminus H} \left( -v_{j,i}^{\mu_u^{l,s}} u_i + v_{j,i}^{\mu_l^{l,s}} l_i \right)
+ \sum_{i \in H} \left( -v_{j,i}^{\mu_u^{l,s}} ub_i + v_{j,i}^{\mu_l^{l,s}} lb_i \right)
- (a_s - lpl_s)\, v_j^{\nu^{l,s}}
= g^j_{l,s}(ub, lb).
\tag{8}
\]

The extreme rays of the dual of (7) have an analogous form $v^j_{u,s} = (v_j^{\lambda^{u,s}}, v_j^{\mu_l^{u,s}}, v_j^{\mu_u^{u,s}}, v_j^{\nu^{u,s}})$, and so does the feasibility cut to be added to the master problem:

\[
0 \ge \sum_{i=1}^{m} v_{j,i}^{\lambda^{u,s}} b_i
+ \sum_{i \in N \setminus H} \left( -v_{j,i}^{\mu_u^{u,s}} u_i + v_{j,i}^{\mu_l^{u,s}} l_i \right)
+ \sum_{i \in H} \left( -v_{j,i}^{\mu_u^{u,s}} ub_i + v_{j,i}^{\mu_l^{u,s}} lb_i \right)
+ (a_s + upl_s)\, v_j^{\nu^{u,s}}
= g^j_{u,s}(ub, lb).
\tag{9}
\]

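Denoting as $I_{l,s}$ and $I_{u,s}$ the sets of indices of the feasibility cuts obtained from $Q_{l,s}$ and $Q_{u,s}$, the master problem is:

\[
\begin{array}{lll}
\min & \displaystyle\sum_{i \in H} w_i (ub_i - lb_i) \\
\text{s.to} & g^j_{l,s}(ub, lb) \le 0 & j \in I_{l,s}, \; s \in S \\
& g^j_{u,s}(ub, lb) \le 0 & j \in I_{u,s}, \; s \in S \\
& l_i \le lb_i \le a_i & i \in H \\
& a_i \le ub_i \le u_i & i \in H.
\end{array}
\tag{10}
\]

The Benders decomposition algorithm will then solve (10) for the master problem and the duals of (6) and (7) for the subproblems.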
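To see what a cut of the form (8) looks like, consider, as an assumption made only for this illustration, a table with the single relation $x_1 + x_2 - x_3 = 0$ (so $b = 0$), $H = N$ and sensitive cell $s = 1$. When (6) is infeasible because the intervals are too tight, one dual extreme ray has unit components on the relation, on the upper bound of cell 2, on the lower bound of cell 3 and on the protection constraint, and zeros elsewhere. The sign convention used below is the one suggested by the form of (8); it is an assumption, since the dual is not written out in the text.

```latex
\[
v^{\lambda} = 1,\quad v^{\mu_u}_2 = 1,\quad v^{\mu_l}_3 = 1,\quad v^{\nu} = 1
\;\;\Longrightarrow\;\;
0 \ge -ub_2 + lb_3 - (a_1 - lpl_1)
\;\;\Longleftrightarrow\;\;
lb_3 - ub_2 \le a_1 - lpl_1 .
\]
```

The same inequality follows directly from the constraints of (6), since any feasible point satisfies $lb_3 \le x_3 = x_1 + x_2 \le (a_1 - lpl_1) + ub_2$.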
4.1 Illustrative Example

Consider the following simple table

10   15   25
20   17   37

of $n = 6$ cells and $m = 2$ linear constraints associated with row totals

\[
a_1 + a_2 - a_3 = 0, \qquad a_4 + a_5 - a_6 = 0
\]

(we don’t consider column totals to simplify the example), where $H = N = \{1, \dots, 6\}$, and $a_1$ and $a_5$ are the two sensitive cells, whose parameters are given by

s    a_s   lpl_s   upl_s
1    10    5       5
5    17    7       4

Note that this example, in principle, cannot be solved with the original implementation of [10] since the ratios between upper and lower protection levels are not the same for all sensitive cells.
We next show the application of the Benders algorithm to the previous table:

1. Initialization.
The number of cuts for the $lb$ and the $ub$ variables is set to 0, which means $I_{l,s} = I_{u,s} = \emptyset$. The first master problem to be solved is thus

\[
\begin{array}{lll}
\min & \displaystyle\sum_{i=1}^{6} (ub_i - lb_i) \\
\text{s.to} & l_i \le lb_i \le a_i & i = 1, \dots, 6 \\
& a_i \le ub_i \le u_i & i = 1, \dots, 6,
\end{array}
\]

obtaining some initial values for $lb$, $ub$.
2. Iterating Through Benders’ Algorithm.
Cut generation is based on (8)–(9); details are omitted to simplify the exposition.
– Iteration 1. The two Benders cuts obtained for cell 1 are $lb_1 \le 5$ and $ub_1 \ge 15$. The two Benders cuts obtained for cell 5 are $lb_5 \le 10$ and $ub_5 \ge 21$. Note these are obvious cuts associated with the protection levels of sensitive cells, which could have been added from the beginning in an efficient implementation, thus avoiding this first Benders iteration.
– Iteration 2. The current master subproblem
6

min i=1 (ubi − lbi )
i ∈ 1, . . . , 6
s.to li ≤ lbi ≤ ai
ai ≤ ubi ≤ ui i ∈ 1, . . . , 6
ub1 ≥ 15
lb1 ≤ 5,
ub5 ≥ 21
lb5 ≤ 10,
has solution lb = [5, 15, 25, 20, 10, 37] and ub = [15, 15, 25, 20, 21, 37]. Using

this solution the two Benders cuts obtained for cell 1 are lb3 − ub2 ≤ 58
and lb2 − ub2 ≥ 15. The two cuts obtained for cell 5 are lb6 − ub4 ≤ 21 and
ub6 − lb4 ≥ 39.
– Iteration 3. The current master problem is
6

min i=1 (ubi − lbi )
i ∈ 1, . . . , 6
s.to li ≤ lbi ≤ ai
ai ≤ ubi ≤ ui i ∈ 1, . . . , 6
ub1 ≥ 15
lb1 ≤ 5,
ub5 ≥ 21
lb5 ≤ 10,
lb3 − ub2 ≤ 58, lb2 − ub2 ≥ 15
lb6 − ub4 ≤ 21, ub6 − lb4 ≥ 39,
with solution lb = [5, 15, 20, 16, 10, 30] and ub = [15, 15, 30, 20, 21, 37]. Benders subproblems happen to be feasible with these values, thus we have

12

J. Castro and A. Via
6

an optimal solution of objective i=1 (ubi − lbi ) = 42. Since this table is
small, the original model was solved using some oﬀ-the-shelf optimization
solver, obtaining the same optimal objective function.
3. Auditing. Although this step is not needed with interval protection, the
problems (2) were solved to be sure that no attacker can determine that
$a_s \in (a_s - lpl_s, a_s + upl_s)$ for $s \in \{1, 5\}$, obtaining
$\underline{a}_1 = 5$, $\overline{a}_1 = 15$, $\underline{a}_5 = 10$ and $\overline{a}_5 = 21$. Therefore, it can be asserted that it is
safe to publish this solution.
4. Publication of the table. The final safe table to be published would be
\[
\begin{array}{ccc}
{[5, 15]} & 15 & {[20, 30]} \\
{[16, 20]} & {[10, 21]} & {[30, 37]}
\end{array}
\]
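
The arithmetic of the worked example can be checked with a few lines of code. The following is a minimal sketch in Python (the paper's prototype uses AMPL and Cplex; Python is used here only for illustration). It takes the lb and ub vectors reported for Iterations 2 and 3, recomputes the objective, checks the Iteration 1 cuts for the sensitive cells 1 and 5, and renders the published table. The function names and the 2 × 3 layout of cells 1 to 6 are assumptions made for this sketch.

```python
# Solution vectors copied from the worked example.
lb_iter2 = [5, 15, 25, 20, 10, 37]
ub_iter2 = [15, 15, 25, 20, 21, 37]
lb_final = [5, 15, 20, 16, 10, 30]
ub_final = [15, 15, 30, 20, 21, 37]

# Iteration 1 cuts for the sensitive cells (1-based cell index):
# lb_s must not exceed the first value, ub_s must reach the second.
sensitive_cuts = {1: (5, 15), 5: (10, 21)}


def objective(lb, ub):
    """Sum of interval widths, i.e. the master-problem objective."""
    return sum(u - l for l, u in zip(lb, ub))


def satisfies_sensitive_cuts(lb, ub, cuts):
    """Check lb_s <= cut_lb and ub_s >= cut_ub for every sensitive cell s."""
    return all(lb[s - 1] <= lo and ub[s - 1] >= hi
               for s, (lo, hi) in cuts.items())


def published_cells(lb, ub):
    """Show a cell as a single value if lb == ub, otherwise as an interval."""
    return [str(l) if l == u else f"[{l}, {u}]" for l, u in zip(lb, ub)]


def as_rows(cells, n_cols=3):
    """Assumed layout: cells 1..6 fill a table of 3 columns row by row."""
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


print("objective after iteration 2:", objective(lb_iter2, ub_iter2))
print("objective after iteration 3:", objective(lb_final, ub_final))
print("sensitive-cell cuts hold:",
      satisfies_sensitive_cuts(lb_final, ub_final, sensitive_cuts))
for row in as_rows(published_cells(lb_final, ub_final)):
    print("  ".join(row))
```

The check only confirms internal consistency of the reported numbers; it does not solve the master problem or the Benders subproblems.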

5 Computational Results

We developed a prototype implementation of the Benders algorithm for interval
protection using the AMPL modeling language [11] and Cplex for the master and
subproblems. We solved seven instances, whose dimensions are given in Table 1.
Columns n, |S| and m provide, respectively, the number of cells, sensitive cells
and table linear equations. Table “targus” is a general table, while the remaining
six tables are 1H2D tables (i.e., two-dimensional hierarchical tables with one
hierarchical variable) obtained with a generator used in the literature [1,8].
Table 1. Instance dimensions and results with Benders decomposition

Table      n      |S|    m      CPU      itB   itS        obj
Targus     162    13     63     5.17     31    8872       2142265.7
Table 1    121    10     55     3.41     26    7167       136924
Table 2    1680   158    299    410.53   43    1104884    43715149
Table 3    600    53     170    26.38    43    131834     3624906
Table 4    756    68     243    50.92    33    144963     9134139
Table 5    168    14     62     3.95     19    5959       303844
Table 6    1584   143    485    966.28   70    1729767    21302104

The results obtained with the Benders decomposition are provided in the last
columns of Table 1. Columns “CPU”, “itB”, “itS” and “obj” provide, respectively,
the total CPU time, number of Benders iterations, overall number of simplex
iterations, and the final optimal objective function obtained.
Table 2 provides results for the solution of the monolithic model (3) using
the Cplex default linear algorithm (dual simplex). Column “n.var” reports the number of variables of the resulting linear optimization problem. The meaning of
the remaining columns is the same as in Table 1. Three executions, clearly marked,
were aborted because the CPU time was excessive compared with the solution
by Benders; in those cases column “obj” provides the value of the objective
function when the algorithm was stopped. From these tables it is clear that the
solution of the monolithic model is impractical and that a standard implementation of Benders can be more efficient for some classes of problems (namely,
1H2D tables).

Table 2. Results using Cplex for monolithic model

Table      CPU          itS       n.var     obj
Targus     36.0515      16532     4212      2142265.7
Table 1    3.43548      7452      2420      136924
Table 2    2944.87^a    -         530880    16056608400
Table 3    522.875^a    -         63600     260592812
Table 4    11085.6      436895    102816    9134139
Table 5    10.6764      17325     4704      303844
Table 6    7816.61^a    -         453024    4404161015

^a Aborted due to excessive CPU time
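
To make the comparison between the two tables concrete, the short Python sketch below computes the ratio of monolithic CPU time to Benders CPU time for each instance, using only the figures reported in Tables 1 and 2. For the three aborted runs the ratio is a lower bound, since the monolithic solve had not finished. The sketch also checks that the reported "n.var" values coincide with 2·n·|S|; this is an observation about the reported numbers, not a formula stated in this section. The dictionary layout and names are assumptions made for the illustration.

```python
# name: (n, |S|, CPU Benders, CPU monolithic, n.var, monolithic run aborted?)
instances = {
    "Targus":  (162, 13, 5.17, 36.0515, 4212, False),
    "Table 1": (121, 10, 3.41, 3.43548, 2420, False),
    "Table 2": (1680, 158, 410.53, 2944.87, 530880, True),
    "Table 3": (600, 53, 26.38, 522.875, 63600, True),
    "Table 4": (756, 68, 50.92, 11085.6, 102816, False),
    "Table 5": (168, 14, 3.95, 10.6764, 4704, False),
    "Table 6": (1584, 143, 966.28, 7816.61, 453024, True),
}


def cpu_ratios(data):
    """Monolithic CPU time divided by Benders CPU time, per instance."""
    return {name: mono / benders
            for name, (_, _, benders, mono, _, _) in data.items()}


def nvar_matches(data):
    """True where the reported n.var equals 2 * n * |S| (observed pattern)."""
    return {name: 2 * n * s == nvar
            for name, (n, s, _, _, nvar, _) in data.items()}


ratios = cpu_ratios(instances)
for name, ratio in ratios.items():
    aborted = instances[name][5]
    prefix = ">= " if aborted else ""
    print(f"{name:8s} monolithic/Benders CPU ratio: {prefix}{ratio:.1f}")

print("n.var == 2*n*|S| for all instances:",
      all(nvar_matches(instances).values()))
```

On these figures the ratio is above 1 for every instance, although for the smallest 1H2D table (Table 1) the two approaches take essentially the same time.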

6 Conclusions

Partial cell suppression or interval protection can be an alternative method for
tabular data protection. Unlike other approaches, this method results in a huge
but continuous optimization problem, which can be effectively solved by linear
optimization algorithms. One of them is Benders decomposition: a prototype
code was able to solve some nontrivial tables more efficiently than state-of-the-art
solvers applied to the monolithic model. It is expected that a more sophisticated
implementation of Benders' algorithm would be able to solve even larger and
more complex tables. An additional and promising line of research would be
to consider highly efficient specialized interior-point methods for block-angular
problems [6,7]. This is part of the further work to be done.

References
1. Baena, D., Castro, J., González, J.A.: Fix-and-relax approaches for controlled tabular adjustment. Comput. Oper. Res. 58, 41–52 (2015)
2. Benders, J.F.: Partitioning procedures for solving mixed-variables programming
problems. Comput. Manag. Sci. 2, 3–19 (2005). English translation of the original
paper appeared in Numerische Mathematik 4, 238–252 (1962)
3. Castro, J.: Minimum-distance controlled perturbation methods for large-scale tabular data protection. Eur. J. Oper. Res. 171, 39–52 (2006)
4. Castro, J.: A shortest paths heuristic for statistical disclosure control in positive
tables. INFORMS J. Comput. 19, 520–533 (2007)
5. Castro, J.: Recent advances in optimization techniques for statistical tabular data
protection. Eur. J. Oper. Res. 216, 257–269 (2012)


6. Castro, J.: Interior-point solver for convex separable block-angular problems.
Optim. Methods Softw. 31, 88–109 (2016)
7. Castro, J., Cuesta, J.: Quadratic regularizations in an interior-point method for
primal block-angular problems. Math. Program. 130, 415–445 (2011)
8. Castro, J., Frangioni, A., Gentile, C.: Perspective reformulations of the CTA problem with L2 distances. Oper. Res. 62, 891–909 (2014)
9. Fischetti, M., Salazar, J.J.: Solving the cell suppression problem on tabular data
with linear constraints. Manag. Sci. 47, 1008–1026 (2001)
10. Fischetti, M., Salazar, J.J.: Partial cell suppression: a new methodology for statistical disclosure control. Stat. Comput. 13, 13–21 (2003)
11. Fourer, R., Gay, D.M., Kernighan, B.W.: AMPL: A Modeling Language for Mathematical Programming, 2nd edn. Thomson Brooks/Cole, Pacific Grove (2003)
12. Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Schulte-Nordholt,
E., Spicer, K., de Wolf, P.P.: Statistical Disclosure Control. Wiley, Chichester
(2012)
13. Wright, S.J.: Primal-Dual Interior-Point Methods. SIAM, Philadelphia (1997)

Precision Threshold and Noise: An Alternative
Framework of Sensitivity Measures
Darren Gray
Statistics Canada, Ottawa, Canada

Abstract. At many national statistical organizations, linear sensitivity
measures such as the prior-posterior and dominance rules provide the
basis for assessing statistical disclosure risk in tabular magnitude data.
However, these measures are not always well-suited for issues present in
survey data such as negative values, respondent waivers and sampling
weights. In order to address this gap, this paper introduces the Precision Threshold and Noise framework, defining a new class of sensitivity
measures. These measures expand upon existing theory by relaxing certain restrictions, providing a powerful, flexible and functional tool for
national statistical organizations in the assessment of disclosure risk.
Keywords: Statistical disclosure control · Linear sensitivity rules ·
Prior-posterior rule · pq rule · PTN sensitivity · Precision threshold ·
Noise

1 Introduction

Most, if not all, National Statistical Organizations (NSOs) are required by law to
protect the confidentiality of respondents and ensure that the information they
provide is protected against statistical disclosure. For tables of magnitude data
totals, established sensitivity rules such as the prior-posterior and dominance
rules (also referred to as the pq and nk rules) are frequently used to assess
disclosure risk. The status of a cell (with respect to these rules) can be assessed
using a linear sensitivity measure of the form
\[
S = \sum_{r} \alpha_r x_r \tag{1}
\]
for a non-negative non-ascending finite input variable $x_r$ (usually respondent
contributions) and non-ascending finite coefficients $\alpha_r$ (determined by the choice
of sensitivity rule). The cell is considered sensitive (i.e., at risk of disclosure) if
$S > 0$ and safe otherwise.¹
¹ Many NSOs have developed software to assess disclosure risk in tabular data; for
examples please see [3, 8]. For a detailed description of the prior-posterior and dominance rules, we refer the reader to [4]; Chap. 4 gives an in-depth description of the
rules, with examples. The expression of these rules as linear measures is given in [1]
and [7, Chap. 6].
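
As an illustration of how a linear sensitivity measure of the form (1) is evaluated, the following Python sketch computes S for a list of respondent contributions and flags the cell as sensitive when S > 0. The coefficient choice shown is an assumption made for this sketch: it uses a commonly cited form of the p% rule, with α_1 = p/100, α_2 = 0 and α_r = −1 for r ≥ 3, which is not taken from this section. Function names and the sample contributions are likewise hypothetical.

```python
def linear_sensitivity(contributions, alpha):
    """Evaluate S = sum_r alpha(r) * x_r with x sorted in non-ascending order."""
    x = sorted(contributions, reverse=True)
    return sum(alpha(r) * x_r for r, x_r in enumerate(x, start=1))


def p_percent_coefficients(p):
    """Assumed p% rule coefficients: p/100 for r = 1, 0 for r = 2, -1 after."""
    def alpha(r):
        if r == 1:
            return p / 100.0
        if r == 2:
            return 0.0
        return -1.0
    return alpha


def is_sensitive(contributions, alpha):
    """A cell is sensitive if S > 0 and safe otherwise."""
    return linear_sensitivity(contributions, alpha) > 0


alpha = p_percent_coefficients(10)

# Hypothetical cell dominated by its two largest contributors.
dominated = [100, 50, 5, 3]
# Hypothetical cell with more weight among the smaller contributors.
balanced = [100, 50, 20, 10]

for cell in (dominated, balanced):
    s = linear_sensitivity(cell, alpha)
    print(cell, "S =", s, "sensitive" if s > 0 else "safe")
```

Under these assumed coefficients the coefficients are non-ascending, as required above, and other rules would only change the `alpha` function passed in.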

© Springer International Publishing Switzerland 2016
J. Domingo-Ferrer and M. Pejić-Bach (Eds.): PSD 2016, LNCS 9867, pp. 15–27, 2016.
DOI: 10.1007/978-3-319-45381-1_2
