Security Methods for Statistical Databases
Introduction
§ Statistical databases containing medical information are often used for research
§ Some of the data is covered by laws that protect the privacy of the patient
§ Proper security precautions must be implemented to comply with these laws and respect the sensitivity of the data
Accuracy vs. Confidentiality
Accuracy – Researchers want to extract accurate and meaningful data
Confidentiality – Patients, laws and database administrators want to maintain the privacy of patients and the confidentiality of their information
Laws
§ Health Insurance Portability and Accountability Act – HIPAA (Privacy Rule)
§ Covered organizations must comply by April 14, 2003
§ Designed to improve efficiency of healthcare system by using electronic exchange of data and maintaining security
§ Covered entities (health plans, healthcare clearinghouses, healthcare providers) may not use or disclose protected information except as permitted or required
§ Privacy Rule establishes a “minimum necessary standard” for the purpose of making covered entities evaluate their current regulations and security precautions
HIPAA Compliance
§ Companies offer 3rd Party Certification of covered entities
§ Such companies will check your company and associated companies for compliance with HIPAA
§ Can help with rapid implementation of and compliance with HIPAA regulations
Types of Statistical Databases
§ Static – a static database is made once and never changes
§ Example: U.S. Census
§ Dynamic – changes continuously to reflect real-time data
§ Example: most online research databases
Security Methods
§ Access Restriction
§ Query Set Restriction
§ Microaggregation
§ Data Perturbation
§ Output Perturbation
§ Auditing
§ Random Sampling
Access Restriction
§ Databases normally have different access levels for different types of users
§ User IDs and passwords are the most common methods for restricting access
§ In a medical database:
  § Doctors/Healthcare Representatives – full access to information
  § Researchers – access to partial information only (e.g. aggregate information)
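The two access levels above can be sketched in a few lines of Python. This is only an illustration: the records, the role names and the `fetch` function are hypothetical and do not come from any real system.

```python
from statistics import mean

# Hypothetical patient records, for illustration only.
RECORDS = [
    {"patient": "A", "age": 34, "diagnosis": "asthma"},
    {"patient": "B", "age": 51, "diagnosis": "diabetes"},
    {"patient": "C", "age": 47, "diagnosis": "asthma"},
]

# Hypothetical mapping from user role to access level.
ACCESS_LEVELS = {"doctor": "full", "researcher": "aggregate"}


def fetch(role):
    """Return what the given role may see, or raise if the role has no access."""
    level = ACCESS_LEVELS.get(role)
    if level == "full":
        # Doctors see every individual record.
        return list(RECORDS)
    if level == "aggregate":
        # Researchers see only summary statistics, never individual rows.
        return {"count": len(RECORDS), "mean_age": mean(r["age"] for r in RECORDS)}
    raise PermissionError(f"role {role!r} has no access")
```

A real deployment would authenticate the user (for example with a user ID and password) before looking up the role; that step is left out here.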
Query Set Restriction
§ A query-set size control sets the number of records that must be in the result set
§ Allows the query results to be displayed only if the size of the query set satisfies the condition
§ Setting a minimum query-set size can help protect against the disclosure of individual data
Query Set Restriction
§ Let K represent the minimum number of records that must be present in the query set
§ Let R represent the size of the query set
§ The query set can only be displayed if K ≤ R
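A minimal sketch of the size check, assuming the records are held in a Python list and the query is expressed as a predicate function. The name `restricted_query` and the sample ages are hypothetical.

```python
def restricted_query(records, predicate, k):
    """Return the query set only if it holds at least k records (K <= R).

    Returns None when the query set is too small to be displayed.
    """
    query_set = [r for r in records if predicate(r)]
    r_size = len(query_set)  # R, the size of the query set
    if r_size < k:
        return None  # too few records: displaying them could expose an individual
    return query_set


ages = [23, 35, 41, 52, 58, 67]
print(restricted_query(ages, lambda a: a > 50, k=3))  # three records match
print(restricted_query(ages, lambda a: a > 60, k=3))  # only one record matches
```

The sketch enforces only the lower bound described on the slide; it does nothing about a user who combines the results of several permitted queries.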
Query Set Restriction
[Diagram: Query 1 and Query 2 are each run against the original database; each query's results are checked against K before the query results are displayed.]
Microaggregation
§ Raw (individual) data is grouped into small aggregates before publication
§ The average value of the group replaces each value of the individual
§ Data with the most similarities are grouped together to maintain data accuracy
§ Helps to prevent disclosure of individual data
Microaggregation
§ National Agricultural Statistics Service (NASS) publishes data about farms
§ To protect against data disclosure, data is only released at the county level
§ Farms in each county are averaged together to maintain as much purity as possible while still protecting against disclosure
Microaggregation
Age    Microaggregated Age
10     11.67
12     11.67
13     11.67
57     56.67
54     56.67
59     56.67

Average of 10, 12 and 13 = 11.67; average of 57, 54 and 59 = 56.67
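The age example can be reproduced with a short sketch. It assumes fixed-size groups formed by sorting the values, which is one simple way of grouping the most similar records; the function name `microaggregate` is hypothetical.

```python
def microaggregate(values, group_size):
    """Replace each value with the average of its group of similar values.

    Groups are formed by sorting the values and cutting the sorted order
    into runs of group_size. A short trailing run is folded into the
    previous group so that no group is smaller than group_size.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    starts = list(range(0, n, group_size))
    if len(starts) > 1 and n - starts[-1] < group_size:
        starts.pop()  # merge the short trailing run into the previous group
    result = [None] * n
    for idx, start in enumerate(starts):
        end = starts[idx + 1] if idx + 1 < len(starts) else n
        group = order[start:end]
        average = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = round(average, 2)
    return result


ages = [10, 12, 13, 57, 54, 59]
print(microaggregate(ages, 3))  # [11.67, 11.67, 11.67, 56.67, 56.67, 56.67]
```

Results are returned in the original record order, so each person's age is replaced in place by the group average.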
Microaggregation
[Diagram: the original data is averaged to produce the microaggregated data, which is what the user sees.]
Data Perturbation
§ Perturbed data is raw data with noise added
§ Pro: With perturbed databases, if unauthorized data is accessed, the true value is not disclosed
§ Con: Data perturbation runs the risk of presenting biased data
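A minimal sketch of data perturbation, assuming numeric values and zero-mean Gaussian noise. The choice of noise distribution, the parameter `sigma` and the function name `perturb_database` are assumptions made for illustration.

```python
import random


def perturb_database(values, sigma, seed=None):
    """Return a perturbed copy of the data: each value plus random noise.

    The perturbed copy is built once and stored; every user then queries
    the copy, never the original values.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


original = [50, 60, 70, 80]
perturbed = perturb_database(original, sigma=2.0, seed=1)
print(perturbed)
```

Because the noise has zero mean, averages over many records stay close to the true averages, but other statistics (for example, the variance) are inflated by the added noise, which is one source of the bias mentioned above.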
Data Perturbation
[Diagram: noise is added to the original database to produce the perturbed database, which User 1 and User 2 query.]
Output Perturbation
§ Instead of the raw data being transformed as in Data Perturbation, only the output or query results are perturbed
§ The bias problem is less severe than with data perturbation
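For comparison, a sketch of output perturbation for a mean query. The stored data is left untouched and noise is added only to the answer. The Gaussian noise, the parameter `sigma` and the function name `noisy_mean` are assumptions made for illustration.

```python
import random
from statistics import mean


def noisy_mean(values, sigma, rng=random):
    """Compute the true mean on the original data, then perturb only the result."""
    return mean(values) + rng.gauss(0.0, sigma)


ages = [50, 60, 70, 80]
print(noisy_mean(ages, sigma=1.0, rng=random.Random(7)))
```

If fresh noise is drawn for every query, a user who repeats the same query many times can average the answers and cancel most of the noise, so a practical system would need some way to limit that.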
Output Perturbation
[Diagram: User 1 and User 2 query the original database; noise is added to the query results before they are returned.]
Auditing
§ Auditing is the process of keeping track of all queries made by each user
§ Usually done with up-to-date logs
§ Each time a user issues a query, the log is checked to see if the user is querying the database maliciously
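One possible way to check a log for suspicious queries is sketched below. The rule used here, flagging a query whose query set differs from one of the same user's earlier query sets by fewer than `k` records, is an assumption chosen for illustration; the slides do not say which test is applied. The class name `QueryAuditor` is hypothetical.

```python
from collections import defaultdict


class QueryAuditor:
    """Keeps a per-user log of query sets and flags near-duplicate queries."""

    def __init__(self, k):
        self.k = k
        self.log = defaultdict(list)  # user -> list of frozensets of record ids

    def check_and_log(self, user, record_ids):
        """Log the query and return True if it looks safe, False if suspicious.

        A query is suspicious if it differs from an earlier query by the same
        user in fewer than k records (but is not an exact repeat), because
        subtracting the two answers could isolate those few records.
        """
        new = frozenset(record_ids)
        suspicious = any(0 < len(new ^ old) < self.k for old in self.log[user])
        self.log[user].append(new)  # every query is logged, allowed or not
        return not suspicious


auditor = QueryAuditor(k=2)
print(auditor.check_and_log("alice", {1, 2, 3, 4}))  # True: first query
print(auditor.check_and_log("alice", {1, 2, 3}))     # False: isolates record 4
```

The log grows with every query and each new query is compared against the user's full history, so the per-query overhead increases over time.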
Random Sampling
§ Only a sample of the records meeting the requirements of the query is shown
§ Must maintain consistency by giving exactly the same results to the same query
§ Weakness – Logically equivalent queries can result in a different query set
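One way to get a repeatable sample is to derive the keep-or-drop decision from a hash of the query text and the record id. This is an assumed technique used for illustration, not necessarily the one the slides have in mind; the function name `sampled_query_set` and the query strings are hypothetical.

```python
import hashlib


def sampled_query_set(record_ids, query_text, rate):
    """Keep roughly `rate` of the matching records, deterministically.

    The decision for each record depends only on the query text and the
    record id, so the same query always returns the same sample.
    """
    kept = []
    for rid in record_ids:
        digest = hashlib.sha256(f"{query_text}|{rid}".encode()).digest()
        fraction = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
        if fraction < rate:
            kept.append(rid)
    return kept


matching = list(range(1, 21))
first = sampled_query_set(matching, "age > 50 AND sex = 'F'", rate=0.5)
again = sampled_query_set(matching, "age > 50 AND sex = 'F'", rate=0.5)
other = sampled_query_set(matching, "sex = 'F' AND age > 50", rate=0.5)
print(first == again)  # True: same query text, same sample
print(first, other)    # the reordered query will generally give a different sample
```

The last line illustrates the weakness noted above: the two query strings are logically equivalent, but because their text differs they are sampled independently.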
Comparison Methods
The following criteria are used to determine the most effective methods of statistical database security:
§ Security – possibility of exact disclosure, partial disclosure, robustness
§ Richness of Information – amount of non-confidential information eliminated, bias, precision, consistency
§ Costs – initial implementation cost, processing overhead per query, user education
A Comparison of Methods
Method                  Security        Richness of Information   Costs
Query-set Restriction   Low             Low [1]                   Low
Microaggregation        Moderate        Moderate                  Moderate
Data Perturbation       High            High-Moderate             Low
Output Perturbation     Moderate        Moderate-Low              Low
Auditing                Moderate-Low    Moderate                  High
Sampling                Moderate        Moderate-Low              Moderate

[1] Quality is low because a lot of information can be eliminated if the query does not meet the requirements
Sources
§ Adam, Nabil R.; Wortmann, John C.; Security-Control Methods for Statistical Databases: A Comparative Study; ACM Computing Surveys, Vol. 21, No. 4, December 1989
§ Official HIPAA
§ Bernstein, Stephen W.; Impact of HIPAA on BioTech/Pharma Research: Rules of the Road
§ Service Bureau; 3rd Party Testing