Intro
- requirement: independent of the presence of an individual in the database, any responses to queries must be (almost) equally likely to occur. The paper argues that this is an acceptable definition of privacy.
- linkage attack: match "anonymized" records with non-anonymized records in a different dataset by leveraging combinations of features (e.g. Narayanan and Shmatikov carried out a linkage attack against anonymized ranking data published by Netflix)
- differencing attack: taken together, the answers to the two large queries "how many people in the database have the sickle cell trait" and "how many people, not named X, in the database have the sickle cell trait" yield the sickle cell status of Mr. X. Thus, summary statistics fail as a privacy solution.
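The differencing attack above can be sketched in a few lines of Python; the toy database and helper name are illustrative, not from the paper:

```python
# Hypothetical sketch of a differencing attack: two "large" aggregate
# queries that differ only in whether Mr. X is included reveal his record.
records = [
    {"name": "Alice", "sickle_cell": True},
    {"name": "Bob", "sickle_cell": False},
    {"name": "X", "sickle_cell": True},
]

def count_with_trait(db, exclude_name=None):
    """Summary statistic: how many people (optionally excluding one name) have the trait."""
    return sum(r["sickle_cell"] for r in db if r["name"] != exclude_name)

total = count_with_trait(records)                 # "how many have the trait"
total_without_x = count_with_trait(records, "X")  # "how many, not named X, have the trait"
x_has_trait = (total - total_without_x) == 1      # Mr. X's status recovered
```

Both queries are innocuous-looking summary statistics, yet their difference is exactly one individual's record.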
Randomized response
Study participants are told to report whether or not they have property P as follows:
- flip a coin
- if tails, respond truthfully
- if heads, flip a second coin and respond "Yes" if heads and "No" if tails
Privacy comes from the plausible deniability of any individual answer. We can still get an accurate statistic: the expected number of "Yes" answers is (1/4)∣P̄∣ + (3/4)∣P∣, where P̄ denotes the participants without property P.
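A minimal simulation of the protocol, assuming (for illustration) that 10% of participants truly have property P. Since E[#Yes] = n/4 + ∣P∣/2, the estimator ∣P∣ ≈ 2·(#Yes) − n/2 is unbiased:

```python
import random

def randomized_response(has_property: bool) -> bool:
    """Report on property P using the two-coin-flip protocol from the notes."""
    if random.random() < 0.5:         # first coin: tails -> answer truthfully
        return has_property
    return random.random() < 0.5      # heads -> second coin decides the answer

def estimate_true_count(yes_count: int, n: int) -> float:
    """Invert E[#Yes] = n/4 + |P|/2 to recover |P| from the noisy answers."""
    return 2 * yes_count - n / 2

random.seed(0)
n = 100_000
true_members = n // 10  # illustrative assumption: 10% have property P
answers = [randomized_response(i < true_members) for i in range(n)]
est = estimate_true_count(sum(answers), n)
```

Each individual report is noisy, but the aggregate estimate concentrates around the true count as n grows.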
If there are two databases differing only in the value of a single row, on which the same query yields different outputs, an adversary could recover the value of the data in the unknown row.
If we denote by xᵢ the number of records of type i in database x, the distance between databases x, y is the ℓ1 norm:
∥x−y∥₁ = ∑ᵢ ∣xᵢ−yᵢ∣
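Viewing a database as a histogram over record types, this distance is easy to compute; a small sketch (the `l1_distance` name is my own):

```python
from collections import Counter

def l1_distance(x, y):
    """l1 distance between two databases given as lists of records."""
    cx, cy = Counter(x), Counter(y)
    # Counter returns 0 for missing keys, so iterating the union of types works.
    return sum(abs(cx[t] - cy[t]) for t in set(cx) | set(cy))

# Neighboring databases: y is x with one record removed.
x = ["a", "a", "b", "c"]
y = ["a", "b", "c"]
distance = l1_distance(x, y)  # distance 1: the pair qualifies for the DP definition below
```

Two databases at distance at most 1 are exactly the "neighboring" pairs the differential privacy definition quantifies over.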
Definition of differential privacy
A randomized algorithm M is (ϵ,δ)-differentially private if for all S⊆Range(M) (possible outputs of M) and for all databases x, y such that ∥x−y∥₁ ≤ 1 (i.e. differing in at most one record):
Pr(M(x)∈S)≤exp(ϵ)Pr(M(y)∈S)+δ
Note that the right-hand side exp(ϵ)Pr(M(y)∈S)+δ is a bound, not a probability, and can exceed 1. Since this definition is symmetric in x and y, Pr(M(y)∈S) is likewise bounded in terms of Pr(M(x)∈S), and for δ = 0 we get Pr(M(x)∈S) ∈ [exp(−ϵ)Pr(M(y)∈S), exp(ϵ)Pr(M(y)∈S)]. As ϵ decreases towards 0, this interval tightens around Pr(M(y)∈S), meaning the output distribution must be nearly the same independent of whether the record is present in or absent from the database.
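As a concrete check, the randomized response protocol from the earlier section satisfies this definition with ϵ = ln 3 and δ = 0: a participant with property P answers "Yes" with probability 3/4, one without with probability 1/4, so no output probability ratio between neighboring inputs exceeds 3. A quick numeric verification:

```python
import math

# Output probabilities of randomized response for a single participant
# (tails -> truthful answer contributes 1/2; heads -> second coin contributes 1/4).
p_yes_given_P = 0.75
p_yes_given_notP = 0.25
p_no_given_P = 0.25
p_no_given_notP = 0.75

# The worst-case log-ratio over outputs gives the epsilon (with delta = 0).
eps = max(
    math.log(p_yes_given_P / p_yes_given_notP),
    math.log(p_no_given_notP / p_no_given_P),
)
# eps = ln 3 ≈ 1.0986: randomized response is (ln 3, 0)-differentially private.
```

Either output ("Yes" or "No") is at most 3 times more likely under one value of the record than the other, which is exactly the exp(ϵ) bound with ϵ = ln 3.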
Reading list:
https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
deep learning with differential privacy: https://arxiv.org/abs/1607.00133