Algorithmic Foundations of Differential Privacy [WIP]

Intro

  • requirement: independent of the presence of an individual in the database, any repsonses to queries must be equally likely to occur. Paper argues that this is an acceptable definition of privacy.
  • linkage attack: match "anonymized" records with non-anonymized records in a different dataset by leveraging combinations of features (e.g. Narayanan and Shmatikov carried out a linkage attack against anonymized ranking data published by Netflix)
  • differencing attack: taken together, the answers to the two large queries "how many people in the database have the sickle cell trait" and "how many people, not named X, in the database have the sickle cell trait" yield the sickle cell status of Mr. X. Thus, summary statistics fail as a privacy solution.

Randomized response

Study participants are tolld to report whether or not they have property P as follows:

  1. flip a coin

  2. if tails, then respond truthfully

  3. if heads, then flip second coin and respond "Yes" if heads and "No" if tails

Privacy comes from the plausible deniability of any outcome. We can still get an accurate statistic: expected number of "Yes" is 1/4Pˉ+3/4P1/4 \lvert \bar P \rvert + 3/4 \lvert P\rvert.

If there's two databses differing only in the value of a single row, on which the same query yields differnet outputs, an adversary could recover the value of the data in the unknown row.

If we denote xix_i the number of records of type ii in database xx, the distance between databases x,yx,y is defined as:

xy=ixiyi\lVert x - y \rVert = \sum_i \lvert x_i - y_i\rvert Definition of differential privacy

A randomized algorithm MM is (ϵ,δ)(\epsilon, \delta)-differentially private if for all SRange(M)S\subseteq \text{Range}(M) (possible outputs of MM) and for all databases x,yx,y such that xy1\lVert x - y \rVert \leq 1 (i.e. differing only in one record):

Pr(M(x)S)exp(ϵ)Pr(M(y)S)+δPr(M(x)\in S) \leq \exp(\epsilon)Pr(M(y)\in S) + \delta

Note that PrPr denotes the probability mass (can be more than 11). Since this definition is symmetrical in xx and yy, Pr(M(y)S)Pr(M(y)\in S) is also bounded by Pr(M(x)S)Pr(M(x)\in S) and we have that Pr(M(x)S)[exp(ϵ)Pr(M(y)S),exp(ϵ)Pr(M(y)S)]Pr(M(x)\in S)\in [\exp(-\epsilon)Pr(M(y)\in S), \exp(\epsilon)Pr(M(y)\in S)]. As ϵ\epsilon decreases towards 00, this bound gets tighter around Pr(M(y)S)Pr(M(y)\in S), meaning the outputs must not differ by a lot, independent on whether the record is present or absent from the database.

Reading list:

https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

deep learning with differential privacy: https://arxiv.org/abs/1607.00133