Timothy Delille

Algorithmic Foundations of Differential Privacy [WIP]

Intro

requirement: independent of the presence of an individual in the database, any repsonses to queries must be equally likely to occur. Paper argues that this is an acceptable definition of privacy.
linkage attack: match "anonymized" records with non-anonymized records in a different dataset by leveraging combinations of features (e.g. Narayanan and Shmatikov carried out a linkage attack against anonymized ranking data published by Netflix)
differencing attack: taken together, the answers to the two large queries "how many people in the database have the sickle cell trait" and "how many people, not named X, in the database have the sickle cell trait" yield the sickle cell status of Mr. X. Thus, summary statistics fail as a privacy solution.

Randomized response

Study participants are tolld to report whether or not they have property P as follows:

flip a coin
if tails, then respond truthfully
if heads, then flip second coin and respond "Yes" if heads and "No" if tails

Privacy comes from the plausible deniability of any outcome. We can still get an accurate statistic: expected number of "Yes" is $1/4 \lvert \bar P \rvert + 3/4 \lvert P\rvert$ .

If there's two databses differing only in the value of a single row, on which the same query yields differnet outputs, an adversary could recover the value of the data in the unknown row.

If we denote $x_i$ the number of records of type $i$ in database $x$ , the distance between databases $x,y$ is defined as:

$\lVert x - y \rVert = \sum_i \lvert x_i - y_i\rvert$ Definition of differential privacy

A randomized algorithm $M$ is $(\epsilon, \delta)$ -differentially private if for all $S\subseteq \text{Range}(M)$ (possible outputs of $M$ ) and for all databases $x,y$ such that $\lVert x - y \rVert \leq 1$ (i.e. differing only in one record):

$Pr(M(x)\in S) \leq \exp(\epsilon)Pr(M(y)\in S) + \delta$

Note that $Pr$ denotes the probability mass (can be more than $1$ ). Since this definition is symmetrical in $x$ and $y$ , $Pr(M(y)\in S)$ is also bounded by $Pr(M(x)\in S)$ and we have that $Pr(M(x)\in S)\in [\exp(-\epsilon)Pr(M(y)\in S), \exp(\epsilon)Pr(M(y)\in S)]$ . As $\epsilon$ decreases towards $0$ , this bound gets tighter around $Pr(M(y)\in S)$ , meaning the outputs must not differ by a lot, independent on whether the record is present or absent from the database.

Reading list:

https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

deep learning with differential privacy: https://arxiv.org/abs/1607.00133

links

x.com/timothydelille

linkedin.com/in/timothydelille

github.com/timothydelille