Intro
Randomized response
Study participants are told to report whether or not they have property P as follows:
flip a coin
if tails, then respond truthfully
if heads, then flip a second coin and respond "Yes" if heads and "No" if tails
Privacy comes from the plausible deniability of any individual answer. We can still get an accurate statistic: a participant with property $P$ answers "Yes" with probability $1/2 + 1/2 \cdot 1/2 = 3/4$, and one without it with probability $1/2 \cdot 1/2 = 1/4$, so the expected number of "Yes" answers is $1/4 \lvert \bar P \rvert + 3/4 \lvert P\rvert$.
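A minimal simulation of the protocol above (function names and the unbiased estimator that inverts the expected-"Yes" formula are mine):

```python
import random

def randomized_response(has_property: bool, rng: random.Random) -> bool:
    """One round of the coin-flip protocol."""
    if rng.random() < 0.5:       # first flip: tails -> answer truthfully
        return has_property
    return rng.random() < 0.5    # first flip: heads -> second flip decides

def estimate_fraction(responses: list) -> float:
    """Invert E[fraction of "Yes"] = 3/4 p + 1/4 (1 - p) = 1/4 + p/2."""
    p_yes = sum(responses) / len(responses)
    return 2 * (p_yes - 0.25)

rng = random.Random(0)
true_p = 0.3  # hypothetical true fraction of participants with property P
responses = [randomized_response(rng.random() < true_p, rng)
             for _ in range(100_000)]
print(round(estimate_fraction(responses), 2))
```

With enough respondents the estimate concentrates around the true fraction, even though no individual answer can be trusted.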
If two databases differ only in the value of a single row, and the same query yields different outputs on them, an adversary could recover the value of the data in the unknown row.
If we denote by $x_i$ the number of records of type $i$ in database $x$, the distance between databases $x,y$ is defined as:
$\lVert x - y \rVert = \sum_i \lvert x_i - y_i\rvert$

Definition of differential privacy
A randomized algorithm $M$ is $(\epsilon, \delta)$-differentially private if for all $S\subseteq \text{Range}(M)$ (possible outputs of $M$) and for all databases $x,y$ such that $\lVert x - y \rVert \leq 1$ (i.e. differing only in one record):
$Pr(M(x)\in S) \leq \exp(\epsilon)Pr(M(y)\in S) + \delta$
Note that the right-hand side, $\exp(\epsilon)Pr(M(y)\in S) + \delta$, is only a bound and can exceed $1$. Since the definition is symmetric in $x$ and $y$, we also have $Pr(M(y)\in S) \leq \exp(\epsilon)Pr(M(x)\in S) + \delta$, and when $\delta = 0$ the two directions combine into $Pr(M(x)\in S)\in [\exp(-\epsilon)Pr(M(y)\in S), \exp(\epsilon)Pr(M(y)\in S)]$. As $\epsilon$ decreases towards $0$, this interval tightens around $Pr(M(y)\in S)$: the output distribution must be nearly the same whether the record is present in or absent from the database.
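Randomized response itself is a differentially private mechanism: for a single respondent, the two neighboring "databases" are the two values of their bit, and the protocol's output probabilities are $1/4$ and $3/4$. A quick numerical check of the definition (with $\delta = 0$; the probabilities are taken from the coin-flip protocol above):

```python
import math

# P(answer "Yes" | true bit): 3/4 if the respondent has property P,
# 1/4 otherwise, per the coin-flip protocol.
p_yes = {True: 0.75, False: 0.25}
p_no = {b: 1 - p_yes[b] for b in (True, False)}

# Worst-case log-ratio of output probabilities between the two
# neighboring databases (the two values of the single differing bit):
eps = max(
    abs(math.log(p_yes[True] / p_yes[False])),
    abs(math.log(p_no[True] / p_no[False])),
)
print(eps)  # ~ ln(3): randomized response satisfies (ln 3, 0)-DP
```

The worst-case ratio over all outputs is $3$, so the smallest $\epsilon$ for which the definition holds with $\delta = 0$ is $\ln 3 \approx 1.0986$.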
Reading list:
Robust De-anonymization of Large Sparse Datasets (Narayanan & Shmatikov, the Netflix Prize de-anonymization paper): https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
Deep Learning with Differential Privacy (Abadi et al.): https://arxiv.org/abs/1607.00133