Archive by Author

A Non-Interactive Algorithm for releasing accurate answers to m-sparse queries while preserving differential privacy (Rasul Tutunov)

8 Nov

On the lecture we saw the Sparse Multiplicative Weights (SMW) IDC Algorithm for releasing accurate answers to m-sparse queries while preserving differential privacy in the interactive setting (the data curator acts as an intermediary and must answer an adaptively chosen stream of queries as they arrive). Here I am going to describe another algorithm (proposed by Avrim Blum and Aaron Roth) for releasing accurate answers to m-sparse queries that works in the non-interactive setting, in which the data estimator must in one shot output a data-structure which encodes the answers to every query on interest. In this algorithm the entire computation must be performed in time polynomial in n, and the time required to evaluate any query on the output data structure must also be polynomial.

The non-interactive mechanism releases a random projection of the database into polynomially many dimensions, together with the corresponding projection matrix.Queries are evaluated by computing their projection using the public projection matrix , and then taking the inner product of the projected query and the projected database. The difficulty comes because of the projection matrix projects vectors from |X|-dimensional space to poly(n) dimensional space, and so normally would take  |X|poly(n) bits to represent. The algorithm are constrained to run in time poly(n), however, and so we need a concise representation of the projection matrix. This is achieved by using a matrix implicitly generated by a family of limited-independence hash functions which have concise representations. This requires using a limited independence version of the Johnson-Lindenstrauss  lemma, and of concentration bounds.

Before present the algorithm let me recall some basic concepts. A database is a multiset of elements from some (possibly) infinite abstract universe X.We write |D| = n  to denote the cardinality of D.For any x \in X we can also write D[x] to denote:D[x] = \{ x'\in D:x'=x\} the number of elements of type x in the database. Viewed this way, a database D\in N^{|X|} is a vector with integer entries in the range [0,n]. A linear query Q:X\to [0,1] is a function mapping elements in the universe to values on the real unit interval.The value of a linear query on a database is simply the average value of Q on elements of the database:

\frac{1}{n}\sum\limits_{x\in D}^{}Q(x) = \frac{1}{n}\sum\limits_{x\in X}^{}Q(x)D[x]

Similarly to how we think of a database as a vector, we can think of a query as a vector Q\in[0,1]^{|X|} with Q[x]. Viewed this way Q(D) = \frac{1}{n}\langle Q,D\rangle.

Definition 1 (Sparsity) The sparsity of a linear query Q is |\{x\in X:Q(x) > 0\}|, the number of elements in the universe on which it takes a non-zero value.We say that a query is m-sparse if its sparsity is at most m.We will assume that given an m-sparse query we ca quickly (in the polynomial in m) enumerate the elements x\in X on which Q(x) > 0.

Definition 2 (Accuracy for non-Interactive Mechanism). Let \mathcal{Q} be a set of queries. A non-interactive mechanism M:X*\to R for some abstract range R is (\alpha,\beta)-accurate for  \mathcal{Q} if there exist a function Eval:\mathcal{Q}\times R\to \mathbb{R} s.t. for every database D\in X*, with probability at least 1-\beta over the coins of M, M(D) outputs r\in R such that max_{Q\in\mathcal{Q}}|Q(D) - Eval(Q,r)| \ge\alpha.

Definition 3 (Neighboring databases) Two databases D and D’ are neighbors if they differ only in the data of a single individual: i. e. symmetric difference |D\vartriangle D'|\ge 1

Definition 4 (Differential Privacy) A randomized algorithm M acting on databases and outputting elements from some abstract range R is (\epsilon, \delta) differentially private if for all pairs of neighboring databases D,D’ and for all subsets of the range S\subseteq R the following holds:

Pr[M(D)\in S] \ge exp(\epsilon)Pr[M(D')\in S] + \delta.

Definition 5(The Laplace Distribution). The Laplace Distribution (centered at 0) with scale b is the distribution with probability density function: Lap(x|b) = \frac{1}{2b}exp(-\frac{|x|}{b}).

Definition 6(Random Projection Data Structure). The random projection data structure \mathcal{D}_{r} of size T is composed of two parts: we write \mathcal{D}_{r} = (u,f).

1. u\in \mathbb{R}^{T} is a vector of length T

2.f:[|X|\cdot T] \to \{-\frac{1}{\sqrt{T}}, \frac{1}{\sqrt{T}} \} is a hash function implicitly representing a T\times|X| projection matrix A\in \{-\frac{1}{\sqrt{T}}, \frac{1}{\sqrt{T}} \}^{T\times |X|}. For any (i,j)\in T\times |X|, we write A[i,j] for f:(|X|\cdot (i-1) + j).

To evaluate a linear query Q on a random projection data structure \mathcal{D}_{r} = (u,f) we first project the query and then evaluate the projected query. To project the query we compute a vector \hat{Q}\in \mathbb{R}^T has follows. For each i\in [T]

\hat{Q}[i] = \sum\limits_{x\in X:Q(x)>0}^{}Q[x]\cdot A[i,x]

then we output: Q(\mathcal{D}_{r}) =\frac{1}{n}\langle\hat{Q},u\rangle.

ALGORITHM (SparseProject takes as input a private database D of size n, privacy parameter \epsilon and \delta, a confidence parameter \beta, a sparsity parameter m and the size of the target query class k):

Sparse project(D, \epsilon,\delta, \beta, m, k)

Let \tau \gets \frac{\beta}{4k},T\gets 4\cdot 64^2\cdot log(\frac{1}{\tau})\Big(\frac{m^{3/2}}{2} + \frac{n^4}{2\sqrt{m}} +\sqrt{m}n^2\Big), \sigma \gets \frac{\epsilon}{\sqrt{8ln(1/\delta)}}.

Let f be a randomly chose hash function from a family of 2log(kT/2\beta)-wise independent hash functions mapping [T\times|X|] \to \{-1/\sqrt{T}, 1/\sqrt{T}\}. Write A[i,j] to denote f:(|X|\cdot (i-1) + j).

Let u,v \in \mathbb{R}^T be vectors of length T

for i=1 to T do

  • Let u_i \gets \sum\limits_{x:D[x]>0}^{max}D[x]\cdot A[i,x
  • Let v_i\gets Lap(1/\sigma)

end for

Output \mathcal{D}_{r} = (u + v, f)

Remark: There are various ways to select a hash function from a family of r-wise independent hash functions mapping [T\times|X|] \to \{0,1\}. The simplest, and one that suffice for our purposes, is to select the smallest integer s  such that 2^s \le T\times |X|, and then to let f be a random degree r polynomial in the finite field \mathbb{GF}[2^s].Selecting and representing such a function takes time space O(r,s) = O(r(log|X| + logT)). f is then unbiased r-wise independent hash function mapping \mathbb{GF}[2^s] \to\mathbb{GF}[2^s].Taking only the last output bit gives an unbiased r-wise independent hash function mapping [T\times|X|] \to \{0,1\}.

Below I provide basic theorems about SparseProject Algorithm. (All proofs you can find in Fast Private Data Release Algorithms for Sparse Queries (Avrim Blum and Aaron Roth, 2011))

Theorem 1 Sparse Project is (\epsilon, \delta)-differentially private.

Theorem 2For any 0<\epsilon, \delta < 1 and any \beta < 1, and with respect to any class of m-sparse linear queries \mathcal{Q}\subseteq\mathcal{Q}_m of cardinality |\mathcal{Q}|\ge k, SparseProject is (\alpha,\beta)-accurate for:

\alpha = \tilde{O}\Big(log(\frac{k}{\beta})\frac{\sqrt{mlog(\frac{1}{\delta})}}{\epsilon n}\Big) 

where the \tilde{O} hides a term logarithmic in (m+n).