Read more here: http://www.nytimes.com/2011/11/30/technology/facebook-agrees-to-ftc-settlement-on-privacy.html

Algorithm 1: Unweighted Vertex Cover [1]

let G_1 = G = (V, E)

for t = 1 to n do

let d_t(v) denote the degree of v in G_t

pick a vertex v_t with probability proportional to its adjusted degree (its real degree in G_t plus a round-dependent weight)

output v_t. let G_{t+1} = G_t \ {v_t}

end for

As stated, the algorithm outputs a sequence of vertices, one per iteration. As remarked above, this permutation defines a vertex cover: each edge is covered by its earlier-occurring endpoint.

There is a slightly different way to implement the intuition behind the above algorithm: imagine adding “hallucinated” edges to each vertex (the other endpoints of these hallucinated edges being fresh “hallucinated” vertices), and then sampling vertices without replacement proportional to these altered degrees. However, once (say) vertices have been sampled, we output the remaining vertices in random order. More specifically, given a graph , we mimic the randomized proportional-to-degree algorithm for rounds where , and output the remaining vertices in random order. That is, in each of the first rounds, we select the next vertex with probability proportional to : this is equivalent to imagining that each vertex has “hallucinated” edges in addition to its real edges. When we select a vertex, we remove it from the graph, together with the real and hallucinated edges adjacent to it. This is equivalent to picking a random (real or hallucinated) edge from the graph, and outputting a random real endpoint. Outputting a vertex affects the real edges in the remaining graph, but does not change the hallucinated edges incident to other vertices.
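To make the sampling process concrete, here is a minimal Python sketch of the hallucinated-edges variant. The helper names, the constant w = 1, and running the sampling for all n rounds are illustrative choices; the actual weights and cutoff come from the privacy analysis in [1].

```python
import random

def private_vertex_cover_order(vertices, edges, w, rng=random.Random(0)):
    """Sample a vertex ordering: at each step pick a remaining vertex with
    probability proportional to its current real degree plus w hallucinated
    edges, then delete it together with its incident real edges."""
    remaining = set(vertices)
    live = {frozenset(e) for e in edges}
    order = []
    while remaining:
        pool = list(remaining)
        weights = [sum(1 for e in live if v in e) + w for v in pool]
        v = rng.choices(pool, weights=weights, k=1)[0]
        order.append(v)
        remaining.discard(v)
        live = {e for e in live if v not in e}
    return order

def cover_from_order(order, edges):
    """Each edge is covered by whichever endpoint appears earlier."""
    pos = {v: i for i, v in enumerate(order)}
    return {min(e, key=lambda v: pos[v]) for e in edges}

order = private_vertex_cover_order([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)], w=1)
cover = cover_from_order(order, [(1, 2), (2, 3), (3, 4)])
```

By construction the output permutation always induces a valid vertex cover; the weight w only affects how private (and how large) that cover is.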

The privacy analysis is similar to that of Theorem 5.1 of [1]: if we assume the weights are for the first rounds and for the remaining rounds, we obtain ε-differential privacy.

To analyze the utility, we couple the algorithm with a run of the 2-approximation non-private algorithm that at each step picks an arbitrary edge of the graph and then picks a random endpoint. We refer to vertices that have non-zero “real” degree at the time they are selected by our algorithm as interesting vertices: the cost of our algorithm is simply the number of interesting vertices it selects in the course of its run. Let denote the number of interesting vertices it selects during the first steps, and denote the number of interesting vertices it selects during its remaining steps, when it is simply ordering vertices randomly. Clearly, the total cost is . We may view the first phase of our algorithm as selecting an edge at random (from among both real and hallucinated ones) and then outputting one of its endpoints at random. Now, for the rounds in which our algorithm selects a real edge, we can couple this selection with one step of an imagined run of (selecting the same edge and endpoint). Note that this run of maintains a vertex cover that is a subset of our vertex cover, and that once our algorithm has completed a vertex cover, no interesting vertices remain. Therefore, while our algorithm continues to incur cost, has not yet found a vertex cover. In the first phase of our algorithm, every interesting vertex our algorithm selects has at least one real edge adjacent to it, as well as hallucinated edges. Conditioned on selecting an interesting vertex, our algorithm had selected a real edge with probability at least . Let denote the random variable that represents the number of steps is run for. since is a 2-approximation algorithm. By linearity of expectation:
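The coupled non-private algorithm (Pitt's randomized 2-approximation [2]) is short enough to sketch; picking the uncovered edge uniformly at random is one way to instantiate “an arbitrary edge”:

```python
import random

def pitt_vertex_cover(edges, rng=random.Random(0)):
    """While an uncovered edge remains, pick one arbitrarily and add a
    uniformly random endpoint to the cover; the expected cost is at most
    twice the optimum."""
    cover = set()
    live = [tuple(e) for e in edges]
    while live:
        u, v = rng.choice(live)
        cover.add(rng.choice((u, v)))
        # Drop every edge now covered by the chosen vertex.
        live = [e for e in live if e[0] not in cover and e[1] not in cover]
    return cover

cover = pitt_vertex_cover([(1, 2), (1, 3), (1, 4)])
```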

Gupta et al. in [1] show that most of the algorithm’s cost comes from the first phase, and hence that is not much larger than :

Combining these facts, we get that:

[1] A. Gupta, K. Ligett, F. McSherry, A. Roth, and K. Talwar. Differentially private combinatorial optimization. Nov 2009

[2] L. Pitt. A simple probabilistic approximation algorithm for vertex cover. Technical report, Yale University, 1985

private queries. This post is in two parts: the first deals with the Fuzz programming language and how it certifies programs as differentially private through the use of a novel statically checked type system, and the second with a runtime that takes care of some of the real-world problems that arise when we try to achieve differential privacy in practice.

“[A type system is a] tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute” [3]. Importantly, a sound type system provides a mathematical proof of the absence of those program behaviors. The type systems most programmers are familiar with are fairly conservative — for example, the Java programming language has a type system that certifies (among other things) that the return value of a method is guaranteed to be a subtype of the reference that it is assigned to. We can also have stronger type systems that check more complex properties, but they may run into issues such as the properties being computationally hard or even undecidable.

Fuzz is a functional programming language with a static type system. The high-level goal is to have the type system check whether a given program is differentially private. This type system captures a notion of function sensitivity — “A function is said to be c-sensitive if it can magnify distances between inputs by a factor of at most c” [1]. Thus the Fuzz type checker can derive an upper bound on the sensitivity of a given program with respect to its input database. Once we know the sensitivity of the function, we can use the Laplace mechanism to add a sufficient amount of noise so that the output of the program is differentially private.
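As an illustration of that last step, here is a generic sketch of the Laplace mechanism (the standard textbook mechanism, not Fuzz's internal implementation; the function names are illustrative):

```python
import math
import random

def sample_laplace(scale, rng):
    # Inverse-transform sampling from the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random.Random(0)):
    """Release a c-sensitive answer with Laplace noise of scale c/epsilon,
    which suffices for epsilon-differential privacy."""
    return true_answer + sample_laplace(sensitivity / epsilon, rng)

noisy = laplace_mechanism(true_answer=42.0, sensitivity=2.0, epsilon=0.5)
```

Note the direct trade-off: a larger sensitivity bound or a smaller epsilon both increase the noise scale, which is why a tight static bound on sensitivity matters.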

In the general case, calculating the sensitivity of an arbitrary program is non-trivial, so the Fuzz language employs a very restricted set of primitives and operators for which static checking is tractable. The sensitivities of these primitives are proven in [1], and the type system is proved to be sound. This means that when a type checker admits a program as having at most sensitivity c, we have a _proof_ of this property [4]. Importantly, these primitives are compositional — if f(x) is 2-sensitive and g(y) is 3-sensitive, we can say that g(f(x)) is 6-sensitive, which makes type checking tractable.
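The compositional bound is easy to spot-check numerically. In this illustrative sketch, f is 2-sensitive and g is 3-sensitive, and we verify the 6-sensitivity of their composition on a few sample pairs (a spot check on chosen inputs, not the static proof the type system provides):

```python
def f(x):
    # A 2-sensitive function: |f(x) - f(y)| <= 2|x - y|.
    return 2 * x

def g(y):
    # A 3-sensitive function.
    return 3 * y + 1

def respects_sensitivity(h, c, pairs):
    """Check |h(x) - h(y)| <= c|x - y| on the given pairs."""
    return all(abs(h(x) - h(y)) <= c * abs(x - y) for x, y in pairs)

pairs = [(0.0, 1.0), (-2.5, 3.0), (10.0, 10.5)]
composed_ok = respects_sensitivity(lambda x: g(f(x)), 6, pairs)  # 2 * 3 = 6
too_tight = respects_sensitivity(lambda x: g(f(x)), 5, [(0.0, 1.0)])
```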

Given a type system for Fuzz and a working type checker, we thus have a system that accepts programs and certifies them as differentially private. Since the type checker has to be decidable, we only admit a subset of all differentially private programs as well-typed in Fuzz. This is a tradeoff that has to be made in all type systems — for example, observe that the Java program

    if (complex expression that always evaluates to true)
        return 1;
    else
        return false;

is not well-typed — but if run it will always return an integer — so we need not reject it. However, the type checker cannot always perform the sort of computation needed to prove that the complex expression in the if-clause always takes the then-branch and never the else-branch, so it conservatively rejects the program as ill-typed. Similarly, the Fuzz type system is unable to certify all differentially private programs as such.

Thus we have a few primitives in Fuzz whose sensitivities are verified, which enables the programmer to construct larger programs that the static type checker can verify as differentially private. In practice Fuzz is in fact expressive enough to be practically useful, and there are several examples given in [2].

Thus we have a programming language and a type system that certifies programs as differentially private. However, our programming language lives in a restricted functional world with no side effects, and unfortunately this translates poorly to the real world, where we can write programs in the vein of:

    if victim.reallysecretproperty == true
    then { calculate the value of pi to a billion digits; return 1 }
    else { return 1 }

In terms of its output, the above code has sensitivity zero. But to an adversary, the secret bit of information is leaked by simply observing how long the program takes to run. We cannot hope to capture this property statically with ease — we would have to ensure that every program takes _exactly_ as long to run regardless of code branches, and then we would still have to deal with memory effects, the fact that different rows of the database occupy different cache lines (and thus have different cache-eviction behavior that might be visible to the adversary), and programs that do not halt.

The Fuzz runtime employs a very restricted runtime environment to prevent the leakage of information through these side channels. Remember, we cannot simply mitigate the flow — even a single bit of leakage violates our differential privacy guarantees. The usefulness of the type system and its guarantees now becomes clear: because the type system performs its analysis on the _structure_ of the database and the query without looking at the contents of the database, any conclusion it reaches and reveals to the user is always differentially private. Thus we avoid the side-channel concerns of dynamically typed systems: if our analyzer needed to look at the contents of the database, certifying that any outputs and side effects are differentially private would require a more complex proof.

However, we are still left to close the gap between the side-effect-free functional world and the real-world runtime environment. The three main side channels are privacy budget attacks, state attacks, and timing attacks. The privacy budget attack takes the form:

    if victim.reallysecretproperty == true
    then { do something that uses up the entire privacy budget }
    else { do nothing }

Thus by simply observing the privacy budget after executing the query, we reveal a bit of secret information. This is fixed by disallowing nested queries, which we can check statically at the compiler level. The state attack is solved by not having any global variables, which comes for free given that we have a functional programming language.
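One way to picture the budget side of this is a tracker that charges a fixed, data-independent amount per query before running it, so consumption can never depend on private data. This is a hypothetical sketch, not the Fuzz API:

```python
class PrivacyBudget:
    """Track cumulative epsilon; refuse queries that would overspend.
    The charge is fixed per query up front, so budget consumption never
    depends on the contents of the database."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def charge(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

budget = PrivacyBudget(1.0)
budget.charge(0.4)  # first query
budget.charge(0.4)  # second query; 0.2 of budget remains
```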

The timing channel is the hardest to close. To do so, the Fuzz runtime employs a high-level strategy of splitting primitives that run over the entire database into “microqueries” — each microquery only looks at a single row of the database — and the results are then aggregated together. Microqueries are then run using a new primitive called “predictable transactions”, which takes an exact amount of time to run, down to the microsecond, regardless of the output value. Predictable transactions are implemented at the runtime level: the runtime takes a default value and a timeout, and executes the code for the microquery while timing the computation. If the timeout is reached before the computation is finished, the entire computation is “rolled back” and the default value is returned. If the computation finishes early, the microquery is paused until the timeout expires. Thus, the microquery always takes exactly the amount of time specified in the timeout. This is transparent to the program: it never knows whether the default value was returned or the value was calculated by the microquery. In the worst case, where the adversary deliberately engineers the return of the default value in the presence of a single row, we still prevent all side channels, as the “output channel” is covered by the Laplace mechanism. The timing channel is thus rendered non-informative.
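The control flow of a predictable transaction can be modeled in a few lines. The sketch below is a toy cooperative model — it counts abstract “steps” instead of microseconds, and generators stand in for abortable microqueries — whereas the real runtime enforces a hard wall-clock deadline with rollback:

```python
def predictable_transaction(microquery, default, max_steps):
    """Run a cooperative microquery (a generator yielding between units of
    work) for exactly max_steps steps. If it finishes early we idle-pad;
    if it never finishes, the default value is returned. Either way the
    observable 'duration' (step count) is always max_steps."""
    gen = microquery()
    result = default
    finished = False
    for _ in range(max_steps):
        if finished:
            continue  # idle-pad until the deadline, like the paused microquery
        try:
            next(gen)
        except StopIteration as done:
            result = done.value
            finished = True
    return result

def fast():
    yield          # one unit of work
    return 42

def endless():
    while True:    # a microquery that would blow its deadline
        yield

a = predictable_transaction(fast, default=0, max_steps=5)
b = predictable_transaction(endless, default=0, max_steps=5)
```

In both calls the loop runs exactly five steps, so a caller observing only the step count cannot tell the computed result (a) from the rolled-back default (b).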

There are additional details involved in making the runtime perform the rollback step in a constant amount of time — this has to happen regardless of what the program does within the microquery, no matter how much memory it allocates or what system call it attempts to trigger that might stall us and make us miss our deadline. The important takeaway is that the timeout deadline is a _hard_ deadline: the runtime must be able to abort anything the adversary might do and return the default value without delaying even a few microseconds. This requires tight control of the garbage collector, the memory allocation code, and so on.

In conclusion, [1] demonstrates a novel type system that allows us to automatically certify some programs as differentially private without requiring an explicit proof for each new algorithm. [2] implements a runtime that takes care of the real-world problems that arise when a programming language runtime actually has to seal off all the adversarial avenues that could break the differential privacy guarantees when the code being run is untrusted.

[1] Jason Reed, Benjamin Pierce — Distance Makes the Types Grow Stronger, ICFP 2010

[2] Andreas Haeberlen, Benjamin Pierce, Arjun Narayan — Differential Privacy Under Fire, USENIX Security 2011

[3] Benjamin C. Pierce — Types and Programming Languages, MIT Press 2002

[4] This is not strictly true, as our implementation of the type checker is not proven bug-free. An interesting work is [5], which proves the differential privacy properties in Coq — an interactive theorem prover — and builds a program checker based on Hoare logic in Coq, proving that the programs it admits are differentially private. If we implemented the Fuzz type checker in Coq and proved it correct, we would be closer to a proof [6].

[5] Gilles Barthe, Boris Köpf, Federico Olmedo, and Santiago Zanella Béguelin — Probabilistic Relational Reasoning for Differential Privacy, POPL 2012

[6] Conditioned on the Coq theorem prover being bug-free [7].

[7] And assuming ZF is consistent… [8]

The non-interactive mechanism releases a random projection of the database into polynomially many dimensions, together with the corresponding projection matrix. Queries are evaluated by computing their projection using the public projection matrix, and then taking the inner product of the projected query and the projected database. The difficulty arises because the projection matrix projects vectors from *|X|*-dimensional space to *poly(n)*-dimensional space, and so would normally take *|X| poly(n)* bits to represent. The algorithm is constrained to run in time *poly(n)*, however, and so we need a concise representation of the projection matrix. This is achieved by using a matrix implicitly generated by a family of limited-independence hash functions, which have concise representations. This requires using limited-independence versions of the Johnson-Lindenstrauss lemma and of concentration bounds.

Before presenting the algorithm, let me recall some basic concepts. A database *D* is a multiset of elements from some (possibly infinite) abstract universe *X*. We write *|D| = n* to denote the cardinality of *D*. For any we can also write D[x] to denote the number of elements of type x in the database. Viewed this way, a database is a vector with integer entries in the range *[0, n]*. A linear query is a function mapping elements in the universe to values in the real unit interval. The value of a linear query on a database is simply the average value of on elements of the database:

Similarly to how we think of a database as a vector, we can think of a query as a vector with entries Q[x]. Viewed this way, the value of the query on the database is proportional to the inner product of these two vectors.

**Definition 1** (Sparsity) The sparsity of a linear query Q is the number of elements in the universe on which it takes a non-zero value. We say that a query is *m*-sparse if its sparsity is at most *m*. We will assume that, given an m-sparse query, we can quickly (in time polynomial in m) enumerate the elements on which it is non-zero.

**Definition 2** (Accuracy for a Non-Interactive Mechanism). Let be a set of queries. A non-interactive mechanism for some abstract range is -accurate for if there exists a function s.t. for every database , with probability at least over its coins, outputs such that .

**Definition 3** (Neighboring databases) Two databases D and D′ are neighbors if they differ only in the data of a single individual, i.e. the symmetric difference |D △ D′| contains at most one element.

**Definition 4** (Differential Privacy) A randomized algorithm M acting on databases and outputting elements from some abstract range is ε-differentially private if for all pairs of neighboring databases D, D′ and for all subsets S of the range the following holds:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S].

**Definition 5** (The Laplace Distribution). The Laplace distribution (centered at 0) with scale *b* is the distribution with probability density function p(x) = (1/2b) · exp(−|x|/b).

**Definition 6** (Random Projection Data Structure). The random projection data structure of size T is composed of two parts: we write .

1. is a vector of length T

2. is a hash function implicitly representing a projection matrix . For any , we write A[i,j] for .

To evaluate a linear query Q on a random projection data structure, we first project the query and then evaluate the projected query. To project the query we compute a vector as follows. For each

then we output: .
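The projection-and-evaluate step can be sketched as follows (with a SHA-256-based stand-in for the limited-independence sign hash, and illustrative function names):

```python
import hashlib

def sign_hash(seed, i, x):
    # Deterministic +/-1 value; a stand-in for the r-wise independent
    # hash family used by the actual data structure.
    digest = hashlib.sha256(f"{seed}:{i}:{x}".encode()).digest()
    return 1 if digest[0] % 2 == 0 else -1

def project_query(query, T, seed=0):
    """Project an m-sparse query (a dict mapping x -> Q[x]) into T
    dimensions via the implicit matrix A[i, x] = sign_hash(i, x)/sqrt(T);
    only the non-zero coordinates of the query are ever touched."""
    scale = T ** -0.5
    return [scale * sum(sign_hash(seed, i, x) * qx for x, qx in query.items())
            for i in range(T)]

def evaluate(projected_query, v):
    # The query answer is the inner product with the released vector v.
    return sum(u_i * v_i for u_i, v_i in zip(projected_query, v))
```

Because projection is linear, working with m-sparse queries costs O(mT) hash evaluations rather than anything depending on |X|.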

**Algorithm** (SparseProject takes as input a private database D of size *n*, privacy parameters and , a confidence parameter , a sparsity parameter *m*, and the size of the target query class *k*):

**SparseProject**()

**Let** .

**Let** f be a randomly chosen hash function from a family of -wise independent hash functions mapping . Write A[i,j] to denote .

**Let** be vectors of length T.

**for** *i = 1* to T **do**

**Let**

**Let**

**end for**

**Output**

**Remark**: There are various ways to select a hash function from a family of *r*-wise independent hash functions mapping . The simplest, and one that suffices for our purposes, is to select the smallest integer *s* such that , and then to let f be a random degree-*r* polynomial over the finite field . Selecting and representing such a function takes time and space . f is then an unbiased *r*-wise independent hash function mapping . Taking only the last output bit gives an unbiased *r*-wise independent hash function mapping .
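As an illustration of this construction, the sketch below draws a random polynomial with r coefficients over a prime field Z_p rather than GF(2^s) — a simplification, but a random polynomial of degree below r over any finite field also yields an r-wise independent family:

```python
import random

def make_rwise_hash(r, p, rng=random.Random(0)):
    """Return a hash from an r-wise independent family: a random
    polynomial with r coefficients (degree < r) over the prime field Z_p."""
    coeffs = [rng.randrange(p) for _ in range(r)]

    def h(x):
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % p  # Horner evaluation mod p
        return acc

    return h

h = make_rwise_hash(r=4, p=2**13 - 1)        # 2^13 - 1 = 8191 is prime
last_bit = lambda x: h(x) & 1                # keep only the last bit, as in the remark
```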

Below I provide the basic theorems about the SparseProject algorithm. (All proofs can be found in *Fast Private Data Release Algorithms for Sparse Queries* (Avrim Blum and Aaron Roth, 2011).)

**Theorem 1** *SparseProject is -differentially private.*

**Theorem 2** *For any and any , and with respect to any class of m-sparse linear queries of cardinality , SparseProject is -accurate for:*

* *

*where the hides a term logarithmic in (m+n).*

Privacy concerns can be viewed as a form of strategic play: participants are worried that the specific values of their inputs may result in noticeably different outcomes and utilities. While results from Mechanism Design can provide interesting privacy-preserving algorithms, this paper tries to look into the converse and show that strong privacy guarantees, such as given by differential privacy, can enrich the field of Mechanism Design.

In this paper we see that differential privacy leads to a relaxation of the truthfulness assumption: the incentive to lie about a value is not zero, but it is tightly controlled. This approximate truthfulness also provides guarantees in the presence of collusion and under repeated applications of the mechanism. Moreover, it allows us to create mechanisms for problems that cannot be handled with strict truthfulness.

The paper first introduces the use of differential privacy as a solution concept for mechanism design. The authors show that differentially private mechanisms are approximately dominant-strategy truthful under arbitrary player utility functions. More precisely, mechanisms satisfying -differential privacy make truth-telling an -dominant strategy for any utility function mapping to .

They also show that such mechanisms are resilient to coalitions, i.e. people cannot collude to improve their utilities. The mechanisms also allow repeated runs: if we run them several times, people cannot lie in the early rounds to obtain a more desirable utility in later rounds. Note that these guarantees are approximate: incentives to deviate are present, though arbitrarily small, as controlled by the parameter .

Secondly, the paper expands the applicability of differential privacy by presenting the exponential mechanism, which can be applied in much more general settings. The authors show that the mechanism is very unlikely to produce undesirable outputs and prove guarantees about the quality of the output, even for functions that are not robust to additive noise, and for those whose output may not even permit perturbation.

Finally, they apply this general mechanism to several problems in unlimited supply auctions and pricing, namely the problem of single-commodity pricing, attributes auctions, and structurally constrained pricing problems. Although auctions permit offering different prices to different players, all of the results in this work are single price, and envy-free.

\textbf{Unlimited Supply Auctions.} The digital goods auction problem involves a set of bidders. Each one has a private utility for a good at hand, and the auctioneer has an unlimited supply of the good. The bidders submit bids in , and the auctioneer determines who receives the good at what prices. Let denote the optimal fixed price revenue that the auctioneer could earn from the submitted bids. By applying the exponential mechanism to it, the authors show that there is a mechanism for the digital goods auction problem giving -differential privacy, with expected revenue at least .
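A sketch of the exponential mechanism specialized to fixed-price revenue may help. The price grid, the exp(epsilon * score / 2) weighting (which presumes the revenue score changes by at most 1 when a single bid changes, since bids and prices lie in [0, 1]), and all names here are illustrative assumptions:

```python
import math
import random

def exponential_mechanism_price(bids, prices, epsilon, rng=random.Random(0)):
    """Select a fixed sale price with probability proportional to
    exp(epsilon * revenue(price) / 2), using revenue as the quality score."""
    def revenue(p):
        # Fixed-price revenue: everyone bidding at least p buys at price p.
        return p * sum(1 for b in bids if b >= p)

    weights = [math.exp(epsilon * revenue(p) / 2.0) for p in prices]
    return rng.choices(prices, weights=weights, k=1)[0]

price = exponential_mechanism_price(
    bids=[0.9, 0.8, 0.3], prices=[0.25, 0.5, 0.75, 1.0], epsilon=1.0)
```

Higher-revenue prices get exponentially more weight, which is what drives the near-optimal expected-revenue guarantee while keeping the output distribution insensitive to any single bid.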

A comparable result, which uses machine learning on random samples of users to arrive at a truthful mechanism, gives expected revenue.

\textbf{Attribute Auctions.} The digital goods attribute auction problem adds public, immutable attributes to each participant. The output of a mechanism for this problem describes a market segmentation of the attribute space, and prices for each market segment. Let denote the optimal revenue over segmentations into markets, and the number of segmentations of the specific bidders at hand into markets. The result presented for this problem is that “there is a mechanism for the digital goods attribute auction problem giving -differential privacy, with expected revenue at least ”.

\textbf{Constrained Pricing Problems.} Here bidders submit a bid for each of different kinds of products, and the mechanism is constrained to sell at most one kind of product at a fixed price. The authors show that there is a mechanism for this problem giving -differential privacy, with expected revenue at least , where is the number of items sold in .

Before differential privacy was formally defined, there was work done on privacy for statistical databases. A 2003 paper by Irit Dinur and Kobbi Nissim, titled Revealing Information while Preserving Privacy, considers many of the same issues as differential privacy. The paper defines a notion of privacy, and establishes a tight bound on how much noise must be added to protect database privacy.

The paper first models a statistical database as a string of bits, together with a query-responder that answers subset-sum queries against this database with some added perturbation, to protect privacy. Lacking a definition like differential privacy, the authors did not try to define privacy directly, but instead defined non-privacy, which holds if, with high probability, an adversary can reconstruct almost all of the database by querying it.

As expected, if the adversary can issue many queries, a large amount of noise is needed. The main result of the paper is that if less than noise is added, then the database is non-private against a polynomially bounded adversary. The main idea is for the adversary to draw a polynomial number of queries at random, and solve a linear program to find a database that is consistent with the responses, knowing that the noise is bounded by . The paper shows that, with high probability, at least one of the chosen queries will disqualify every string that differs from the actual database on some fraction of the elements. In other words, once the potential strings that don’t agree with the query answers are weeded out, all the strings that remain almost entirely reconstruct the hidden database (with high probability).
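The attack is simple to state in code. The sketch below substitutes a brute-force search over all 2^n candidate databases for the paper's linear program (feasible only for tiny n, and shown with exact answers, i.e. zero noise, purely to illustrate the consistency check; the theorem concerns sub-√n noise and a polynomial-time LP):

```python
import itertools
import random

def reconstruct(n, queries, answers, noise_bound):
    """Find any database consistent with every noisy subset-sum answer.
    Brute force over 2^n candidates stands in for the linear program."""
    for candidate in itertools.product((0, 1), repeat=n):
        if all(abs(sum(candidate[i] for i in q) - a) <= noise_bound
               for q, a in zip(queries, answers)):
            return list(candidate)
    return None

rng = random.Random(0)
n = 8
secret = [rng.randrange(2) for _ in range(n)]
# Random subset-sum queries, answered exactly (noise_bound = 0).
queries = [tuple(i for i in range(n) if rng.randrange(2)) for _ in range(40)]
answers = [sum(secret[i] for i in q) for q in queries]
recovered = reconstruct(n, queries, answers, noise_bound=0)
```

With enough random queries, any string that survives the consistency check agrees with the hidden database almost everywhere, which is exactly the non-privacy condition.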

A further wrinkle is that this lower bound on noise is tight: the paper gives an algorithm with noise that is provably private. However, since there is no good definition of privacy here (only of non-privacy), the only way to have a private algorithm is to have one that reveals almost no useful information at all.

This paper is interesting for a few other reasons. Firstly, the authors briefly explore a “CD” model, where the database is perturbed, “written on a CD”, and distributed to the adversary, who can make arbitrary queries against this modified database. Secondly, the paper investigates how much the noise can be reduced if we are willing to constrain the adversary further (say, to an adversary that can issue only a linear or logarithmic number of queries). Finally, the paper indicates that a better definition of privacy is needed; otherwise very little usable information can be released.


Adam’s talk will be at noon on 9/22 in Levine 307.

Sofya’s talk will be at 1:30 on 9/22 in Towne 315. (This is where and when we usually have class!)

**Speaker: Adam Smith**

**Title: “Differential Privacy in Statistical Estimation Problems”**

Consider an agency holding a large database of sensitive personal information — medical records, census survey answers, or web search records, for example. The agency would like to discover and publicly release global characteristics of the data (say, to inform policy and business decisions) while protecting the privacy of individuals’ records. This problem is known variously as “statistical disclosure control”, “privacy-preserving data mining” or simply “database privacy”.

This talk will describe differential privacy, a notion which emerged from a recent line of work in theoretical computer science that seeks to formulate and satisfy rigorous definitions of privacy for such statistical databases.

After a brief introduction to the topic, I will discuss recent results on differentially private versions of basic algorithms from statistics. I will discuss both a general result that works for any asymptotically normal statistical estimator (based on a STOC 2011 paper) and some results tailored to convex optimization problems (unpublished work, joint with A. G. Thakurta).

**Speaker: Sofya Raskhodnikova**

**Title: “Testing and Reconstruction of Lipschitz Functions with Applications to Differential Privacy”**

A function f : D -> R has Lipschitz constant c if d_R(f(x), f(y)) <= c d_D(x, y) for all x, y in D, where d_R and d_D denote the distance functions on the range and domain of f, respectively. We say a function is Lipschitz if it has Lipschitz constant 1. (Note that rescaling by a factor of 1/c converts a function with a Lipschitz constant c into a Lipschitz function.) Intuitively, a Lipschitz constant of f is a bound on how sensitive f is to small changes in its input. Lipschitz constants are important in mathematical analysis, the theory of differential equations and other areas of mathematics and computer science. However, in general, it is computationally infeasible to find a Lipschitz constant of a given function f or even to verify that f is c-Lipschitz for a given number c.

We initiate the study of testing (which is a relaxation of the decision problem described above) and local reconstruction of the Lipschitz property of functions. A property tester, given a proximity parameter epsilon, has to distinguish functions with the property (in this case, Lipschitz) from functions that are epsilon-far from having the property, that is, differ from every function with the property on at least an epsilon fraction of the domain. A local filter reconstructs a desired property (in this case, Lipschitz) in the following sense: given an arbitrary function f and a query x, it returns g(x), where the resulting function g satisfies the property, changing f only when necessary. If f has the property, g must be equal to f.

We design efficient testers and local reconstructors for functions over domains of the form {1,…,n}^d, equipped with the L1 distance, and give corresponding lower bounds. The testers we developed have applications to program analysis. The reconstructors have applications to data privacy. The application to privacy is based on the fact that a function f of entries in a database of sensitive information can be released with noise of magnitude proportional to a Lipschitz constant of f, while preserving the privacy of individuals whose data is stored in the database (Dwork, McSherry, Nissim and Smith, TCC 2006). We give a differentially private mechanism, based on local filters, for releasing a function f when a purported Lipschitz constant of f is provided by a distrusted client.
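For intuition, on a small finite 1-D domain a Lipschitz constant can be computed by brute force over all pairs — exactly the expense that property testing seeks to avoid on large domains. This toy sketch also illustrates the rescaling remark from the abstract:

```python
def lipschitz_constant(f, domain):
    """Smallest c with |f(x) - f(y)| <= c|x - y| over a finite 1-D domain,
    found by checking every pair of points (brute force)."""
    pts = sorted(domain)
    return max(abs(f(x) - f(y)) / (y - x)
               for i, x in enumerate(pts) for y in pts[i + 1:])

c = lipschitz_constant(lambda x: 3 * x, range(1, 11))
# Rescaling by 1/c turns a c-Lipschitz function into a 1-Lipschitz one.
c_scaled = lipschitz_constant(lambda x: (3 * x) / c, range(1, 11))
```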

What it does is take a look at the publicly available Twitter graph: your Twitter profile, and the profiles of the people you follow and those who follow you. It then predicts how you will answer a wide variety of personal questions about yourself, ranging from your technological prowess to your beliefs about religion, the death penalty, abortion, firearms, superstitions, and all manner of other seemingly unrelated things.

So how well does it work? Remarkably well.

When I tried it out, the “Twitter Predictor” asked me 20 questions and guessed my answers correctly on all 20 of them. Many of these might be considered “sensitive information”, involving my views on religion, abortion, and other hot-button issues. This is all the more remarkable given how little information they really had about me. I am not what you would call an active twitter user. I have made a grand total of 26 tweets, the last of which was in 2008. The plurality of these tweets are invitations for people to join me for lunch. I only follow 23 people, and needless to say, I haven’t “followed” anyone that I have met since 2008.

So what does this mean for privacy? I think that it is a compelling demonstration of the power of correlations in data. You might think that making the twitter graph public is innocuous enough, without realizing that it contains information beyond that. I might not mind if everyone knows who the 23 people I knew in grad school who signed up for twitter accounts in 2008 are, but I might mind if everyone knows who I voted for in the last election.

This all just serves to illustrate the problems with trying to partition the world into “sensitive information” and “insensitive information”, or with assuming that data analysts don’t know anything else about the world except what you have told them…
