Archive | September, 2011

## Mechanism Design via Differential Privacy (Hoda Heidari)

27 Sep

The field of Mechanism Design studies the design and analysis of algorithms that are robust to strategic manipulation of their inputs by selfish agents. Algorithmic mechanism design mostly aims to design “truthful” mechanisms, meaning mechanisms in which truthfully reporting her value is a dominant strategy for each user. Truthful mechanisms simplify the analysis, since users have no incentive to game the mechanism for a higher expected utility. However, in many settings exact truthfulness is too strong a requirement, and it can prevent the mechanism from having other useful properties.

Privacy concerns can be viewed as a form of strategic play: participants worry that the specific values of their inputs may result in noticeably different outcomes and utilities. While results from Mechanism Design can yield interesting privacy-preserving algorithms, this paper looks at the converse, showing that strong privacy guarantees, such as those given by differential privacy, can enrich the field of Mechanism Design.

In this paper we see that differential privacy leads to a relaxation of the truthfulness requirement: the incentive to lie about a value is not zero, but it is tightly controlled. This approximate truthfulness also provides guarantees in the presence of collusion and under repeated applications of the mechanism. Moreover, it allows us to create mechanisms for problems that cannot be handled with strict truthfulness.

The paper first introduces the use of differential privacy as a solution concept for mechanism design. The authors show that differentially private mechanisms are approximately dominant-strategy truthful under arbitrary player utility functions. More precisely, a mechanism satisfying $\epsilon$-differential privacy makes truth telling an $(e^\epsilon - 1)$-dominant strategy for any utility function mapping $Range(M)$ to $[0, 1]$.
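The argument behind this bound is short. Writing $u$ for a player's utility function and $x, x'$ for two inputs differing only in that player's report, $\epsilon$-differential privacy bounds how much the output distribution, and hence the expected utility, can shift:

```latex
\mathbb{E}[u(M(x))] = \sum_{r \in Range(M)} u(r)\,\Pr[M(x) = r]
  \;\le\; \sum_{r \in Range(M)} u(r)\, e^{\epsilon}\, \Pr[M(x') = r]
  \;=\; e^{\epsilon}\, \mathbb{E}[u(M(x'))].
```

Since $u$ takes values in $[0,1]$, the expected gain from misreporting is at most $(e^\epsilon - 1)\,\mathbb{E}[u(M(x'))] \le e^\epsilon - 1$, which is roughly $\epsilon$ for small $\epsilon$.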

They also show that such mechanisms are resilient to coalitions: players cannot collude to improve their utilities. The mechanisms also tolerate repeated runs: if a mechanism is run several times, players cannot lie in early rounds to obtain a more desirable utility in later rounds. Note that these guarantees are approximate: incentives to deviate are present, though arbitrarily small, as controlled by the parameter $\epsilon$.

Second, the paper expands the applicability of differential privacy by presenting the exponential mechanism, which can be applied in a much larger class of settings. The authors show that the mechanism is very unlikely to produce undesirable outputs, and they prove guarantees about the quality of the output even for functions that are not robust to additive noise, and for those whose outputs do not even permit perturbation.
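As a concrete sketch: the exponential mechanism takes a quality score $q(d, r)$ for each candidate outcome $r$ and samples $r$ with probability proportional to $\exp(\epsilon \, q(d, r) / 2\Delta q)$, where $\Delta q$ is the sensitivity of $q$. A minimal Python version (the function name and interface are my own, not from the paper):

```python
import math
import random

def exponential_mechanism(outcomes, quality, data, epsilon, sensitivity=1.0):
    """Sample an outcome r with probability proportional to
    exp(epsilon * quality(data, r) / (2 * sensitivity))."""
    scores = [quality(data, r) for r in outcomes]
    # Shift by the max score for numerical stability; this does not
    # change the sampling distribution.
    m = max(scores)
    weights = [math.exp(epsilon * (s - m) / (2 * sensitivity)) for s in scores]
    pick = random.uniform(0, sum(weights))
    acc = 0.0
    for r, w in zip(outcomes, weights):
        acc += w
        if pick <= acc:
            return r
    return outcomes[-1]
```

For example, privately selecting the most common value in a list, with quality equal to the count, satisfies $\epsilon$-differential privacy because the count has sensitivity 1.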

Finally, they apply this general mechanism to several problems in unlimited-supply auctions and pricing, namely single-commodity pricing, attribute auctions, and structurally constrained pricing problems. Although auctions permit offering different prices to different players, all of the results in this work use a single price and are therefore envy-free.

\textbf{Unlimited Supply Auctions.} The digital goods auction problem involves a set of $n$ bidders. Each bidder has a private valuation for the good at hand, and the auctioneer has an unlimited supply of the good. The bidders submit bids in $[0, 1]$, and the auctioneer determines who receives the good and at what price. Let $OPT$ denote the optimal fixed-price revenue that the auctioneer could earn from the submitted bids. By applying the exponential mechanism, the authors show that there is a mechanism for the digital goods auction problem giving $\epsilon$-differential privacy, with expected revenue at least $OPT - 3\ln{(e + \epsilon^2 OPT n)}/\epsilon$.
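A sketch of how the exponential mechanism applies here, with prices discretized to a grid (the discretization and names are my own simplification, not the paper's construction): the quality of a price is the revenue it would earn, and changing one bid in $[0,1]$ changes the revenue at any fixed price by at most 1, so the sensitivity is 1.

```python
import math
import random

def private_fixed_price(bids, epsilon, grid=100):
    """Pick a single sale price via the exponential mechanism, with
    quality(price) = revenue at that price. Since bids lie in [0, 1],
    changing one bid changes the revenue at a fixed price by at most 1,
    so we use sensitivity 1."""
    prices = [i / grid for i in range(1, grid + 1)]
    revenue = lambda p: p * sum(1 for b in bids if b >= p)
    scores = [revenue(p) for p in prices]
    # Shift by the max score for numerical stability.
    m = max(scores)
    weights = [math.exp(epsilon * (s - m) / 2) for s in scores]
    pick = random.uniform(0, sum(weights))
    acc = 0.0
    for p, w in zip(prices, weights):
        acc += w
        if pick <= acc:
            return p
    return prices[-1]
```

With a clear revenue-maximizing price, the mechanism selects it (or a nearby price) with overwhelming probability, which is the source of the $OPT$-minus-logarithmic-loss guarantee.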

A comparable result, which uses machine learning on random samples of users to arrive at a truthful mechanism, gives expected revenue $OPT - O(\sqrt{OPT})$.

\textbf{Attribute Auctions.} The digital goods attribute auction problem adds public, immutable attributes to each participant. The output of a mechanism for this problem describes a market segmentation of the attribute space and a price for each market segment. Let $OPT_k$ denote the optimal revenue over segmentations into $k$ markets, and $SEG_k$ the number of segmentations of the specific bidders at hand into $k$ markets. The result presented for this problem is that “There is a mechanism for the digital goods attribute auction problem giving $\epsilon$-differential privacy, with expected revenue at least $\max_{k}(OPT_k - 3\ln{(e+\epsilon^{k+1} OPT_k SEG_k n^{k+1})}/\epsilon)$.”

\textbf{Constrained Pricing Problems.} Here bidders submit a bid for each of $k$ different kinds of products, and the mechanism is constrained to sell at most one kind of product at a fixed price. The authors show that there is a mechanism for this problem giving $2\epsilon$-differential privacy, with expected revenue at least $OPT - 3\ln{(e+\epsilon^2 OPT km)}/\epsilon$, where $m$ is the number of items sold in $OPT$.

## Class project

27 Sep

FYI: The list of project deadlines and some project ideas is now online: http://www.cis.upenn.edu/~aaroth/courses/projectideas.html

## Lower Bound for Noise in a Private Mechanism (Justin Hsu)

21 Sep

Before differential privacy was formally defined, there was work done on privacy for statistical databases. A 2003 paper by Irit Dinur and Kobbi Nissim, titled Revealing Information while Preserving Privacy, considers many of the same issues as differential privacy. The paper defines a notion of privacy, and establishes a tight bound on how much noise must be added to protect database privacy.

The paper first models a statistical database as a string of bits, together with a query responder that answers subset-sum queries against this database with some added perturbation to protect privacy. Since differential privacy had not yet been defined, the authors did not attempt to define privacy directly; instead they defined non-privacy, which holds if, with high probability, an adversary can reconstruct almost all of the database by querying it.

As expected, if the adversary can issue many queries, a large amount of noise is needed. The main result of the paper is that if only $o(\sqrt{n})$ noise is added, then the database is non-private against a polynomially bounded adversary. The main idea is for the adversary to draw a polynomial number of queries at random and solve a linear program to find a database consistent with the responses, knowing that the noise is $o(\sqrt{n})$. The paper shows that, with high probability, at least one of the chosen queries disqualifies every string that differs from the actual database on some constant fraction of the elements. In other words, once the candidate strings that disagree with the query answers are weeded out, the strings that remain almost entirely reconstruct the hidden database (with high probability).
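The paper's attack finds a consistent database by solving a linear program; at toy scale the same consistency check can be done by brute force. The sketch below (the parameters and names are my own illustration, not the paper's construction) issues random subset-sum queries with noise of magnitude at most 1, far below $\sqrt{n}$, and keeps only the candidate databases consistent with every noisy answer:

```python
import itertools
import random

def reconstruction_attack(answers, queries, n, bound):
    """Return all candidate databases (bit tuples) whose subset sums lie
    within `bound` of every noisy answer. The paper solves this
    consistency problem with a linear program; for a toy n we can simply
    enumerate all 2^n candidates."""
    survivors = []
    for cand in itertools.product((0, 1), repeat=n):
        if all(abs(sum(cand[i] for i in q) - a) <= bound
               for q, a in zip(queries, answers)):
            survivors.append(cand)
    return survivors

random.seed(0)
n = 10
db = tuple(random.randint(0, 1) for _ in range(n))  # the hidden database
# Random subset-sum queries, answered with noise in {-1, 0, 1}.
queries = [tuple(i for i in range(n) if random.random() < 0.5)
           for _ in range(60)]
answers = [sum(db[i] for i in q) + random.choice([-1, 0, 1])
           for q in queries]
survivors = reconstruction_attack(answers, queries, n, bound=1)
# The true database always survives; candidates that differ from it on
# many bits are (with high probability) ruled out by some query.
```

Because the noise never exceeds the consistency bound, the hidden database is guaranteed to be among the survivors, and the paper's argument shows the other survivors agree with it almost everywhere.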

A further wrinkle is that this lower bound on noise is tight: the paper gives an algorithm adding $O(\sqrt{n})$ noise that is provably private. However, since there is no good definition of privacy here (only of non-privacy), the only way to have a private algorithm is to have one that reveals almost no useful information at all.

This paper is interesting for a few other reasons. First, the authors briefly explore a “CD” model, where the database is perturbed once, “written on a CD,” and distributed to the adversary, who can then make arbitrary queries against this modified database. Second, the paper investigates how much the noise can be reduced if we are willing to constrain the adversary further (say, to an adversary that can issue only a linear or logarithmic number of queries). Finally, the paper indicates that a better definition of privacy is needed; otherwise very little usable information can be released.

## Talks instead of class on Thursday

20 Sep

As I announced in class today, I will not be lecturing this Thursday 9/22. This is because in a stroke of good luck, we have two excellent talks on differential privacy, given by Adam Smith and Sofya Raskhodnikova. Note that Sofya’s talk will be in a different time and place than I announced in class!

Adam’s talk will be at noon on 9/22 in Levine 307.

Sofya’s talk will be at 1:30 on 9/22 in Towne 315 (this is where and when we usually have class!)

Title: “Differential Privacy in Statistical Estimation Problems”

Consider an agency holding a large database of sensitive personal information — medical records, census survey answers, or web search records, for example. The agency would like to discover and publicly release global characteristics of the data (say, to inform policy and business decisions) while protecting the privacy of individuals’ records. This problem is known variously as “statistical disclosure control”, “privacy-preserving data mining” or simply “database privacy”.

This talk will describe differential privacy, a notion which emerged from a recent line of work in theoretical computer science that seeks to formulate and satisfy rigorous definitions of privacy for such statistical databases.

After a brief introduction to the topic, I will discuss recent results on differentially private versions of basic algorithms from statistics. I will discuss both a general result that works for any asymptotically normal statistical estimator (based on a STOC 2011 paper) and some results tailored to convex optimization problems (unpublished work, joint with A. G. Thakurta).

Title: “Testing and Reconstruction of Lipschitz Functions with Applications to Differential Privacy”

A function f : D -> R has Lipschitz constant c if d_R(f(x), f(y)) <= c d_D(x, y) for all x, y in D, where d_R and d_D denote the distance functions on the range and domain of f, respectively. We say a function is Lipschitz if it has Lipschitz constant 1. (Note that rescaling by a factor of 1/c converts a function with a Lipschitz constant c into a Lipschitz function.) Intuitively, a Lipschitz constant of f is a bound on how sensitive f is to small changes in its input. Lipschitz constants are important in mathematical analysis, the theory of differential equations and other areas of mathematics and computer science. However, in general, it is computationally infeasible to find a Lipschitz constant of a given function f or even to verify that f is c-Lipschitz for a given number c.
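To make the definition concrete, here is a brute-force check (my own illustration, feasible only for tiny domains — the point of the talk is precisely that testing is a relaxation of this infeasible decision problem) of the c-Lipschitz property for functions on the grid {1,…,n}^d with L1 distance. On this domain it suffices to compare points that differ by 1 in a single coordinate, since any pair of points is joined by a path of such unit steps and L1 distance adds up along the path:

```python
import itertools

def is_c_lipschitz_on_grid(f, n, d, c):
    """Check whether f : {1,...,n}^d -> R is c-Lipschitz with respect
    to L1 distance on the domain. It is enough to compare each point
    with its neighbors one step away in a single coordinate: by the
    triangle inequality along a monotone path of unit steps, bounding
    every unit step by c bounds every pair by c times its L1 distance."""
    for x in itertools.product(range(1, n + 1), repeat=d):
        for i in range(d):
            if x[i] < n:
                y = x[:i] + (x[i] + 1,) + x[i + 1:]
                if abs(f(x) - f(y)) > c:
                    return False
    return True
```

For example, f(x) = x_1 + x_2 is 1-Lipschitz on this grid, while f(x) = 2 x_1 is 2-Lipschitz but not 1-Lipschitz. The check visits all n^d points, which is exactly the kind of cost that property testing avoids.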

We initiate the study of testing (which is a relaxation of the decision problem described above) and local reconstruction of the Lipschitz property of functions. A property tester, given a proximity parameter epsilon, has to distinguish functions with the property (in this case, Lipschitz) from functions that are epsilon-far from having the property, that is, differ from every function with the property on at least an epsilon fraction of the domain. A local filter reconstructs a desired property (in this case, Lipschitz) in the following sense: given an arbitrary function f and a query x, it returns g(x), where the resulting function g satisfies the property, changing f only when necessary. If f has the property, g must be equal to f.

We design efficient testers and local reconstructors for functions over domains of the form {1,…,n}^d, equipped with L1 distance, and give corresponding lower bounds. The testers that we developed have applications to program analysis. The reconstructors have applications to data privacy. The application to privacy is based on the fact that a function f of entries in a database of sensitive information can be released with noise of magnitude proportional to a Lipschitz constant of f, while preserving the privacy of individuals whose data is stored in the database (Dwork, McSherry, Nissim and Smith, TCC 2006). We give a differentially private mechanism, based on local filters, for releasing a function f when a purported Lipschitz constant of f is provided by a distrusted client.

20 Sep

Michael pointed me to the hunch.com “Twitter Predictor”. It is a remarkable piece of software — if you have a twitter account, I suggest you try it out.

It takes a look at the publicly available twitter graph: your twitter profile, and the profiles of the people you follow and those who follow you. It then predicts how you will answer a wide variety of personal questions about yourself, ranging from your technological prowess to your beliefs about religion, the death penalty, abortion, firearms, superstitions, and all manner of other seemingly unrelated things.

So how well does it work? Remarkably well.

When I tried it out, the “Twitter Predictor” asked me 20 questions and guessed my answers correctly on all 20 of them. Many of these might be considered “sensitive information”, involving my views on religion, abortion, and other hot-button issues. This is all the more remarkable given how little information they really had about me. I am not what you would call an active twitter user. I have made a grand total of 26 tweets, the last of which was in 2008. The plurality of these tweets are invitations for people to join me for lunch. I only follow 23 people, and needless to say, I haven’t “followed” anyone that I have met since 2008.

So what does this mean for privacy? I think that it is a compelling demonstration of the power of correlations in data. You might think that making the twitter graph public is innocuous enough, without realizing that it contains information beyond that. I might not mind if everyone knows who the 23 people I knew in grad school who signed up for twitter accounts in 2008 are, but I might mind if everyone knows who I voted for in the last election.

This all just serves to illustrate the problems with trying to partition the world into “sensitive information” and “insensitive information”, or with assuming that data analysts don’t know anything else about the world except what you have told them…

## Netflix De-anonymizing Attack Report (Mehdi Nikkhah)

15 Sep

As you all know, the paper “Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)” by Arvind Narayanan and Vitaly Shmatikov is about de-anonymizing datasets that are mistakenly believed to be private, using only a small amount of auxiliary information. The main focus of the paper is on creating a model and proposing two algorithms for this purpose. The model is robust to sparsity and to perturbation added to the dataset, either on purpose or as background noise, while still producing high-precision results.

In the model they define a similarity function, Sim(), to measure the amount of common data between different records. Another function used in the model is a scoring function, Score(), which ranks how similar each record is to the auxiliary data the adversary holds. This function is used to find the most similar record given the auxiliary information.

They then introduce two algorithms, which differ in that the first looks at the data uniformly, while the second uses a weighting process that tries to capture rare events and give them high weight. The example the authors use is as follows:

“ it is more useful to know that the target has purchased “The Dedalus Book of French Horror” than the fact that she purchased a Harry Potter book”
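A minimal sketch of this weighted scoring idea (the function names and the exact weight are my own illustration; the paper's scoring also accounts for ratings and dates): each item an adversary knows about contributes to a record's score with weight inversely related to the item's popularity, so rare items dominate the match.

```python
import math

def score(aux, record, supports):
    """Score how well `record` (a set of items) matches the adversary's
    auxiliary information `aux`, weighting rare items more heavily: an
    item seen by few users is far more identifying than a blockbuster.
    Assumes every item's support is at least 2, so the log is positive."""
    return sum(1.0 / math.log(supports[item])
               for item in aux if item in record)

def best_match(aux, records, supports):
    """Return the candidate record with the highest score against aux."""
    return max(records, key=lambda r: score(aux, r, supports))
```

On a toy instance, knowing that the target rated an obscure title outweighs knowing they rated a popular one, exactly as in the quoted example.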

At the end, they apply the second algorithm to the released Netflix dataset, which Netflix believed to be private, as stated in the following exchange on the FAQ page of their website:

Q: “Is there any customer information in the dataset that should be kept private?”

A: “No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy, which you can review here. Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?”

The authors de-anonymized the dataset by cross-correlating it with users’ IMDb (Internet Movie Database) ratings. Even in the absence of an oracle telling them they are right, the results are precise: they found people who were exact matches across the two databases.

The conclusion of the paper is that privacy must be defined carefully to prevent future breaches of this kind. Netflix did not realize that removing people’s names would not necessarily protect their privacy.

## Privacy breach at a Stanford Hospital.

8 Sep

A timely article in the New York Times about why it is dangerous to make private data sets available in their original form, even to contractors who have signed confidentiality agreements.