## Lower Bound for Noise in a Private Mechanism (Justin Hsu)

21 Sep

Before differential privacy was formally defined, there was already work on privacy for statistical databases. A 2003 paper by Irit Dinur and Kobbi Nissim, titled “Revealing Information while Preserving Privacy”, considers many of the same issues as differential privacy. The paper defines a notion of (non-)privacy and establishes a tight bound on how much noise must be added to protect database privacy.

The paper first models a statistical database as a string of bits, together with a query-responder that answers subset-sum queries against this database with some added perturbation, to protect privacy. Rather than attempt a positive definition of privacy, the authors defined non-privacy: a database is non-private if, with high probability, an adversary can reconstruct almost all of it by querying the database.

As expected, if the adversary can issue many queries, a large amount of noise is needed. The main result of the paper is that if only $o(\sqrt{n})$ noise is added, then the database is non-private against a polynomially bounded adversary. The main idea is for the adversary to draw a polynomial number of subset queries at random and solve a linear program to find a candidate database consistent with the responses, using the fact that the noise is $o(\sqrt{n})$. The paper shows that with high probability, at least one of the random queries disqualifies every string that differs from the actual database on a constant fraction of the entries. In other words, once the candidate strings that disagree with some query answer are weeded out, every remaining string reconstructs the hidden database almost entirely (with high probability).
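To make the attack concrete, here is a toy sketch. The paper's adversary solves a linear program; for a tiny database we can cheat and simply enumerate all candidate bit-strings, keeping those consistent with every noisy subset-sum answer. All parameter choices below (database size, noise bound, number of queries) are illustrative, not the paper's.

```python
import itertools
import random

# Toy version of the Dinur-Nissim reconstruction attack. The paper solves a
# linear program over a polynomial number of random subset-sum queries; for a
# tiny database we can instead enumerate all 2^n candidate bit-strings and
# keep those consistent with every noisy answer.

random.seed(0)
n = 10                      # database size
E = 1                       # noise bound: each answer is off by at most E
secret = [random.randint(0, 1) for _ in range(n)]

# Issue random subset-sum queries, perturbing each answer by at most E.
queries = []
for _ in range(8 * n):
    subset = [i for i in range(n) if random.random() < 0.5]
    true_sum = sum(secret[i] for i in subset)
    queries.append((subset, true_sum + random.randint(-E, E)))

# Keep every candidate database consistent with all answers (within E).
consistent = [
    cand for cand in itertools.product([0, 1], repeat=n)
    if all(abs(sum(cand[i] for i in s) - a) <= E for s, a in queries)
]

# Every surviving candidate agrees with the secret on almost all entries.
worst = max(sum(c != s for c, s in zip(cand, secret)) for cand in consistent)
print("candidates:", len(consistent), "max disagreement:", worst)
```

The true database is always among the survivors, and the queries rule out every candidate that disagrees with it on many positions, which is exactly the non-privacy phenomenon the paper formalizes.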

A further wrinkle is that this lower bound on noise is tight: the paper gives an algorithm that adds $O(\sqrt{n})$ noise and is provably private. However, since there is no positive definition of privacy here (only non-privacy), the only way to have a private algorithm is to have one that reveals almost no useful information at all.

This paper is interesting for a few other reasons. First, the authors briefly explore a “CD” model, where the database is perturbed once, “written on a CD”, and distributed to the adversary, who can make arbitrary queries against this modified database. Second, the paper investigates how much the noise can be reduced if we are willing to constrain the adversary further (say, to a linear or logarithmic number of queries). Finally, the paper indicates that a better definition of privacy is needed; otherwise very little usable information can be released.

## Talks instead of class on Thursday

20 Sep

As I announced in class today, I will not be lecturing this Thursday 9/22. This is because, in a stroke of good luck, we have two excellent talks on differential privacy, given by Adam Smith and Sofya Raskhodnikova. Note that Sofya’s talk will be at a different time and place than I announced in class!

Adam’s talk will be at noon on 9/22 in Levine 307.

Sofya’s talk will be at 1:30 on 9/22 in Towne 315 (this is where and when we usually have class!)

Title: “Differential Privacy in Statistical Estimation Problems”

Consider an agency holding a large database of sensitive personal information — medical records, census survey answers, or web search records, for example. The agency would like to discover and publicly release global characteristics of the data (say, to inform policy and business decisions) while protecting the privacy of individuals’ records. This problem is known variously as “statistical disclosure control”, “privacy-preserving data mining” or simply “database privacy”.

This talk will describe differential privacy, a notion which emerged from a recent line of work in theoretical computer science that seeks to formulate and satisfy rigorous definitions of privacy for such statistical databases.

After a brief introduction to the topic, I will discuss recent results on differentially private versions of basic algorithms from statistics. I will discuss both a general result that works for any asymptotically normal statistical estimator (based on a STOC 2011 paper) and some results tailored to convex optimization problems (unpublished work, joint with A. G. Thakurta).

Title: “Testing and Reconstruction of Lipschitz Functions with Applications to Differential Privacy”

A function f : D -> R has Lipschitz constant c if d_R(f(x), f(y)) <= c d_D(x, y) for all x, y in D, where d_R and d_D denote the distance functions on the range and domain of f, respectively. We say a function is Lipschitz if it has Lipschitz constant 1. (Note that rescaling by a factor of 1/c converts a function with a Lipschitz constant c into a Lipschitz function.) Intuitively, a Lipschitz constant of f is a bound on how sensitive f is to small changes in its input. Lipschitz constants are important in mathematical analysis, the theory of differential equations and other areas of mathematics and computer science. However, in general, it is computationally infeasible to find a Lipschitz constant of a given function f or even to verify that f is c-Lipschitz for a given number c.

We initiate the study of testing (which is a relaxation of the decision problem described above) and local reconstruction of the Lipschitz property of functions. A property tester, given a proximity parameter epsilon, has to distinguish functions with the property (in this case, Lipschitz) from functions that are epsilon-far from having the property, that is, differ from every function with the property on at least an epsilon fraction of the domain. A local filter reconstructs a desired property (in this case, Lipschitz) in the following sense: given an arbitrary function f and a query x, it returns g(x), where the resulting function g satisfies the property, changing f only when necessary. If f has the property, g must be equal to f.

We design efficient testers and local reconstructors for functions over domains of the form {1,…,n}^d, equipped with L1 distance, and give corresponding lower bounds. The testers that we developed have applications to program analysis. The reconstructors have applications to data privacy. The application to privacy is based on the fact that a function f of entries in a database of sensitive information can be released with noise of magnitude proportional to a Lipschitz constant of f, while preserving the privacy of individuals whose data is stored in the database (Dwork, McSherry, Nissim and Smith, TCC 2006). We give a differentially private mechanism, based on local filters, for releasing a function f when a purported Lipschitz constant of f is provided by a distrusted client.
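The privacy connection at the end of the abstract is the Laplace mechanism of Dwork, McSherry, Nissim and Smith: release f(database) plus Laplace noise scaled to a Lipschitz constant (sensitivity) of f. A minimal sketch, with function and variable names of my own choosing:

```python
import math
import random

def sample_laplace(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Noise of scale sensitivity/epsilon gives epsilon-differential privacy
    # when `sensitivity` upper-bounds how much changing one individual's
    # record can change the answer (i.e., a Lipschitz constant of f).
    return true_answer + sample_laplace(sensitivity / epsilon)

# Example: a counting query has sensitivity 1, since one person can change
# the count by at most 1.
random.seed(0)
database = [0, 1, 1, 0, 1, 1, 0, 1]
noisy_count = laplace_mechanism(sum(database), sensitivity=1, epsilon=0.5)
print(noisy_count)
```

This is why a wrong (too small) Lipschitz constant supplied by a distrusted client is dangerous: the noise would be under-scaled, and the talk's filter-based mechanism is designed to guard against exactly that.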

20 Sep

Michael pointed me to the hunch.com “Twitter Predictor”. It is a remarkable piece of software — if you have a twitter account, I suggest you try it out.

It takes a look at the publicly available twitter graph: your twitter profile, and the profiles of the people you follow and those who follow you. It then predicts how you will answer a wide variety of personal questions about yourself, ranging from your technological prowess, to your beliefs about religion, the death penalty, abortion, firearms, superstitions, and all manner of other seemingly unrelated things.

So how well does it work? Remarkably well.

When I tried it out, the “Twitter Predictor” asked me 20 questions and guessed my answers correctly on all 20 of them. Many of these might be considered “sensitive information”, involving my views on religion, abortion, and other hot-button issues. This is all the more remarkable given how little information they really had about me. I am not what you would call an active twitter user. I have made a grand total of 26 tweets, the last of which was in 2008. The plurality of these tweets are invitations for people to join me for lunch. I only follow 23 people, and needless to say, I haven’t “followed” anyone that I have met since 2008.

So what does this mean for privacy? I think that it is a compelling demonstration of the power of correlations in data. You might think that making the twitter graph public is innocuous enough, without realizing that it contains information beyond that. I might not mind if everyone knows who the 23 people I knew in grad school who signed up for twitter accounts in 2008 are, but I might mind if everyone knows who I voted for in the last election.

This all just serves to illustrate the problems with trying to partition the world into “sensitive information” and “non-sensitive information”, or with assuming that data analysts don’t know anything else about the world except what you have told them…

## Netflix De-anonymizing Attack Report (Mehdi Nikkhah)

15 Sep

As you all know, the paper “Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)” by Arvind Narayanan and Vitaly Shmatikov is about de-anonymizing datasets that are mistakenly believed to be private, using a small amount of auxiliary information. The main focus of the paper is on creating a model and proposing two algorithms for this purpose. The model is robust to sparsity and to perturbation added to the dataset (either on purpose or as background noise), while still producing high-precision results.

In the model they define a similarity function, Sim(), to measure the amount of common data between different records, and a scoring function, Score(), to rank how similar each record is to the auxiliary data the adversary owns. The scoring function is used to find the record most similar to the given auxiliary information.

They then introduce two algorithms, which differ in that the first treats all data uniformly, while the second uses a weighting process that tries to capture rare events and give them high weight. The example the authors use is as follows:

“it is more useful to know that the target has purchased “The Dedalus Book of French Horror” than the fact that she purchased a Harry Potter book”
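The weighted scoring idea can be sketched as follows. All names here are mine, not the paper's, and the particular weighting (inverse log of an item's popularity) is only one way to realize the idea that rare items should dominate the score:

```python
import math

# Sketch of similarity scoring in the style of Narayanan-Shmatikov. Each
# record maps items (movies) to ratings; the auxiliary information is a
# partial record the adversary knows about the target.

def inverse_frequency_weights(records):
    # Rare items get high weight: an obscure movie identifies a person far
    # better than a blockbuster. (1 + count) avoids division by log(1) = 0.)
    counts = {}
    for rec in records:
        for item in rec:
            counts[item] = counts.get(item, 0) + 1
    return {item: 1.0 / math.log(1 + c) for item, c in counts.items()}

def score(aux, record, weights):
    # Sum the weights of auxiliary items whose ratings match the record.
    return sum(weights[item] for item, rating in aux.items()
               if record.get(item) == rating)

records = [
    {"harry_potter": 5, "dedalus_horror": 4},
    {"harry_potter": 5},
    {"harry_potter": 4, "titanic": 3},
]
weights = inverse_frequency_weights(records)
aux = {"harry_potter": 5, "dedalus_horror": 4}
best = max(records, key=lambda r: score(aux, r, weights))
print(best)  # the record containing the rare "dedalus_horror" rating wins
```

Because "dedalus_horror" appears in only one record, matching on it contributes far more to the score than matching on the ubiquitous "harry_potter", which is exactly the authors' point in the quote above.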

At the end they apply the second algorithm to the released Netflix Prize dataset, which the Netflix administration believed to be private, as stated in the FAQ page of their website:

Q: “Is there any customer information in the dataset that should be kept private?”

A: “No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy, which you can review here. Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?”

The authors de-anonymized the dataset by cross-correlating it with users’ public IMDb (Internet Movie Database) ratings. Even without an oracle to confirm their matches, the results are precise: they found users who were exact matches across the two databases.

The conclusion of the paper is that one must be careful when defining privacy, to prevent future breaches of a system. Netflix did not realize that removing people’s names would not necessarily protect their privacy.

## Privacy breach at a Stanford Hospital

8 Sep

A timely article in the New York Times about why it is dangerous to make private data sets available in their original form, even to contractors who have signed confidentiality agreements.

## Welcome to the Algorithmic Foundations of Data Privacy

1 Sep

Welcome to CIS 800/002, The Algorithmic Foundations of Data Privacy at the University of Pennsylvania. The class will meet Tuesdays and Thursdays from 1:30-3:00 pm in Towne 315.

This class will serve as an introduction to research in differential privacy and is designed for graduate students wanting to familiarize themselves and contribute to research in this area. This course will be graded on two components: A semester-long research project, and participation, which will include contributing posts to this blog. Posts will include math, so make sure you can see the embedded $\LaTeX$ equations here:

$\Pr[M(D) \in S] \leq \exp(\epsilon)\Pr[M(D') \in S] + \delta$
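The inequality above (with $\delta = 0$) can be verified exactly for the simplest private mechanism, randomized response: report your true bit with probability $e^\epsilon / (1 + e^\epsilon)$, and lie otherwise. A quick numerical check:

```python
import math

# Verify the differential-privacy inequality for randomized response:
# answer truthfully with probability e^eps / (1 + e^eps), lie otherwise.
# With delta = 0, we need Pr[M(b) = s] <= e^eps * Pr[M(b') = s] for all
# true bits b, b' and outputs s.

eps = math.log(3)                       # truthful probability works out to 3/4
p = math.exp(eps) / (1 + math.exp(eps))

def pr(output, true_bit):
    # Probability that the mechanism outputs `output` on input `true_bit`.
    return p if output == true_bit else 1 - p

ok = all(
    pr(s, b) <= math.exp(eps) * pr(s, bp) + 1e-12  # slack for float rounding
    for s in (0, 1) for b in (0, 1) for bp in (0, 1)
)
print(ok)  # prints True
```

The bound is met with equality when s = b and s != bp, which is why randomized response with this truthful probability is exactly, and not just at most, $\epsilon$-differentially private.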

The first class will be on Thursday, September 8th. See you there!