Foundations of Data Privacy Course Blog

Netflix De-anonymizing Attack Report (Mehdi Nikkhah)


As you all know, the paper “Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)” by Arvind Narayanan and Vitaly Shmatikov is about de-anonymizing datasets that are mistakenly believed to be private, using only a small amount of auxiliary information. The main focus of the paper is on building a model and proposing two algorithms for this purpose. The model is robust to sparsity and to perturbation, whether added to the dataset on purpose or present as background noise, while still producing high-precision results.

In the model they define a similarity function, Sim(), which measures how much data two records have in common, and a scoring function, Score(), which ranks each record by its similarity to the auxiliary data the adversary owns. The scoring function is used to find the record most similar to the given auxiliary information.
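To make this concrete, here is a minimal sketch of how Sim() and Score() might look for movie-rating records. This is an illustration, not the paper’s exact formulation: records are assumed to be dicts mapping a movie to a (rating, date) pair, and the tolerance values are made up for the example.

```python
# Hypothetical record format: {movie: (rating, day_number)}.
# rating_tol and date_tol are illustrative tolerances, not the paper's.

def sim(a, b, rating_tol=0, date_tol=3):
    """1 if two (rating, date) values agree within the tolerances, else 0."""
    (ra, da), (rb, db) = a, b
    return int(abs(ra - rb) <= rating_tol and abs(da - db) <= date_tol)

def score(aux, record):
    """Fraction of the adversary's auxiliary attributes that record matches."""
    if not aux:
        return 0.0
    hits = sum(sim(aux[m], record[m]) for m in aux if m in record)
    return hits / len(aux)

def best_match(aux, dataset):
    """Return the id of the record most similar to the auxiliary data."""
    return max(dataset, key=lambda rid: score(aux, dataset[rid]))
```

Given a few auxiliary ratings with approximate dates, `best_match` returns the candidate whose record agrees with the most of them.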

They then introduce two algorithms, which differ slightly: the first treats all attributes in the data uniformly, while the second applies a weighting step that tries to capture rare events and give them high weight. The example the authors use is as follows:

“it is more useful to know that the target has purchased “The Dedalus Book of French Horror” than the fact that she purchased a Harry Potter book”
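The weighting idea can be sketched as follows: a movie rated by few records (like the Dedalus book above) gets a large weight, while a blockbuster gets a small one. The paper weights an attribute inversely to the logarithm of its support; the sketch below follows that idea but uses a simplified exact-match similarity and an illustrative `+ 1` inside the log to avoid division by zero.

```python
import math

def weights(dataset):
    """Weight each movie by 1/log(support), so rare movies count more."""
    support = {}
    for record in dataset.values():
        for movie in record:
            support[movie] = support.get(movie, 0) + 1
    # + 1 keeps the log positive when a movie appears in a single record
    return {m: 1.0 / math.log(c + 1) for m, c in support.items()}

def weighted_score(aux, record, wt):
    """Sum the weights of the auxiliary attributes that record matches exactly."""
    return sum(wt.get(m, 0.0) for m in aux
               if m in record and record[m] == aux[m])
```

With this scoring, matching on one obscure title can outweigh matching on several popular ones, which is exactly the intuition in the authors’ example.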

At the end, they apply the second algorithm to the released Netflix Prize dataset, which Netflix administration believed to be private, as stated in the following exchange on the FAQ page of their website:

Q: “Is there any customer information in the dataset that should be kept private?”

A: “No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy, which you can review here. Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn’t a privacy problem is it?”

The authors de-anonymized the dataset by cross-correlating it with users’ public IMDb (Internet Movie Database) ratings. Even in the absence of an oracle telling them they are right, the answer is still precise, as they found people whose records matched exactly across the two databases.
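One way the paper copes with the lack of an oracle is an “eccentricity” test: a candidate is declared a match only if its score stands out from the runner-up by a margin measured in standard deviations of all scores. A minimal sketch, with an illustrative threshold `phi`:

```python
import statistics

def eccentric_match(scores, phi=1.5):
    """Return the top-scoring id only if it clearly stands out, else None.

    scores: dict mapping candidate id -> score.
    phi: illustrative threshold on (top - runner-up) / stddev of scores.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    sigma = statistics.pstdev([s for _, s in ranked])
    if sigma == 0 or (s1 - s2) / sigma < phi:
        return None  # no record stands out enough: refuse to match
    return best
```

Refusing to answer when no record stands out is what keeps the reported matches precise even without ground truth to check against.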


The conclusion of the paper is that privacy must be defined carefully to prevent future breaches of the system. Netflix did not realize that removing people’s names would not necessarily protect their privacy.