As you all know, the paper “Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)” by Arvind Narayanan and Vitaly Shmatikov is about using a small amount of auxiliary information to de-anonymize datasets that are mistakenly believed to be private. The main focus of the paper is on building a model and proposing two algorithms for this purpose. The approach is robust to sparsity and to perturbation, whether added to the dataset on purpose or arising from background noise, and still produces high-precision results.
In the model they define a similarity function, Sim(), which measures how much data two records have in common. The model also uses a scoring function, Score(), which ranks each record by its similarity to the auxiliary data the adversary holds. This function is used to find the record most similar to the auxiliary information.
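To make the two functions concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' code: records are modelled as dictionaries from item to rating, `sim` counts two ratings as matching when they differ by at most a hypothetical tolerance `eps`, and `score` takes the weakest per-attribute similarity over the attributes the adversary knows, which is one plausible way to score uniformly. All names and the toy data are made up for the example.

```python
def sim(a, b, eps=1):
    """Hypothetical per-attribute similarity: 1.0 if both values exist and are within eps."""
    if a is None or b is None:
        return 0.0
    return 1.0 if abs(a - b) <= eps else 0.0


def score(aux, record, eps=1):
    """Uniform score: the weakest match over the attributes present in aux."""
    return min(
        (sim(value, record.get(item), eps) for item, value in aux.items()),
        default=0.0,  # an empty aux carries no evidence
    )


def best_match(aux, dataset, eps=1):
    """Return the id of the record that scores highest against aux."""
    return max(dataset, key=lambda rid: score(aux, dataset[rid], eps))


# Toy "anonymized" dataset: record id -> {item: rating}
dataset = {
    "u1": {"A": 5, "B": 3, "C": 5},
    "u2": {"A": 2, "B": 3},
    "u3": {"B": 4, "C": 1, "D": 5},
}
aux = {"B": 4, "C": 2}  # noisy partial knowledge about the target
scores = {rid: score(aux, rec) for rid, rec in dataset.items()}
```

With this toy data only `u3` is consistent with everything in `aux`, so `best_match(aux, dataset)` picks it even though the adversary's rating for item `C` is off by one.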
They then introduce two algorithms that differ slightly: the first treats all of the data uniformly, while the second weights the data so that rare events receive high weight. The authors give the following example:
“it is more useful to know that the target has purchased ‘The Dedalus Book of French Horror’ than the fact that she purchased a Harry Potter book”
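The effect of this weighting can be sketched in Python. This is an illustrative sketch, not the paper's exact algorithm: the weight `1 / log(1 + count)` is an assumed choice that simply makes rarely held items count for more, and the names `item_weight` and `weighted_score`, along with the toy purchase data, are hypothetical.

```python
import math
from collections import Counter


def item_weight(count):
    """Assumed weighting: the fewer records contain an item, the larger its weight."""
    return 1.0 / math.log(1 + count)


def weighted_score(aux, record, popularity):
    """Sum the weights of the items in aux that the record also contains."""
    return sum(item_weight(popularity[item]) for item in aux if item in record)


# Toy purchase data: every customer has the popular book, only one has the rare one.
dataset = {
    "u1": {"Harry Potter"},
    "u2": {"Harry Potter"},
    "u3": {"Harry Potter", "The Dedalus Book of French Horror"},
    "u4": {"Harry Potter"},
}
# How many records contain each item.
popularity = Counter(item for record in dataset.values() for item in record)

aux = {"Harry Potter", "The Dedalus Book of French Horror"}
scores = {rid: weighted_score(aux, rec, popularity) for rid, rec in dataset.items()}
best = max(scores, key=scores.get)
```

Matching on the popular title barely separates the candidates, because every record gets the same small contribution from it. The rare title contributes a much larger weight and singles out `u3`.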
Finally, they apply the second algorithm to the dataset released by Netflix, which Netflix believed to be private, as indicated by the following exchange on the FAQ page of its website:
Q: “Is there any customer information in the dataset that should be kept private?”
The authors de-anonymized the dataset by cross-correlating it with users' ratings on IMDb (the Internet Movie Database). Although there is no oracle to tell them whether they are right, the results are still precise, since they found people whose records matched exactly across the two databases.
The conclusion of the paper is that privacy must be defined carefully to prevent breaches in the future. Netflix did not realize that removing people's names does not necessarily protect their privacy.