Michael pointed me to the hunch.com “Twitter Predictor”. It is a remarkable piece of software — if you have a twitter account, I suggest you try it out.
What it does is it takes a look at the publicly available twitter graph: your twitter profile, and the profiles of the people who you follow, and those that follow you. It then predicts how you will answer a wide variety of personal questions about yourself, ranging from your technological prowess, to your beliefs about religion, the death penalty, abortion, firearms, superstitions, and all matter of other seemingly unrelated things.
So how well does it work? Remarkably well.
When I tried it out, the “Twitter Predictor” asked me 20 questions and guessed my answers correctly on all 20 of them. Many of these might be considered “sensitive information”, involving my views on religion, abortion, and other hot-button issues. This is all the more remarkable given how little information they really had about me. I am not what you would call an active twitter user. I have made a grand total of 26 tweets, the last of which was in 2008. The plurality of these tweets are invitations for people to join me for lunch. I only follow 23 people, and needless to say, I haven’t “followed” anyone that I have met since 2008.
So what does this mean for privacy? I think that it is a compelling demonstration of the power of correlations in data. You might think that making the twitter graph public is innocuous enough, without realizing that it contains information beyond that. I might not mind if everyone knows who the 23 people I knew in grad school who signed up for twitter accounts in 2008 are, but I might mind if everyone knows who I voted for in the last election.
This all just serves to illustrate the problems with trying to partition the world into “sensitive information” and “insensitive information”, or with assuming that data analysts don’t know anything else about the world except what you have told them…