Many strands are brought together in this smart review of the coming merger of social and search, a merger that will take much further the secondary role social already plays in some search engines.
It raises three core issues.
1. I am entirely happy that clever algorithms should put together the strands of my digital (and analog) life to my benefit. I am entirely unhappy if there is no clear way to keep this information entirely, and forever, private to me unless I choose to part with it for cash or some specific service.
2. I am also happy to have a tailored version of search operating in particular situations (so when I search “weather,” the first hit is my local weather, not the dictionary of meteorology). But in less obvious cases I want not only a choice between social search and, as it were, asocial search; I also want a flashing light to remind me that my private universe is being mined, not the universe out there.
3. There’s a fundamental distinction to be drawn between biographical, geographical, or personal preferences in matters of, say, food and music; and broad issues of information and opinion. So it is not at all OK that a Democrat should get a view of history and politics designed to be favorable to him or her; or that doubters of human causation of climate change should preferentially receive material favorable to their cause.
Search and Social: How The Two Will Soon Become One | TechCrunch.
I don’t think most people realize how much can potentially be inferred from these data. You might think you censor yourself well enough that someone could determine your brand preferences but nothing “personal.” That is probably false, and will soon be provably false.
Jason Hong, a professor at Carnegie Mellon University, has a research program that uses data from social networks to determine mental-health characteristics (here’s a job ad for a postdoc: http://www.cs.cmu.edu/~jasonh/postdoc-ubicomp-mental-health.pdf). The good news is that this is visible research. Jason is an acquaintance of mine, and I know him to have exemplary ethics; I don’t think he will be developing this to allow, say, a company to search a private list so that it can avoid hiring someone with depression. However, if this public research is successful, it is going to lead to that very problem. Just as polygraphs were misused by interviewers at times in the past (and still are in the US security clearance process), there is potential for misuse of this technology.
The reason it’s possible to determine something private (say, depression) from public data is that we can measure the similarities between two people (and between lots of people) using a variety of data analysis techniques involving set theory and linear algebra (I’ll leave the technical details for a later post on my blog). With enough density of data, we can determine the factors that in aggregate are likely to indicate depression (e.g., perhaps high or low posting levels, terms used, the color of your avatar, etc.) by using a relatively small set of people whose status is known. From that, we can see whether you (yes, you personally) are similar to the model we’ve created. Is your posting pattern similar to what we know a depressed person might do? You might have depression. We can even potentially estimate the likelihood that you have depression, that you have a clinical diagnosis of depression, that you are in some form of treatment (as well as the form of treatment itself), and perhaps that you will have some sort of “breakdown.”
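To make the mechanics concrete, here is a minimal sketch of the kind of similarity-based inference described above. The features, the made-up labeled examples, and the nearest-centroid-with-cosine-similarity approach are all illustrative assumptions of mine, not Professor Hong’s actual method.

```python
import numpy as np

# Hypothetical feature vectors: [posts_per_day, fraction_of_negative_terms,
# avatar_darkness]. Both the values and the labels are invented for illustration.
labeled_features = np.array([
    [12.0, 0.40, 0.9],   # person known to have depression
    [10.0, 0.35, 0.8],   # person known to have depression
    [ 5.0, 0.10, 0.2],   # person known not to have depression
    [ 6.0, 0.12, 0.3],   # person known not to have depression
])
labels = np.array([1, 1, 0, 0])  # 1 = depressed, 0 = not

# Standardize features so no single scale dominates the similarity.
mean, std = labeled_features.mean(axis=0), labeled_features.std(axis=0)
standardized = (labeled_features - mean) / std

# Build one "model" per group: the centroid of its labeled examples.
centroid_pos = standardized[labels == 1].mean(axis=0)
centroid_neg = standardized[labels == 0].mean(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def likelihood_depressed(raw_features):
    """Crude score in [0, 1]: how much closer this person's behavior is to the
    'depressed' centroid than to the 'not depressed' one. Not a diagnosis."""
    x = (np.asarray(raw_features, dtype=float) - mean) / std
    sim_pos = cosine_similarity(x, centroid_pos) + 1.0  # shift into [0, 2]
    sim_neg = cosine_similarity(x, centroid_neg) + 1.0
    return sim_pos / (sim_pos + sim_neg)

# Is *your* posting pattern similar to the known-depressed group?
print(likelihood_depressed([11.0, 0.38, 0.85]))  # high score
print(likelihood_depressed([ 5.5, 0.11, 0.25]))  # low score
```

Even this toy version shows the point: a handful of labeled people plus public behavioral features is enough to attach a score to anyone whose data can be observed.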
Disturbed yet? It gets worse. These tools are what I wanted to (and did) use in my own dissertation (which ended up being about deriving networks of friends/colleagues from your email), so I proposed to several people that we make a statement about the sophistication of this research by finding people likely to be HIV+. My goal was not to expose these people to the public (their identities would have been thoroughly protected), but to show that we need to consider how these data are used.
I’m not going to go as far as to say that we need laws to protect people from the conclusions derived from these data, or to stop people from deriving potentially dangerous conclusions. That may be necessary, but the mechanisms for protection are beyond my areas of expertise.
I will also say that there are clear mechanisms for separating the conclusions from the data and isolating those conclusions for either anonymized aggregate analysis (e.g., how many potentially depressed people are there on Twitter?) or use by the individual only. I should also point out that these methods are predictive, not definitive; just because Professor Hong’s research shows you might be depressed does not mean that you must be depressed. However, that’s part of the problem: if Company X gets hold of your “maybe depressed” label, it can become a bias against you (“let’s raise his auto insurance or lower her credit score”), despite being inaccurate to a known or unknown degree.
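As a rough illustration of keeping conclusions at the aggregate level only, here is a sketch that releases a noisy count rather than any per-person score. The threshold and the Laplace-noise scale are my own illustrative choices, not a prescribed standard.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_count(per_user_scores, threshold=0.7, noise_scale=2.0):
    """Release only a noisy count of users scoring above the threshold,
    never the individual scores. Adding Laplace noise is one common way
    to blunt re-identification of any single user; the scale is illustrative."""
    count = int(np.sum(np.asarray(per_user_scores) > threshold))
    return count + rng.laplace(scale=noise_scale)

# e.g., scores produced by a model like the sketch above
scores = [0.82, 0.31, 0.74, 0.55, 0.91]
print(f"Estimated number of potentially depressed users: {aggregate_count(scores):.0f}")
```

The design choice is simply that individual predictions never leave the system; only a blurred population-level figure does.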
Thank you. It’s plain that as the quantity and kinds of data being aggregated grow, side by side with analytic sophistication, we ain’t seen nothing yet.
I remain somewhat confident that this situation will turn around, and that profitable business models will emerge that do not depend on the wholesale use and abuse of personal data. How and when is another matter. One contributor may be the failure of one of the major brands in a privacy-related scandal. The push of European regulators is also worth noting, as they are slowly setting global standards on their own…