Enormous Value of Anonymously Collected Social Data
I don’t want to give the game away, but I have too many technical details on a (hopefully) imaginary conspiracy in the techno-thriller novel I’ve been writing to actually fit within its pages. I can’t resist mentioning some here.
One thing I have tried to explain is the enormous value of data collected anonymously, and I do mean anonymously. I’ve tried to explain how few people would supply what I want over the Internet, and why in-person collection seems essential. Then I tried to explain why elaborate methods are needed, such as giving respondents gloves to wear (to avoid leaving fingerprints) and proving to them that their answers cannot be seen in any way. Hard to do, very hard. I’ve taken page after page to explain this, but it’s just too much to put into a work of fiction. Here is why it is important:
Suppose you asked a diverse set of 10,000 people 101 questions. Among them, 100 are ordinary ones about interests, personality, educational background, and so on. But one question is highly charged, like “Have you ever molested a child?” If the respondents doubt their anonymity, they are most likely to answer ‘No’, regardless of the facts. On the other hand, if they truly believe nobody will ever know who provided the answers on their response pages, then many will risk telling the truth.
I also tried explaining ways of using cross-validation to weed out people who are fundamentally dishonest even in conditions of anonymity, but that was too technical for a work of fiction. Anyway, it doesn’t matter that much, as long as enough people feel free to tell the truth.
The next step is to train some piece of machine-learning software such as a neural network on the answers to the 100 innocuous questions, with the incriminating answer as the goal. As tests should verify, once trained, the mechanism (e.g. the trained network) would be useful for predicting the felonious behavior of an individual from the answers to 100 innocuous questions.
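The training step described above can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: the "survey" is randomly generated rather than real, it uses 2,000 made-up respondents and 20 yes/no questions instead of 10,000 and 100, and a plain logistic regression stands in for the neural network. The names `make_synthetic_survey`, `train_logistic`, and `predict` are hypothetical helpers invented for this example.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp never overflows on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_survey(n_people, n_questions, rng):
    """Fabricate yes/no answers plus one hidden 'sensitive' answer per person.

    ASSUMPTION: the sensitive answer depends on the innocuous answers through
    random hidden weights. Real survey data may show no such relationship.
    """
    true_w = [rng.gauss(0.0, 1.0) for _ in range(n_questions)]
    rows, labels = [], []
    for _ in range(n_people):
        x = [rng.randint(0, 1) for _ in range(n_questions)]
        # Center the score so roughly half of the synthetic labels are 1.
        z = sum(w * (xi - 0.5) for w, xi in zip(true_w, x))
        labels.append(1 if rng.random() < sigmoid(z) else 0)
        rows.append(x)
    return rows, labels


def predict(w, b, x):
    """Estimated probability that the sensitive answer is 'yes'."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_logistic(rows, labels, epochs=20, lr=0.05):
    """Fit logistic regression by stochastic gradient descent."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = predict(w, b, x) - y  # gradient of log loss w.r.t. the score
            b -= lr * err
            for j, xj in enumerate(x):
                w[j] -= lr * err * xj
    return w, b


rng = random.Random(0)
rows, labels = make_synthetic_survey(2000, 20, rng)

# Hold out the last 500 respondents to test the trained model.
train_x, test_x = rows[:1500], rows[1500:]
train_y, test_y = labels[:1500], labels[1500:]

w, b = train_logistic(train_x, train_y)
scores = [predict(w, b, x) for x in test_x]

accuracy = sum((s >= 0.5) == (y == 1) for s, y in zip(scores, test_y)) / len(test_y)
base_rate = sum(test_y) / len(test_y)
baseline = max(base_rate, 1.0 - base_rate)  # always guessing the majority answer

print(f"held-out accuracy: {accuracy:.3f}  majority baseline: {baseline:.3f}")
```

The comparison against the majority baseline is the "tests should verify" part: if the held-out accuracy is no better than always guessing the most common answer, the innocuous questions carry no usable signal. Even when it is better, the output is a statistical estimate about an individual, not a fact about them.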
It would probably not be hard to get potentially important people, job applicants, or criminal suspects to answer the 100 questions which don’t seem at all incriminating. Their answers could be used to estimate which of those people have done something terrible.
This could be useful to employers or law enforcement, but also very dangerous. In the novel I am trying to write, evil conspirators use this technique to find people susceptible to blackmail.
Now comes the tricky part. I think I can demonstrate that to filter out the felons it’s not necessary to ask people to answer the 100 apparently harmless questions. That information is already out there. I’m doing a bit better explaining the technique for accessing it, but it’s still pretty technical. I’ll put out an account of it later, if anyone seems interested in what I’ve written here.
The basic message I have for people trying to use data science in the social realm is that a response dataset from individuals who are entirely convinced of their anonymity and feel free to tell the truth is of enormous value. If it contains the right questions, it could be worth millions of dollars, many times what it cost to collect.
Consider even what Cambridge Analytica did, which influenced the US election. A more insidious way of doing this would be to ask an anonymous group of respondents 100 seemingly non-political questions and one political one, such as ‘Republican or Democrat’. A trained neural network could then be used to predict the political leanings of people asked only the 100 apparently non-political questions. This is a well-known technique, but users of data science often forget the importance of a truly anonymous dataset, collected from a very diverse group of people. Such a dataset could be of great value.
The basic message I have for people who like techno-thriller novels is that the use of such anonymous datasets could put enormous power in the hands of an evil conspiracy. My goal is to scare people with valid techniques of data science, without boring technical detail. But I love the details myself, and can’t resist writing about them, hence this story.
By the way, there is already enough data in publicly available datasets from social surveys to do this stuff. Those surveys never gave anonymity enough attention, but they probably gave it enough that the data could be used to reveal some nasty personal information, for purposes of blackmail or intimidation.