“There was a darkness and hatred that was hidden from the traditional sources but was quite apparent in the searches people made.”
Big data has been much hyped as the next big thing in science. Everybody Lies sets out to show what can be done with big data that wasn’t possible before, while also acknowledging its shortcomings and the ways it can be complemented by traditional small-data collection techniques. Seth Stephens-Davidowitz argues that the Google dataset he has been working with is particularly valuable because, unlike respondents to even anonymous surveys, users have an incentive to be honest and little or no desire to impress anyone. To get the information they want from Google, they must query honestly about even the most taboo subjects, from sex to race to medical problems. Facebook, for example, is not nearly as useful, because people are consciously presenting a certain version of themselves to their friends. But if you want Google to bring you back the “best racist jokes,” you have to tell it so. You can’t hide and still get what you want. The result is a partial but unprecedented glimpse into the human mind.
I picked up this book to get the interesting facts that Stephens-Davidowitz learned from his analyses of this revealing dataset. That said, there is also plenty of basic introduction to data collection and research methodology, which might be a bit tedious for anyone who is already familiar with this material. I did appreciate the attention to basics when it came to statistical analysis, though, since that is an area where I don’t have the same background knowledge or experience. The author also spends a good bit of time convincing skeptics on one side that big data is useful and, on the other, warning evangelists of its limitations. A big dataset can actually be an encumbrance if you don’t know what questions to ask of it. However, I sometimes took issue with the way the author tried to make information accessible. Comparing a large dataset to your grandma’s lifetime of collected wisdom is more harmful than helpful, because only one of those things is based on verifiable numbers rather than impressions.
One subject that doesn’t get much attention in Everybody Lies is privacy. Stephens-Davidowitz notes that the Google datasets are anonymized, and that multiple sessions by the same user are not connected. He does reference an old Yahoo dataset that released the search histories of anonymized users, which enabled a different level of pattern detection between searches made by the same individual. Later in the book, he delves into the ethics of using pattern detection from large datasets in particular situations. For example, one study examined which words in a loan application (God, promise, will pay, thank you, hospital) are most indicative of a potential default on the loan. That study used the loan application itself, but what if in the future your suitability for a job were calculated by analyzing patterns in the language of anything and everything you’ve ever written publicly on the internet? The book only scratches the surface of potential privacy problems.
So what did Stephens-Davidowitz find that was interesting? A methodology for approximating the percentage of American men who are gay that accounts for those who are in the closet more accurately than any previous estimate. A behind-the-scenes look at where and when racist searches are highest that might explain discrepancies between polling numbers and Obama’s actual election results. The one telling Google search that also best predicted the current president’s success in any given electoral district, regardless of public polling numbers. These are the fascinating glimpses into the human psyche that I came for, but I had to read through a lot of other information of questionable interest to get at them.