Seth Stephens-Davidowitz’s book, Everybody Lies, isn’t about lying. Rather, it’s about using large-n data from Google searches and social network profiles to shed light on social questions.
I have separated my discussion of the book into two parts:
- Part 1: Distillation of Useful Information for Data Scientists (this post)
- Part 2: Dangerous Methodology and Why it Matters
Short version of my review: The book itself is a breezy read, and it has some takeaways you can use in your own data science initiatives. Please don’t take the findings seriously, though. The statistical rigor of Stephens-Davidowitz’s studies ranges from questionable to unequivocally poor (I’ll share some examples).
Part I: Distillation of Useful Information for Data Scientists
Tools from Google for Large-N Data
Everybody Lies discusses three tools that you can use today to collect and analyze large-n data.
- Google Trends: Google Trends allows you to look at the history of aggregated searches for words and phrases. Stephens-Davidowitz explains how he used this tool to analyze the volume of racist searches in the United States. The short version: racist searches are alive and well in the United States, jump when black people are featured in the news, and correlate well, by county, with the wage gap between white and black employees.
- Google Correlate: Google Correlate can show you correlations between aggregate search data and real-world datasets about geography, the economy, et cetera. For example, Stephens-Davidowitz discusses a positive correlation between unemployment in the United States and search volume for porn sites and Spider Solitaire.
- Google NGram: Google NGram allows you to search for the incidence of words in printed books between 1800 and 2000. You can use the tool to determine how often a word was used at different points in time and how it spread, proxied through print media.
Using Searches for Small-N Data
The above tools allow you to look at search data in the anonymized aggregate. Stephens-Davidowitz also discusses the use of specific individuals’ non-anonymized search history for some small-N studies.
For example, one set of researchers looked for patterns in the search history of patients diagnosed with pancreatic cancer compared to patients with other diagnoses. They found what appears to be a higher risk of pancreatic cancer following a few series of searches: back pain followed by yellowing skin, for example, and indigestion followed by abdominal pain. It should be noted that I do not know the statistical power of the study discussed, and statistical power is often highly suspect in small-N studies. So the results could be due to random chance even if they were deemed statistically significant.
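The power concern above can be made concrete with a quick stdlib-only simulation. This is my own illustrative sketch, not an analysis of the actual pancreatic cancer study: it assumes a modest true effect and a small sample, and estimates how often such a study would reach significance at all.

```python
# Rough sketch of why small-N studies are underpowered: with a modest true
# effect and few subjects, most simulated studies fail to reach significance.
# All numbers here are illustrative assumptions, not from the book.
import random
import statistics

random.seed(0)
n_per_group = 15      # a small-N study: 15 subjects per group
true_effect = 0.5     # assumed modest effect size, in standard-deviation units
n_studies = 2000      # how many hypothetical studies to simulate

hits = 0
for _ in range(n_studies):
    control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [random.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    # Two-sample t statistic with a pooled standard deviation
    pooled_sd = statistics.stdev(control + treated)
    se = pooled_sd * (2 / n_per_group) ** 0.5
    t = (statistics.mean(treated) - statistics.mean(control)) / se
    if abs(t) > 2.05:  # approximate critical t for df = 28, alpha = 0.05
        hits += 1

power = hits / n_studies
print(f"estimated power: {power:.2f}")
```

With these assumptions, the estimated power lands far below the conventional 0.80 target, which is the sense in which an underpowered study’s “significant” findings deserve extra suspicion.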
Strengths of Online Data Collection
Stephens-Davidowitz discusses some strengths of data collected from the web:
- There is a lot of it, which can help us look at trends in the aggregate.
- It’s the real world: there is no need to worry about people behaving differently than they normally would in an experiment, because the data comes from their normal activities.
- All the web is a lab: A/B tests have shaped the UIs of much of the internet, and we can devise similar tests to gather behavioral data on internet users for social science research.
Limitations of Using Internet Data
The book also discusses some limitations of data collected from the web:
- The curse of dimensionality: if you measure enough things, something will correlate by random chance. More data makes it easier for us to fall into this trap.
- More data also gives us more opportunities to come up with poor proxies that fail to measure what we want to measure. Take the use of student test scores as a proxy for teaching quality. In this case, a poor proxy not only skewed the data; policies built on those scores encouraged teachers to teach to the test rather than improve their teaching.
- What’s focal is causal: when something has a metric, we treat it as more important than things we can’t measure. This, again, helps us jump to wrong conclusions. It’s a key reason for including small-N qualitative studies and subject matter experts in research design.
Before we move on, I’d like to add a limitation to this list: the questionable validity of the data itself.
Stephens-Davidowitz makes a sound claim that search data provides us with a new and potentially useful source of information for performing retrospective observational studies. That said, he gives the impression that the data is much more reliable than we know it to be.
He refers to search data as ‘digital truth serum’, a phrase he uses to embody the premise that people’s search habits reveal who they really are better than surveys do (and people are known to lie on surveys). This is probably true to some degree. It still seems shortsighted to take search information at face value.
Why? People make all kinds of searches, pretending to be all kinds of people. Take one anecdotal example from me, a hypochondriac. Anytime I go to a doctor or self-diagnose with a symptom, I jump to the worst possible conclusion and begin googling as if that is, in fact, what I have. If other people are like me in this regard, our search data won’t help determine who really has a given condition, because many people searching for it do not have it. Similarly, in the past three days, I have searched for ‘home prices Ypsilanti,’ ‘how to press flowers’, and ‘ingredients in glass noodles.’ I have no intention of buying a home in Ypsilanti, pressing flowers, or making glass noodles. I was just curious.
Is my search behavior just one oddball example, or does it represent a larger group of Google surfers? We don’t know. When we use search data to draw conclusions, we must, as good scientists, consider that this is a data source we know very little about, both at the individual level and in the aggregate.
We’ll talk more about study methodology in the second part of this review: Dangerous Methodology and Why it Matters.