
Review 1: "Are We There Yet? Big Data Significantly Overestimates COVID-19 Vaccination in the US"

Published on Apr 14, 2022
This Pub is a Review of
Unrepresentative Big Surveys Significantly Overestimate US Vaccine Uptake

Abstract: Surveys are a crucial tool for understanding public opinion and behavior, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the impact of survey bias – an instance of the Big Data Paradox [1]. Here we demonstrate this paradox in estimates of first-dose COVID-19 vaccine uptake in US adults: Delphi-Facebook [2,3] (about 250,000 responses per week) and Census Household Pulse [4] (about 75,000 per week). By May 2021, Delphi-Facebook overestimated uptake by 17 percentage points and Census Household Pulse by 14, compared to a benchmark from the Centers for Disease Control and Prevention (CDC). Moreover, their large data sizes led to minuscule margins of error on the incorrect estimates. In contrast, an Axios-Ipsos online panel [5] with about 1,000 responses following survey research best practices [6] provided reliable estimates and uncertainty. We decompose observed error using a recent analytic framework [1] to explain the inaccuracy in the three surveys. We then analyze the implications for vaccine hesitancy and willingness. We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters far more than data quantity, and compensating the former with the latter is a mathematically provable losing proposition.

RR:C19 Evidence Scale rating by reviewer:

  • Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.



Summary of the Report:

During the COVID-19 pandemic, policymakers have relied on accurate surveys to guide their public health efforts. Although several large-scale online surveys track vaccine-related issues in the U.S., substantial discrepancies have been observed among those surveys, and between them and the benchmark data provided by the Centers for Disease Control and Prevention (CDC). This phenomenon recalls the Big Data Paradox: as the sample size increases, confidence intervals shrink but the impact of survey bias is magnified, so apparent precision grows without a matching gain in accuracy (Meng 2018). The authors used the framework proposed in Meng 2018 to investigate the sources of error in the vaccine uptake estimates from those large-scale surveys and to perform a data quality-driven scenario analysis for vaccine willingness and hesitancy.

The authors first analyzed the vaccine uptake data, treating the data provided by the CDC as ground truth. This allowed them to decompose the estimation error into three components: data quality, data quantity, and problem difficulty. They showed that the major driver of the error is data quality, as measured by the data defect correlation (ddc). Using the ddc estimated from vaccine uptake, the authors performed a quality-driven scenario analysis for vaccine willingness and hesitancy, for which no ground truth is available to calculate the ddc directly.
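The three-part decomposition can be illustrated with a minimal Python sketch of the identity from Meng 2018, error = ddc × sqrt((N − n)/n) × σ, where the three factors correspond to data quality, data quantity, and problem difficulty. The sample size (250,000) and the 17-point overestimate come from the abstract; the population size `N` and the benchmark uptake `p_true` are illustrative assumptions, not figures from the paper, and the function names are hypothetical. The effective sample size uses the approximation n_eff ≈ n / (N − n) / ddc², which ignores the finite population correction.

```python
import math


def data_defect_correlation(error, n, N, sigma):
    """Solve error = ddc * sqrt((N - n) / n) * sigma for the ddc."""
    return error / (math.sqrt((N - n) / n) * sigma)


def effective_sample_size(ddc, n, N):
    """Approximate size of a simple random sample with the same
    mean squared error as the biased sample of size n."""
    return n / (N - n) / ddc**2


N = 255_000_000  # ASSUMPTION: rough size of the US adult population
n = 250_000      # weekly Delphi-Facebook responses (from the abstract)
p_true = 0.60    # ASSUMPTION: illustrative benchmark uptake
error = 0.17     # overestimate in May 2021 (from the abstract)

# Problem difficulty: standard deviation of a binary outcome.
sigma = math.sqrt(p_true * (1 - p_true))

ddc = data_defect_correlation(error, n, N, sigma)
n_eff = effective_sample_size(ddc, n, N)

print(f"ddc = {ddc:.4f}, effective sample size = {n_eff:.1f}")
```

Under these assumed inputs, a ddc of roughly one percent is enough to reduce a quarter of a million responses to an effective sample size in the single digits, which is the order of magnitude the abstract describes.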

Specific Comments:

To estimate hesitancy and willingness using the uptake ddc, the authors assume $\sigma_V^2 \approx \sigma_H^2 \approx \sigma_W^2$, which is a reasonable assumption between February and May 2021. However, this assumption may hold less well when the uptake rate is extremely low or high, as it was before early February and after late May 2021.

The authors proposed a simple linear relationship that links hesitancy, willingness, and uptake on the ddc scale by introducing a tuning parameter λ. By varying λ, the study posed three scenarios under which the estimates of hesitancy and willingness were revised. The authors claimed that the revised estimates from the Facebook and Household Pulse data (Fig 4) show less discrepancy than the original estimates in Fig 1. However, the results in Fig 1 are at the state level and based on the wave in late March, while the results in Fig 4 are at the national level, range from January to late May 2021, and highlight the result on the last date as a gray point. Because the two figures are not directly comparable, the claim needs further justification. The authors also acknowledged that the proposed analysis framework is unable to determine the optimal scenario. It would be interesting to explore whether the ddc-revised estimates reduce the state-level difference between the two surveys, and whether the state-level correlation could serve as a metric for finding the optimal λ value.
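The role of λ in this comment can be sketched in a few lines of Python. Because every respondent is either vaccinated, willing, or hesitant, the three estimation errors must sum to zero; under the equal-variance assumption discussed above, the same holds for the three ddcs. The sketch assumes the allocation ddc_H = −λ·ddc_V and ddc_W = −(1 − λ)·ddc_V, which satisfies that constraint but may not match the paper's exact parametrization. The function name and all numeric inputs are hypothetical.

```python
def revise_estimates(obs_hesitant, obs_willing, uptake_error, lam):
    """Reallocate the uptake error to hesitancy and willingness.

    ASSUMPTION: equal variances across the three outcomes, so the
    errors scale with the ddcs: err_H = -lam * err_V and
    err_W = -(1 - lam) * err_V.
    """
    err_hesitant = -lam * uptake_error
    err_willing = -(1.0 - lam) * uptake_error
    # A revised estimate is the observed estimate minus its error.
    return obs_hesitant - err_hesitant, obs_willing - err_willing


# Hypothetical survey shares: uptake 0.75, hesitant 0.15, willing 0.10,
# with uptake overestimated by 0.17.
results = {}
for lam in (0.0, 0.5, 1.0):
    hesitant, willing = revise_estimates(0.15, 0.10, 0.17, lam)
    results[lam] = (hesitant, willing)
    print(f"lambda={lam:.1f}: hesitant={hesitant:.3f}, willing={willing:.3f}")
```

Whatever value λ takes, the revised hesitant and willing shares add up to the same total; λ only decides how the uptake overestimate is split between them, which is why an external criterion such as the state-level agreement suggested above would be needed to choose among scenarios.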


This study sheds some light on the Big Data Paradox in large surveys of attitudes and behaviors toward COVID-19 vaccines and highlights the importance of data quality for achieving reliable results. The study also introduces a mechanism-driven framework for assessing hesitancy and willingness under different scenarios, which gives policymakers a better understanding of the potential range of the estimates.
