RR:C19 Evidence Scale rating by reviewer:
Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.
***************************************
Review:
Summary of the Report:
During the COVID-19 pandemic, policymakers have relied on accurate surveys to adjust their efforts on related public health issues. Although several large-scale online surveys on vaccine-related issues exist in the U.S., substantial discrepancies were observed among these surveys and between the surveys and the benchmark data provided by the Centers for Disease Control and Prevention (CDC). This phenomenon recalls the Big Data Paradox: when the data are biased, the accuracy of an estimate can decrease even as the sample size increases (Meng 2018). The authors used the framework proposed by Meng (2018) to investigate the sources of error in the vaccine uptake estimates from these large-scale surveys and to perform a data quality-driven scenario analysis of vaccine willingness and hesitancy.
The authors first analyzed the vaccine uptake data, treating the data provided by the CDC as ground truth. In this way, they decomposed the estimation error into three components: data quality, data quantity, and problem difficulty. They showed that the major driver of the error is data quality, as measured by the data defect correlation (ddc). Using the ddc estimated from vaccine uptake, the authors performed a quality-driven scenario analysis for vaccine willingness and hesitancy, for which no ground truth exists from which to calculate the ddc directly.
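For reference, the decomposition being described follows the identity in Meng (2018), written here as a sketch (the notation may differ slightly from the paper's):

Ȳ_n − Ȳ_N = ρ̂_{R,Y} · √((N − n)/n) · σ_Y,

where Ȳ_n is the survey estimate, Ȳ_N the population benchmark (here, the CDC figure), ρ̂_{R,Y} the data defect correlation (data quality), √((N − n)/n) the data quantity term for sample size n and population size N, and σ_Y the standard deviation of the outcome (problem difficulty).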
Specific Comments:
To estimate hesitancy and willingness using the uptake ddc, the authors assume σ²_V ≈ σ²_H ≈ σ²_W, which is a reasonable assumption between February and May 2021. However, this assumption may not hold when the uptake rate is extremely low or high, as it was before early February and after late May 2021.
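To make this caveat concrete (assuming uptake, hesitancy, and willingness are binary indicators, so each variance takes the form p(1 − p)): when the uptake rate V is near 0 or 1, σ²_V = V(1 − V) shrinks toward zero, while hesitancy and willingness may remain at moderate levels, so the approximation σ²_V ≈ σ²_H ≈ σ²_W can break down.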
The authors proposed a simple linear relationship linking hesitancy, willingness, and uptake on the ddc scale by introducing a tuning parameter λ. By varying λ, the study posed three scenarios under which the estimates of hesitancy and willingness were revised. The authors claimed that the revised estimates for the Facebook and Household Pulse data (Fig 4) show less discrepancy than the original estimates in Fig 1. However, the results in Fig 1 are at the state level and based on the survey wave in late March, while the results in Fig 4 are at the national level, span January to late May 2021, and highlight only the last date as a gray point; the claim would be better supported by a like-for-like comparison. The authors also acknowledged that the proposed analysis framework cannot determine which scenario is optimal. It would be interesting to explore whether the ddc-revised estimates reduce the state-level differences between the two surveys, and whether the state-level correlation could serve as a metric for selecting the optimal λ.
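To make the scenario-analysis step concrete, the following is a minimal Python sketch of how an assumed ddc can be plugged into the Meng (2018) identity to revise a survey proportion. The function name, the illustrative inputs, and the use of the observed proportion to approximate the problem difficulty are assumptions for exposition only; the sketch does not reproduce the paper's specific λ-based linkage between the uptake, hesitancy, and willingness ddc's.

import math

def ddc_adjusted_proportion(p_hat, n, N, ddc):
    """Revise a survey proportion using the Meng (2018) error decomposition.

    p_hat : observed survey proportion (e.g., reported willingness)
    n     : survey sample size
    N     : target population size
    ddc   : assumed data defect correlation for this outcome, e.g. taken
            from a scenario that links it to the uptake ddc via lambda
    """
    # Problem difficulty, approximated with the observed proportion; a more
    # careful treatment would solve for the revised proportion self-consistently.
    sigma = math.sqrt(p_hat * (1 - p_hat))
    # Data quantity term from the decomposition.
    quantity = math.sqrt((N - n) / n)
    # Estimated selection error: ddc * sqrt((N - n) / n) * sigma.
    error = ddc * quantity * sigma
    # Revised estimate, clipped to the unit interval.
    return min(max(p_hat - error, 0.0), 1.0)

# Hypothetical inputs purely for illustration (not values from the paper):
# 250,000 respondents out of roughly 255 million US adults, an observed
# proportion of 0.70, and an assumed ddc of 0.005.
print(ddc_adjusted_proportion(p_hat=0.70, n=250_000, N=255_000_000, ddc=0.005))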
Conclusion:
This study sheds light on the Big Data Paradox in large-scale surveys of attitudes and behaviors regarding COVID-19 vaccines and highlights the importance of data quality for achieving reliable results. It also introduces a mechanism-driven framework for assessing hesitancy and willingness under different scenarios, which gives policymakers a better understanding of the plausible range of the estimates.