RR:C19 Evidence Scale rating by reviewer:
Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.
***************************************
Review:
In this manuscript, the authors attempt to identify SARS-CoV-2 mutations associated with severe outcomes by analyzing SARS-CoV-2 sequences deposited in GISAID between January and October 2020. This is important for understanding the variability in disease outcomes among COVID-19 patients. The authors identified certain mutations that were associated with severe cases. However, the results should be adjusted for patients’ age, gender, and associated comorbidities (if available). In other words, how can the authors be sure that the severity of symptoms is due to viral mutations and not to host-related factors such as age, gender, and comorbidities? The 2,870 severe cases should therefore be further stratified by these other risk-determining factors (age and underlying comorbidities). The analysis should compare age- and gender-matched mild and severe cases to exclude the effect of host-related factors. Severe cases should also be sub-grouped based on their comorbidities.
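To illustrate the kind of adjustment meant here, the following is a minimal sketch (not the authors' method) of a Mantel-Haenszel pooled odds ratio for a single mutation across host strata such as age band and gender. The function name `mantel_haenszel_or`, the record layout, and the stratum labels are assumptions made for illustration; the GISAID metadata would first need to be mapped into this form.

```python
from collections import defaultdict


def mantel_haenszel_or(records):
    """Pooled odds ratio of severe outcome given a mutation, across strata.

    records: iterable of (stratum, has_mutation, severe) tuples, where
    stratum is any hashable label (hypothetical example: ("60+", "F")).
    """
    # Per-stratum 2x2 table stored as [a, b, c, d]:
    #   a = mutation & severe,     b = mutation & mild,
    #   c = no mutation & severe,  d = no mutation & mild
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, has_mutation, severe in records:
        if has_mutation:
            index = 0 if severe else 1
        else:
            index = 2 if severe else 3
        tables[stratum][index] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n

    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator
```

If the pooled odds ratio moves toward 1 relative to the crude (unstratified) odds ratio, that would suggest that host factors explain part of the apparent association between the mutation and severity.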
Overall, the focus of the study is clear, but extensive editing of English language and style is required, particularly for the introduction and results. The approach followed for sample inclusion/exclusion and for sequence and mutation analysis is reasonable. The results, on the other hand, are not sufficiently described. More details are needed to support the conclusion(s), and further (yet simple) analysis may need to be conducted. Detailed information regarding mutation types (synonymous vs non-synonymous), their specific genomic position (gene), and their prevalence (severe vs mild cases) should be added to the results section. The genomic position of severity-associated mutations may help explain their possible effect. Moreover, the clades and mutations used for the prediction analysis should be indicated clearly in the text.
The CDC has already identified clades of concern (variants of concern) based on several factors, including increased transmissibility and pathogenicity. Therefore, it is necessary to identify and compare the prevalence of these clades between severe and mild cases before proceeding to the prediction analysis. Then, a similar prediction analysis should be repeated using one variant/clade per analysis to confirm the findings. This will further help to identify clades and/or variants associated with severe cases. The discussion is generally well-written and explains the findings in light of published data. The authors are also aware of the limitations of their study.
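The clade comparison requested above could be sketched as follows. This is an illustrative outline under stated assumptions, not a prescribed pipeline: the function names, the `(clade, severe)` input layout, and the choice of a two-sided Fisher's exact test are all assumptions of this sketch. The test is written out with the standard library only; `scipy.stats.fisher_exact` would be an alternative.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is at most as likely as the observed one.
    p_value = sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def compare_clade_prevalence(samples):
    """samples: iterable of (clade, severe) pairs.

    Returns {clade: (prevalence_in_severe, prevalence_in_mild, p_value)}.
    """
    samples = list(samples)
    severe = Counter(clade for clade, is_severe in samples if is_severe)
    mild = Counter(clade for clade, is_severe in samples if not is_severe)
    n_severe, n_mild = sum(severe.values()), sum(mild.values())
    if n_severe == 0 or n_mild == 0:
        raise ValueError("need at least one severe and one mild sample")

    results = {}
    for clade in sorted(set(severe) | set(mild)):
        a, c = severe[clade], mild[clade]          # in clade: severe, mild
        b, d = n_severe - a, n_mild - c            # not in clade
        p_value = fisher_exact_two_sided(a, b, c, d)
        results[clade] = (a / n_severe, c / n_mild, p_value)
    return results
```

Because one test is run per clade, the resulting p-values would need a multiple-testing correction before any clade is reported as associated with severity.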
Also, there are some minor points that the authors should address, including:
-The number of sequences indicated in the abstract is misleading. The authors should indicate the actual number of sequences analyzed (n = 3,637 sequences).
-Line 217: More relevant references/examples could be mentioned here. Similar scenarios were seen in other RNA viruses (e.g., following the 1968 H3N2 and 2009 H1N1 influenza pandemics).
-Some figures (Figures 1A and 1B) are not cited in the text.
-Line 186: Can the authors clarify why this number of variants (4,499) was selected for the model testing?
-The authors mention that they used SnpEff to annotate mutations. They could have further utilized this tool to predict the impact of these mutations.