RR:C19 Evidence Scale rating by reviewer:
Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.
It is possible that humans may source just as many viruses to animals as they receive via zoonotic transmission. Generalist viruses are characterized by lower barriers to cross-species spillover than their specialist counterparts. In this manuscript, the authors claim that (a) humans source more viruses to animal hosts than they receive via zoonotic transmission and that (b) generalist viruses face fewer mutational barriers to cross-species spillover than do specialist viruses. Additionally, they find (c) heterogeneity across viruses and virus families in the genes under selection for adaptation associated with host jumps.
In this manuscript, the authors take a novel approach to understanding the correlates of virus host jumps. To this end, they download a massive quantity of viral genomes from NCBI (11 million for the first broad analysis, then 54,000 for the investigation of drivers of host jumps), which they group into “viral cliques” using a novel network theory approach. They next focus on viral cliques for which at least 2 different host species have been identified and which comprise more than 10 genomes in total. Within each of these subsampled viral cliques, the authors then construct a maximum likelihood phylogeny, from which they carry out ancestral state reconstruction to predict the most probable host species of each internal node on the phylogeny. Using these phylogenies, they identify a “host shift” as a virus with differing hosts at the tip and the ancestral node. They use these differing node-tip pairs for downstream analysis, paired with a randomly selected subsample of non-jumping node-tip pairs within the same clique for comparison. They first use paired t-tests to compare the mean number of zoonotic vs. anthroponotic transitions within each virus clique. They then compare the mutational distance, as well as the dN/dS ratio, from tip to ancestor in host jump vs. non-jump pairs. They use Poisson regressions to identify predictor variables of host range (the total number of hosts identified within a clique), including mutational distance and dN/dS, and linear regressions to test for predictors of dN/dS across different genes within 4 key viral families: Coronaviridae, Paramyxoviridae, Rhabdoviridae, and Circoviridae.
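To make the first comparison concrete, here is a minimal sketch of how I understand the node-tip classification and the paired test on per-clique counts. The records, the names `NODE_TIP_PAIRS`, `count_directions` and `paired_t_statistic`, and the choice to ignore animal-to-animal jumps are my own assumptions for illustration; this is not the authors' code or data.

```python
from math import sqrt
from statistics import mean, stdev

HUMAN = "Homo sapiens"

# Hypothetical toy records: (clique, reconstructed ancestral host, tip host).
# Illustrative only; not taken from the manuscript's dataset.
NODE_TIP_PAIRS = [
    ("clique_A", HUMAN, "Sus scrofa"),
    ("clique_A", HUMAN, "Felis catus"),
    ("clique_A", "Rhinolophus affinis", HUMAN),
    ("clique_A", HUMAN, HUMAN),
    ("clique_B", HUMAN, "Canis lupus"),
    ("clique_B", HUMAN, "Neovison vison"),
    ("clique_B", "Sus scrofa", "Sus scrofa"),
    ("clique_C", "Gallus gallus", HUMAN),
    ("clique_C", HUMAN, "Gallus gallus"),
    ("clique_C", HUMAN, "Sus scrofa"),
    ("clique_C", HUMAN, "Felis catus"),
]


def count_directions(pairs):
    """Return {clique: (zoonotic, anthroponotic)} counts of putative host jumps."""
    counts = {}
    for clique, ancestor_host, tip_host in pairs:
        zoonotic, anthroponotic = counts.setdefault(clique, (0, 0))
        if ancestor_host == tip_host:
            continue  # same host at node and tip: not a host jump
        if tip_host == HUMAN:
            zoonotic += 1  # animal ancestor, human tip
        elif ancestor_host == HUMAN:
            anthroponotic += 1  # human ancestor, animal tip
        # Assumption: animal-to-animal jumps are ignored in this comparison.
        counts[clique] = (zoonotic, anthroponotic)
    return counts


def paired_t_statistic(xs, ys):
    """Paired t statistic for xs - ys (undefined if all differences are equal)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


counts = count_directions(NODE_TIP_PAIRS)
zoonotic = [z for z, _ in counts.values()]
anthroponotic = [a for _, a in counts.values()]
t_stat = paired_t_statistic(anthroponotic, zoonotic)
print(counts, t_stat)
```

A sketch like this also makes the sampling concern visible: every count depends on the reconstructed ancestral host, so any bias in the reconstruction propagates directly into the test.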
The broad conclusions of the paper can be summarized as:
1. The authors find support for more frequent cases of human-to-animal spillover than vice versa.
2. The authors demonstrate higher mutational distance and higher dN/dS in virus pairs associated with host jumps vs. non-jumps.
3. The authors additionally find that the mutational threshold and dN/dS ratio between node and tips in the case of host jumps decrease at broader host ranges, suggesting that generalist viruses have lower adaptation barriers for transmission to new hosts.
4. Next, the authors focus on 4 virus families to show that the strength of selection (dN/dS) during host jumps is strongest in structural genes for CoVs and paramyxoviruses and in auxiliary genes for rhabdoviruses and circoviruses. This result is contrary to their expectation that regions of the genome associated with virus entry would be under the strongest selection in cases of host switching.
5. Finally, the authors support a hypothesis that adaptive changes in the genome should be localized to regions of functional importance by showing that dN/dS is higher in the RBD than in other regions of the spike protein for the viral clique that includes SARS-CoV-2.
This manuscript considers a commendable number of virus genomes and undertakes novel approaches to virus taxonomic characterization which appear to represent significant advances in the field. The results are not hugely surprising: analyses #2-5 essentially demonstrate that host switching is more likely for generalist viruses, which are defined as viruses with large host ranges (i.e. those that have switched hosts many times). It is logical that mutational distances are larger when comparing two viruses that infect different host species than when comparing two viruses that infect the same host species; nonetheless, it is useful to confirm and quantify this.
In the case of analysis #1, the finding that humans transmit viruses to all other species more frequently than the inverse, I am uncomfortable with a conclusion derived entirely from ML phylogenies built with IQ-TREE. While this approach is the only practical solution for such an impressively large dataset, it would inspire more confidence to see this result tested using BEAST on a smaller subset of the data, to demonstrate that humans are not being reconstructed as the ancestral hosts in these phylogenies so frequently simply because, as the authors note, human-derived viruses account for 91% of the data. I would like to see them narrow the search a bit and, again, repeat some of the analyses with a Bayesian approach. For example, the authors could compare within a host clade (e.g. order): do humans transmit more viruses to rodents than rodents to humans? What about for bats? Or non-human primates? It also might be helpful to parse by mechanism of transmission: the authors offer some explanations in the Discussion for why humans might source more viruses to animals than the inverse, but many of them (e.g. agricultural runoff) are only plausible for environmentally transmitted pathogens.
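As an illustration of the per-order comparison suggested above, the following is a minimal sketch that tallies the direction of putative jumps by host order. The `HOST_ORDER` lookup, the `JUMPS` records and the function name `jumps_by_order` are hypothetical; a real analysis would need a full host taxonomy and the authors' reconstructed node-tip pairs.

```python
from collections import defaultdict

HUMAN = "Homo sapiens"

# Hypothetical species-to-order lookup; illustrative only.
HOST_ORDER = {
    "Rattus norvegicus": "Rodentia",
    "Mus musculus": "Rodentia",
    "Rhinolophus affinis": "Chiroptera",
    "Macaca mulatta": "Primates",
}

# Hypothetical (ancestral host, tip host) pairs; not real data.
JUMPS = [
    (HUMAN, "Rattus norvegicus"),
    (HUMAN, "Mus musculus"),
    ("Mus musculus", HUMAN),
    ("Rhinolophus affinis", HUMAN),
    ("Rhinolophus affinis", HUMAN),
    (HUMAN, "Macaca mulatta"),
    (HUMAN, HUMAN),                         # not a jump
    ("Mus musculus", "Rattus norvegicus"),  # animal-to-animal, ignored here
]


def jumps_by_order(jumps):
    """Tally human-to-animal and animal-to-human jumps per non-human host order."""
    tally = defaultdict(lambda: {"to_human": 0, "from_human": 0})
    for ancestor_host, tip_host in jumps:
        if ancestor_host == tip_host:
            continue
        if tip_host == HUMAN and ancestor_host in HOST_ORDER:
            tally[HOST_ORDER[ancestor_host]]["to_human"] += 1
        elif ancestor_host == HUMAN and tip_host in HOST_ORDER:
            tally[HOST_ORDER[tip_host]]["from_human"] += 1
    return dict(tally)


by_order = jumps_by_order(JUMPS)
for order, directions in sorted(by_order.items()):
    print(order, directions)
```

If the excess of human-sourced jumps held within each order, and not only in the pooled data, the conclusion would be considerably more convincing.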
Given the likely gaps in the phylogenies, I am also curious how the role of intermediate hosts was considered here. For example, was mutational distance measured for SARS-CoV-1 between civet and human host sequences? And, by contrast, was it measured between bat and human host sequences for SARS-CoV-2? In the case of the latter, given that a possible intermediate host could be missing from the dataset, how might this bias conclusions? Certainly, I would expect the latter mutational distance to be larger. How is this accounted for?
Finally, the GitHub repository currently has no instructions posted in the README file and appears to be missing the associated datasets. Due to the large dataset and complex analyses, it is sometimes hard to tell in this manuscript exactly what analysis was performed. For example, is dN/dS calculated by comparing an ancestral state sequence to a tip? Thus, it would be helpful to have a more complete and reproducible GitHub repository to work with.
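To show the kind of detail I would like documented, here is a minimal sketch of one way an ancestor-to-tip comparison could be computed. It is a simplified, uncorrected pN/pS count in the spirit of Nei-Gojobori, and it is my assumption rather than the authors' method: codons that differ at more than one position are skipped, changes to stop codons count as nonsynonymous sites, and no multiple-hit correction is applied. The function names and toy sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Number of synonymous sites in a codon (0 to 3)."""
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                sites += 1 / 3
    return sites


def pn_ps(ancestor, tip):
    """Return (pN, pS, pN/pS) for two aligned, in-frame coding sequences."""
    if len(ancestor) != len(tip) or len(ancestor) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, len(ancestor), 3):
        a, t = ancestor[i:i + 3], tip[i:i + 3]
        if a not in CODON_TABLE or t not in CODON_TABLE:
            continue  # gaps or ambiguous bases
        if "*" in (CODON_TABLE[a], CODON_TABLE[t]):
            continue  # skip stop codons
        n_changes = sum(x != y for x, y in zip(a, t))
        if n_changes > 1:
            continue  # simplification: multi-hit codons are not resolved
        s = (synonymous_sites(a) + synonymous_sites(t)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if n_changes == 1:
            if CODON_TABLE[a] == CODON_TABLE[t]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    p_s = syn_diffs / syn_sites if syn_sites else None
    p_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else None
    ratio = p_n / p_s if p_s and p_n is not None else None
    return p_n, p_s, ratio


# Hypothetical reconstructed ancestor and tip: one synonymous change
# (GCT -> GCC) and one nonsynonymous change (AAA -> AGA).
p_n, p_s, ratio = pn_ps("ATGGCTAAA", "ATGGCCAGA")
print(p_n, p_s, ratio)
```

Stating choices such as these in the README (which sequence pairs are compared, how multi-hit codons are handled, whether a correction is applied) would remove most of the ambiguity.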