Review 2: "Estimating the Reproduction Number and Transmission Heterogeneity from the Size Distribution of Clusters of Identical Pathogen Sequences"

Reviewers find the proposed method to be novel and validated with synthetic and historical epidemic data. However, they expressed concerns about the uncertainty in quantifying the magnitude of the estimation bias and the validity of this method in the case of an outbreak.

Published on Mar 13, 2024
This Pub is a Review of
Estimating the reproduction number and transmission heterogeneity from the size distribution of clusters of identical pathogen sequences

Abstract: Quantifying transmission intensity and heterogeneity is crucial to ascertain the threat posed by infectious diseases and inform the design of interventions. Methods that jointly estimate the reproduction number R and the dispersion parameter k have however mainly remained limited to the analysis of epidemiological clusters or contact tracing data, whose collection often proves difficult. Here, we show that clusters of identical sequences are imprinted by the pathogen offspring distribution, and we derive an analytical formula for the distribution of the size of these clusters. We develop and evaluate a novel inference framework to jointly estimate the reproduction number and the dispersion parameter from the size distribution of clusters of identical sequences. We then illustrate its application across a range of epidemiological situations. Finally, we develop a hypothesis testing framework relying on clusters of identical sequences to determine whether a given pathogen genetic subpopulation is associated with increased or reduced transmissibility. Our work provides new tools to estimate the reproduction number and transmission heterogeneity from pathogen sequences without building a phylogenetic tree, thus making it easily scalable to large pathogen genome datasets.

Significance statement: For many infectious diseases, a small fraction of individuals has been documented to disproportionately contribute to onward spread. Characterizing the extent of superspreading is a crucial step towards the implementation of efficient interventions. Despite its epidemiological relevance, it remains difficult to quantify transmission heterogeneity. Here, we present a novel inference framework harnessing the size of clusters of identical pathogen sequences to estimate the reproduction number and the dispersion parameter. We also show that the size of these clusters can be used to estimate the transmission advantage of a pathogen genetic variant. This work provides crucial new tools to better characterize the spread of pathogens and evaluate their control.

RR:C19 Evidence Scale rating by reviewer:

  • Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.


Review: The authors introduce a novel method that estimates R and k for a pathogen using sequencing data. The idea is a welcome alternative to relying on contact tracing data, which are often difficult to obtain. The method, as presented, is technically sound. However, I have some concerns about how it would be implemented and how often it will actually be useful and lead to accurate results, particularly in emerging pathogen settings, where R is more likely to be large.

Major comments:

  1. There appear to be issues with clusters observed at size 1-2 that should really be much bigger, because ρ (the probability of sequencing) is very small. It seems that too much weight will be placed on these clusters even though they carry little signal. The authors note that this introduces substantial bias, and under current surveillance conditions this is likely to be a pervasive challenge.

  2. While the authors present a thoughtful framework for estimating p, this still seems a significant challenge and has the potential to sway results significantly. How is uncertainty around this incorporated into the estimation framework?

  3. The assumption of identical sequences has different implications for different diseases; it might be more reasonable for TB than for pathogens that evolve very quickly. How identical do the sequences need to be?

  4. What about potential for multiple infectors in a cluster? This method is estimating R assuming a single infector in each cluster, even though there seems to be some exploration of the contribution of the most infectious individual to a cluster (page 4).

  5. One challenge with this approach is that certain strains are less likely to cause symptoms, or are less severe and potentially less infectious. These will not be well represented in the analysis and could introduce bias. This could have implications for the analysis of transmission advantage between strains if the sampling fraction differs between strains (which it likely will if there is a true transmission advantage). Further, if k differs between strains, this could also have implications, since the likelihood of being sampled is typically related to network degree and strains with lower k will tend to create clusters with higher-degree nodes. Sampling probabilities are not really discussed, and they seem to have significant bearing on interpreting the results of this analysis.

  6. In general, it is a limitation that the method assumes that those that are sampled are representative of unsampled isolates.

  7. All examples are from situations where R is at a subcritical level, and it would be expected that this method performs best in these circumstances. It is less clear whether it will perform well when R is larger (likely it will not, due to the threshold imposed by p). Does that imply that this method is really best suited to scenarios where a disease is well contained? This seems contrary to other statements about this potentially being a method suited to an outbreak setting (where R is typically greater than 1).

  8. It appears that an estimate of the generation time is required for this method. If this is the case, more discussion of how it would be inferred, particularly for a novel pathogen, and of how sensitive the method is to it, would be helpful.
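
Comments 1 and 7 can be made concrete with a short numerical sketch. This is not the authors' implementation. It assumes the standard final-size distribution of a branching process with negative binomial offspring, applied to offspring that still carry the infector's sequence (mean R·p, dispersion k). Here p is read as the probability that transmission occurs before a mutation, which is an assumption, since the review does not define p. It further assumes that each case is sequenced independently with probability ρ. The function names and parameter values are hypothetical and chosen only for illustration.

```python
from math import exp, lgamma, log


def cluster_size_pmf(R, k, p, max_size):
    """Return [P(size = 1), ..., P(size = max_size)] for one cluster.

    Assumption: offspring that still carry the infector's sequence are
    negative binomial with mean R * p and dispersion k.
    """
    m = R * p  # mean number of identical-sequence offspring per case
    pmf = []
    for j in range(1, max_size + 1):
        # Log of the negative binomial branching-process final-size formula.
        log_prob = (
            lgamma(k * j + j - 1) - lgamma(k * j) - lgamma(j + 1)
            + (j - 1) * log(m / k)
            - (k * j + j - 1) * log(1 + m / k)
        )
        pmf.append(exp(log_prob))
    return pmf


def observed_singleton_fraction(pmf, rho):
    """Share of observed clusters that contain exactly one sequence.

    Assumption: each case is sequenced independently with probability rho,
    and a cluster is observed only if at least one of its cases is sequenced.
    """
    one_seen = sum(q * j * rho * (1 - rho) ** (j - 1)
                   for j, q in enumerate(pmf, start=1))
    any_seen = sum(q * (1 - (1 - rho) ** j)
                   for j, q in enumerate(pmf, start=1))
    return one_seen / any_seen


# Illustrative values only (not taken from the manuscript).
k, p = 0.3, 0.6
sub = cluster_size_pmf(R=1.0, k=k, p=p, max_size=2000)  # R * p = 0.6
sup = cluster_size_pmf(R=2.5, k=k, p=p, max_size=2000)  # R * p = 1.5
print("total probability, R*p < 1:", sum(sub))
print("total probability, R*p > 1:", sum(sup))
for rho in (1.0, 0.5, 0.05):
    print("rho =", rho, "singleton share =",
          observed_singleton_fraction(sub, rho))
```

Under these assumptions, finite cluster sizes account for essentially all of the probability when R·p < 1 but not when R·p > 1, which is one way to read the threshold raised in comment 7. The share of singletons among observed clusters is also higher at ρ = 0.05 than at ρ = 1, in line with the concern in comment 1 that small observed clusters dominate when sequencing is sparse.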
