
Review 5: "Estimating the Reproduction Number and Transmission Heterogeneity from the Size Distribution of Clusters of Identical Pathogen Sequences"

Reviewers find the proposed method to be novel and validated with synthetic and historical epidemic data. However, they expressed concerns about the uncertainty in quantifying the magnitude of the estimation bias and the validity of this method in the case of an outbreak.

Published on Mar 20, 2024
This Pub is a Review of
Estimating the reproduction number and transmission heterogeneity from the size distribution of clusters of identical pathogen sequences

Abstract: Quantifying transmission intensity and heterogeneity is crucial to ascertain the threat posed by infectious diseases and inform the design of interventions. Methods that jointly estimate the reproduction number R and the dispersion parameter k have however mainly remained limited to the analysis of epidemiological clusters or contact tracing data, whose collection often proves difficult. Here, we show that clusters of identical sequences are imprinted by the pathogen offspring distribution, and we derive an analytical formula for the distribution of the size of these clusters. We develop and evaluate a novel inference framework to jointly estimate the reproduction number and the dispersion parameter from the size distribution of clusters of identical sequences. We then illustrate its application across a range of epidemiological situations. Finally, we develop a hypothesis testing framework relying on clusters of identical sequences to determine whether a given pathogen genetic subpopulation is associated with increased or reduced transmissibility. Our work provides new tools to estimate the reproduction number and transmission heterogeneity from pathogen sequences without building a phylogenetic tree, thus making it easily scalable to large pathogen genome datasets.

Significance statement: For many infectious diseases, a small fraction of individuals has been documented to disproportionately contribute to onward spread. Characterizing the extent of superspreading is a crucial step towards the implementation of efficient interventions. Despite its epidemiological relevance, it remains difficult to quantify transmission heterogeneity. Here, we present a novel inference framework harnessing the size of clusters of identical pathogen sequences to estimate the reproduction number and the dispersion parameter. We also show that the size of these clusters can be used to estimate the transmission advantage of a pathogen genetic variant. This work provides crucial new tools to better characterize the spread of pathogens and evaluate their control.

RR:C19 Evidence Scale rating by reviewer:

  • Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.


Review: The authors present a method for jointly estimating the reproduction number and the transmission heterogeneity (the dispersion parameter of the offspring distribution) in a disease outbreak, using the size distribution of genomic clusters defined as those cases with identical pathogen sequences. This approach does not require building a phylogeny, or collecting contact tracing data, and so is argued to be more scalable than these current approaches.

Notes on overall writing:

  • The introduction provides solid motivation for their approach, i.e. limitations of current methods using case time series, identified transmission chains or phylodynamics.

  • The methodology is clear, building upon current approaches in the field and including sufficient genomics background for potentially unfamiliar researchers wanting to employ their method. All code and data are provided. 

  • All results and claims are supported by a description of the methodology used. The authors present several simulation studies and analyses of real data. There is sufficiently detailed consideration of statistical uncertainty.

Notes on model: 

  • The assumption that 2 identical sequences are always linked within an epidemiological cluster seems a strong one. The authors are clear about this, but it seems to me that this assumption may very commonly be violated in practice, limiting the applicability of the method. I imagine, for example, that this assumption may have been violated in the COVID-19 example in the paper?

  • Perhaps this limitation could be mitigated with some guidance on selecting the scale of data to which the method is applied.

  • The authors discuss how their method performs poorly when R>1/p. Since R is one of the parameters we are trying to estimate and p is, I believe, being estimated also from data, it seems it would be difficult to assess this threshold in real applications. 

  • I did not have access to the supplementary materials for my review, so apologies if this was covered there, but it seems that p is the crux of this method and, potentially, would be impacted by many factors including testing policy/efficiency varying in time, strength of symptoms etc. A more detailed discussion on how an adjustment for partial observation bias impacts the estimates, and the model's sensitivity to various levels of non-uniform sequence coverage would be helpful.
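
To make the 1/p threshold raised in the notes above concrete, here is a minimal simulation sketch. It is not the authors' code, and all function names are hypothetical. It assumes (as one reading of the paper, not something established in this review) that p is the probability that an infectee carries the same sequence as its infector, that each infectee is identical independently of the others, and that the offspring distribution is negative binomial with mean R and dispersion k. Under those assumptions the number of identical-sequence offspring is again negative binomial with mean R·p and the same k, so clusters of identical sequences are finite with mean size 1/(1 − R·p) only when R < 1/p.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson variate (Knuth's method; fine for the small rates used here)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def is_subcritical(R, p):
    """True when identical-sequence clusters are expected to stay finite (R < 1/p)."""
    return R * p < 1.0


def identical_cluster_size(R, k, p, rng, cap=10_000):
    """Simulate the size of one cluster of identical sequences.

    Assumption: identical-sequence offspring ~ NegBin(mean=R*p, dispersion=k),
    sampled as a gamma-Poisson mixture. `cap` truncates supercritical runs.
    """
    mean_identical = R * p
    size = active = 1
    while active and size < cap:
        new = 0
        for _ in range(active):
            # Gamma(shape=k, scale=mean/k) has mean R*p; mixing gives NegBin.
            rate = rng.gammavariate(k, mean_identical / k)
            new += poisson(rate, rng)
        size += new
        active = new
    return min(size, cap)


def mean_cluster_size(R, k, p, n_clusters=20_000, seed=1):
    """Monte Carlo estimate of the mean identical-sequence cluster size."""
    rng = random.Random(seed)
    total = sum(identical_cluster_size(R, k, p, rng) for _ in range(n_clusters))
    return total / n_clusters
```

For illustration, with the hypothetical values R = 1.0, k = 0.3 and p = 0.5, `is_subcritical(1.0, 0.5)` holds and `mean_cluster_size(1.0, 0.3, 0.5)` should land close to 1/(1 − 0.5) = 2. In practice R is unknown, which is exactly the difficulty the reviewer raises: the check can only be applied to an estimate of R.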

Although not the focus of this review, we make some small suggestions below:

  • I’m not clear on what figure 1A is supposed to be showing. I might suggest additional labels or caption, or perhaps removing it. Is this trying to demonstrate how infectees might have identical or non-identical sequences to their infector?

  • The authors are clear in the text about the strengths and limitations (or more generally, the quantified results) of their approach. It would be helpful to reflect on this in either the abstract or introduction, which currently don’t discuss where and how well the method works. 

  • Combining 2 of the Model points above, could the behaviour around the threshold 1/p be improved upon by better adjusting for partial observation bias? It seems this would improve the applicability of the method.
