RR:C19 Evidence Scale rating by reviewer:

Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.

Review:

With tuberculosis (TB) endemic in Southern Africa, the authors investigated epidemiological risk factors for TB in the Northern Cape Province, South Africa, an understudied TB-endemic region with extremely high TB incidence and the lowest population density of any province.

There is strong motivation for the study, given the need to better understand TB risk factors beyond HIV. However, the variables addressed in this preprint (e.g. household density, SES) have previously been well characterized, and it is unclear what added value the study presents. For example, such risk factors have been examined in case-control studies going back to Lienhardt in the early ’90s (https://pubmed.ncbi.nlm.nih.gov/15914505/).

It is unclear how covariates were chosen and on what basis they were included in the regression models. The authors could consider an alternative approach to assessing possible confounders (e.g. a directed acyclic graph). Several risk factors (e.g. neighborhood-level factors) are not included, which may be why such a small proportion of overall case/control status is explained.

The interaction terms are also insufficiently motivated: background is needed on why effect modification between age and SES is expected and whether it is assessed on the multiplicative or additive scale. The same issue applies to residence and birthplace. While the control groups are well chosen and based on broad community health clinic recruitment, it is unclear to what degree these patients are representative of the underlying population (clinic-based controls only), e.g. <1000 participants vs. 35,000 in a national prevalence survey. It is also unclear whether the analysis is sufficiently powered, as no sample size calculations are provided for the case-control study.
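As an illustration of the kind of calculation that would help here, the following is a minimal sketch of a sample size estimate for an unmatched case-control study using the Kelsey formula. The control exposure prevalence (0.30), the target odds ratio (2.0), and the 1:1 control-to-case ratio are illustrative assumptions, not values from the preprint, and the function name `cases_needed` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def cases_needed(p0, odds_ratio, controls_per_case=1.0, alpha=0.05, power=0.80):
    """Number of cases required (Kelsey formula), unmatched case-control design.

    p0 is the assumed exposure prevalence among controls; odds_ratio is the
    smallest odds ratio the study should be able to detect (must differ from 1).
    """
    # Exposure prevalence among cases implied by p0 and the odds ratio.
    p1 = odds_ratio * p0 / (1.0 + p0 * (odds_ratio - 1.0))
    r = controls_per_case
    # Weighted average exposure prevalence across cases and controls.
    p_bar = (p1 + r * p0) / (r + 1.0)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = ((r + 1.0) / r) * p_bar * (1.0 - p_bar) * (z_alpha + z_beta) ** 2
    n /= (p1 - p0) ** 2
    return ceil(n)


# Illustrative inputs only (assumed, not taken from the preprint).
n_cases = cases_needed(p0=0.30, odds_ratio=2.0)
print("cases needed:", n_cases)
print("controls needed:", n_cases)  # 1:1 ratio assumed
```

Reporting the assumed exposure prevalence and the minimum detectable odds ratio alongside the achieved sample size would let readers judge whether null findings reflect absence of association or limited power.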

The interpretation and discussion of the multivariable model are subject to the Table 2 fallacy: coefficients for several exposures are interpreted from a single model, whereas each association should be estimated for a specific exposure-outcome relationship (Westreich D, Greenland S. The table 2 fallacy: presenting and interpreting confounder and modifier coefficients. Am J Epidemiol. 2013 Feb 15;177(4):292-8.). An alternative approach might be to use some form of variance partition modeling or principal components analysis.

It is unclear from the methods whether genetic ancestries were included in the regression models. Given the multifactorial nature of TB, it is unclear how genetics can be viewed without a more robust gene-environment interaction assessment. Random Forests are predictive models and should be differentiated from (logistic) regression-based modeling that is meant to support causal inference. The authors should also address the issue that a Random Forest regressor cannot extrapolate: when asked to predict for values not previously seen, it will always return an average of the values seen previously.
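The extrapolation point can be illustrated with a small, self-contained sketch. It is a toy bagged ensemble of one-dimensional regression trees written in plain Python, not the authors' model or any library's implementation; the function names and the linear toy data are assumptions made for illustration. Because every leaf stores a mean of training targets, predictions for inputs far outside the training range stay within the range of targets already seen.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def build_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree; each leaf stores the mean of its training targets."""
    if len(xs) < 2 * min_leaf or len(set(xs)) == 1:
        return {"leaf": _mean(ys)}
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None  # (sum of squared errors, split index)
    for k in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[k - 1] == xs[k]:
            continue  # cannot split between identical x values
        left, right = ys[:k], ys[k:]
        ml, mr = _mean(left), _mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, k)
    if best is None:
        return {"leaf": _mean(ys)}
    k = best[1]
    return {
        "threshold": (xs[k - 1] + xs[k]) / 2,
        "left": build_tree(xs[:k], ys[:k], min_leaf),
        "right": build_tree(xs[k:], ys[k:], min_leaf),
    }


def predict_tree(node, x):
    while "leaf" not in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["leaf"]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: each tree is fit to a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    return _mean([predict_tree(tree, x) for tree in forest])


# Toy data (assumed): y = 2x observed only for x between 0 and 10.
xs = [0.2 * i for i in range(51)]
ys = [2.0 * x for x in xs]
forest = fit_forest(xs, ys)

print("largest training target:", max(ys))
print("prediction at x = 100 (true value 200):", predict_forest(forest, 100.0))
print("prediction at x = 1000:", predict_forest(forest, 1000.0))
```

In this sketch the predictions at x = 100 and x = 1000 are identical, since both inputs fall into the rightmost leaf of every tree.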

The preprint provides a unique perspective on the local epidemiology of TB in the Northern Cape of South Africa. However, the measured risk factors explain only a small proportion of the variance between cases and controls.