Review 1: "Explainable Machine Learning to Identify Patients at Risk of Developing Hospital Acquired Infections"

Published on Feb 12, 2025
Description

Abstract Hospital-acquired infections (HAIs) contribute to increased mortality rates and extended hospital stays. Patients with complex neurological impairments, secondary to conditions such as acquired brain injury or progressive degenerative conditions are particularly prone to HAIs and often have the worst resulting clinical outcomes and highest associated cost of care. Research indicates that the prompt identification of such infections can significantly mitigate mortality rates and reduce hospitalisation duration. The current standard of care for timely detection of HAIs for inpatient acute and post-acute care settings in the UK is the National Early Warning Score v02 (NEWS2). NEWS2, despite its strengths, has been shown to have poor prognostic accuracy for specific indications, such as infections. This study developed a machine learning (ML) based risk stratification tool, utilising routinely collected patient electronic health record (EHR) data, encompassing over 800+ patients and 400k+ observations collected across 4-years, aimed at predicting the likelihood of infection in patients within an inpatient care setting for patients with complex acquired neurological conditions. Built with a combination of historical patient data, clinical coding, observations, clinician reported outcomes, and textual data, we evaluated our framework to identify individuals with an elevated risk of infection within a 7-day time-frame, retrospectively over a 1-year “silent-mode” evaluation. We investigated several time-to-event model configurations, including manual feature-based and data-driven deep generative techniques, to jointly estimate the timing and risk of infection onset. 
We observed strong performance of the models developed in this study, achieving high prognostic accuracy and robust calibration from 72–6 hours prior to clinical suspicion of infection, with AUROC values ranging from 0.776–0.889 and well-calibrated risk estimates exhibited across those time intervals (IBS<0.178). Furthermore, by assigning model-generated risk scores into distinct categories (low, moderate, high, severe), we effectively stratified patients with a higher susceptibility to infections from those with lower risk profiles. Post-hoc explainability analysis provided valuable insights into key risk factors, such as vital signs, recent infection history, and patient age, which aligned well with prior clinical knowledge. Our findings highlight our framework’s potential for accurate and explainable insights, facilitating clinician trust and supporting integration into real-world patient care workflows. Given the heterogeneous and complex patient population, and our under-utilisation of the data recorded in routine clinical notes and lab reports, there are considerable opportunities for performance improvement in future research by expanding our model’s multimodal capabilities, generalisability, and additional model personalisation steps.

RR\ID Evidence Scale rating by reviewer:

  • Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.

Review: Overall, while this paper appears promising, it is not ready for publication. Methodologically, it is too incremental: it simply applies standard tools to the problem of predicting time until hospital-acquired infections (HAIs). In this application, I do not think the authors have made a thorough attempt to compare against well-established baselines for the same general problem (these baselines may not have been developed specifically for modeling time until HAIs, but they could trivially be applied to it). Moreover, I do not think the results and discussion on model explanation are thorough enough.

Strengths:

  • The proposed framework appears sound.

  • The experimental results look potentially promising.

Weaknesses:

  • Insufficient baseline comparison: This paper applies fairly standard ML tools to predict each individual patient's risk of developing a hospital-acquired infection. The authors propose a VAE with different standard survival analysis prediction heads (Cox model, gradient boosting, Nnet-survival, DeepHit) and then use SHAP to explain only the gradient-boosted model. Standard survival analysis evaluation metrics are used (C-index, time-dependent AUC, IBS). Since the authors have chosen a dynamic setup, I do not understand why they did not use dynamic survival analysis baselines such as deep recurrent survival analysis (Ren et al 2019), Dynamic-DeepHit (Lee et al 2020), deep recurrent survival machines (Nagpal et al 2021), or SurvLatent ODE (Moon et al 2022). How do the proposed methods compare with these established baselines? I also think an honest attempt at comparing against standard classification methods used in other risk score models would be helpful, to make the case for why survival analysis is better than classification in this context. Along similar lines, comparing against survival stacking is important (survival stacking converts a survival analysis problem into a classification one; see for example Craig et al (2021)). Finally, optimal decision trees for survival analysis (Bertsimas et al 2022, Zhang et al 2024) are trivially interpretable, and comparing against these would be helpful.

  • In terms of experimental results, GBoostSurv actually does quite well on the AUROC and IBS metrics, well enough that I think the authors should justify much more rigorously why, in practice, one might favor the VAE models over GBoostSurv. The VAE models do show better C-index performance, but C-index is not always the best evaluation metric, and other versions of C-index are not considered here, such as the time-dependent version by Antolini et al (2005).

  • I think it would be helpful if the authors reported evaluation metrics conditioned on how much of a patient's time series has been observed so far. This matters in the clinical context, where more and more of a patient's time series becomes available over time.

  • I think it would be helpful to discuss model interpretation more carefully, including how much we should trust it in light of well-known commentary on this topic such as that of Ghassemi et al (2021). I am particularly concerned about when explanations at the individual patient level should or should not be trusted. Are there cases where the explanation is misleading? How robust are the explanations to the amount of training vs test data used? When we have seen only a little of a patient's time series, versus more and more of it, when are the explanations worth trusting and when are they not?

  • The title of the paper starts with "explainable machine learning", yet in the experimental results only GBoostSurv is explained (from my understanding, not even VAE + GBoostSurv is explained, so none of the VAE approaches have actually been explained using SHAP). If the paper wants to emphasize explainability, it should apply SHAP or other explanation tools to more of the models considered, not just GBoostSurv.

  • As already stated in the paper's discussion, data were sourced from only a single setting, so external validity has not been experimentally confirmed.
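To make the survival stacking suggestion above concrete, here is a minimal sketch of the data expansion behind it, as I understand the idea from Craig et al (2021). It is not the authors' pipeline. The function name, the toy data, and the choice to encode time as a single numeric column (rather than a set of risk-set indicators) are my own illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def stack_survival_data(
    times: Sequence[float],
    events: Sequence[int],
    features: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[int]]:
    """Expand right-censored survival data into a binary classification set.

    For every distinct observed event time t, each patient still at risk
    (observed time >= t) contributes one row: their features plus t itself.
    The row is labelled 1 if that patient's event occurs at t, else 0.
    (Hypothetical helper for illustration only.)
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    rows: List[List[float]] = []
    labels: List[int] = []
    for t in event_times:
        for time_i, event_i, x_i in zip(times, events, features):
            if time_i >= t:  # patient is in the risk set at time t
                rows.append(list(x_i) + [t])
                labels.append(1 if (event_i == 1 and time_i == t) else 0)
    return rows, labels


# Toy data (made up): days to infection, event flag (1 = infection
# observed, 0 = censored), and one covariate per patient.
times = [2.0, 3.0, 3.0, 5.0]
events = [1, 0, 1, 0]
features = [[0.9], [0.1], [0.7], [0.2]]

stacked_rows, stacked_labels = stack_survival_data(times, events, features)
```

Any off-the-shelf binary classifier can then be fit on `stacked_rows` and `stacked_labels`; its predicted probability for a row plays the role of a discrete-time hazard estimate, which is what makes a direct comparison between classification and survival models possible.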
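For readers less familiar with the time-dependent C-index mentioned above, the following is a rough sketch of the quantity in the spirit of Antolini et al (2005): among pairs where patient i has an observed event before patient j's observed time, it counts how often the model assigns i a lower survival probability than j at i's event time. The function name, the `surv_at` callback, the tie handling (half credit), and the toy exponential survival curves are all my assumptions, and the sketch ignores refinements such as tied observed times.

```python
import math
from typing import Callable, Sequence


def antolini_concordance(
    times: Sequence[float],
    events: Sequence[int],
    surv_at: Callable[[int, float], float],
) -> float:
    """Simplified time-dependent concordance.

    surv_at(i, t) must return the predicted survival probability of
    patient i at time t. (Hypothetical interface for illustration.)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # only patients with an observed event anchor a pair
        for j in range(n):
            if times[j] > times[i]:  # j is still event-free at i's event time
                comparable += 1
                s_i = surv_at(i, times[i])
                s_j = surv_at(j, times[i])
                if s_i < s_j:
                    concordant += 1.0
                elif s_i == s_j:
                    concordant += 0.5  # assumption: ties get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made up): exponential survival curves S_i(t) = exp(-rate_i * t).
times = [2.0, 3.0, 3.0, 5.0]
events = [1, 0, 1, 0]
rates = [0.5, 0.1, 0.05, 0.3]

c_td = antolini_concordance(times, events, lambda i, t: math.exp(-rates[i] * t))
```

Because the comparison uses each patient's full predicted survival curve rather than a single risk score, this version remains meaningful for models whose predicted curves can cross, which is the case for several of the deep models under review.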

  1. Antolini et al. A time-dependent discrimination index for survival data. Statistics in Medicine 2005.

  2. Bertsimas et al. Optimal survival trees. Machine Learning 2022.

  3. Craig et al. Survival stacking: casting survival analysis as a classification problem. arXiv 2021.

  4. Ghassemi et al. The false hope of current approaches to explainable artificial intelligence in healthcare. The Lancet Digital Health 2021.

  5. Lee et al. Dynamic-DeepHit: A Deep Learning Approach for Dynamic Survival Analysis With Competing Risks Based on Longitudinal Data. IEEE Transactions on Biomedical Engineering 2020.

  6. Moon et al. SurvLatent ODE: A Neural ODE based time-to-event model with competing risks for longitudinal data improves cancer-associated Venous Thromboembolism (VTE) prediction. MLHC 2022.

  7. Nagpal et al. Deep Parametric Time-to-Event Regression with Time-Varying Covariates. AAAI Spring Symposium 2021.

  8. Ren et al. Deep Recurrent Survival Analysis. AAAI Conference 2019.

  9. Zhang et al. Optimal sparse survival trees. AISTATS 2024.


