
Review 1: "Risk Factors Associated with Post-Acute Sequelae of SARS-CoV-2 in an EHR Cohort: A National COVID Cohort Collaborative (N3C) Analysis as Part of the NIH RECOVER Program"

Reviewers find this study to be potentially informative to reliable and highlight the study’s limited generalizability and potential introduction of bias in the control selection process.

Published on Oct 11, 2022
This Pub is a Review of
Risk Factors Associated with Post-Acute Sequelae of SARS-CoV-2 in an EHR Cohort: A National COVID Cohort Collaborative (N3C) Analysis as part of the NIH RECOVER program

ABSTRACT

Background: More than one-third of individuals experience post-acute sequelae of SARS-CoV-2 infection (PASC, which includes long-COVID).

Objective: To identify risk factors associated with PASC/long-COVID.

Design: Retrospective case-control study.

Setting: 31 health systems in the United States from the National COVID Cohort Collaborative (N3C).

Patients: 8,325 individuals with PASC (defined by the presence of the International Classification of Diseases, version 10 code U09.9 or a long-COVID clinic visit) matched to 41,625 controls within the same health system.

Measurements: Risk factors included demographics, comorbidities, and treatment and acute characteristics related to COVID-19. Multivariable logistic regression, random forest, and XGBoost were used to determine the associations between risk factors and PASC.

Results: Among 8,325 individuals with PASC, the majority were >50 years of age (56.6%), female (62.8%), and non-Hispanic White (68.6%). In logistic regression, middle-age categories (40 to 69 years; OR ranging from 2.32 to 2.58), female sex (OR 1.4, 95% CI 1.33-1.48), hospitalization associated with COVID-19 (OR 3.8, 95% CI 3.05-4.73), long (8-30 days, OR 1.69, 95% CI 1.31-2.17) or extended hospital stay (30+ days, OR 3.38, 95% CI 2.45-4.67), receipt of mechanical ventilation (OR 1.44, 95% CI 1.18-1.74), and several comorbidities including depression (OR 1.50, 95% CI 1.40-1.60), chronic lung disease (OR 1.63, 95% CI 1.53-1.74), and obesity (OR 1.23, 95% CI 1.16-1.3) were associated with increased likelihood of PASC diagnosis or care at a long-COVID clinic. Characteristics associated with a lower likelihood of PASC diagnosis or care at a long-COVID clinic included younger age (18 to 29 years), male sex, non-Hispanic Black race, and comorbidities such as substance abuse, cardiomyopathy, psychosis, and dementia. More doctors per capita in the county of residence was associated with an increased likelihood of PASC diagnosis or care at a long-COVID clinic. Our findings were consistent in sensitivity analyses using a variety of analytic techniques and approaches to select controls.

Conclusions: This national study identified important risk factors for PASC such as middle age, severe COVID-19 disease, and specific comorbidities. Further clinical and epidemiological research is needed to better understand underlying mechanisms and the potential role of vaccines and therapeutics in altering PASC course.

KEY POINTS

Question: What risk factors are associated with post-acute sequelae of SARS-CoV-2 (PASC) in the National COVID Cohort Collaborative (N3C) EHR Cohort?

Findings: This national study identified important risk factors for PASC such as middle age, severe COVID-19 disease, specific comorbidities, and the number of physicians per capita.

Meaning: Clinicians can use these risk factors to identify patients at high risk for PASC while they are still in the acute phase of their infection and also to support targeted enrollment in clinical trials for preventing or treating PASC.

RR:C19 Evidence Scale rating by reviewer:

  • Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.




This is a large electronic health record (EHR)-based case-control study, set in 31 US health systems, that aimed to identify risk factors for long COVID/PASC in the N3C cohort, which is part of the NIH’s RECOVER initiative. Rigorously derived information on long COVID risk factors is extremely important for long COVID prevention, given that 20-30% of people with SARS-CoV-2 infection go on to develop long COVID and there are currently no treatments. It may also help to inform much-needed diagnostics and treatments. The public health burden of long COVID in the US is quite high (an estimated 7.3-7.5% of all US adults may have it, according to the Household Pulse Survey and a recent study by our team at the City University of New York). Knowing more about risk factors can help prevent long COVID (e.g., via better-targeted vaccination/boosting efforts) and can help providers identify who may be most at risk of developing long COVID, given a SARS-CoV-2 infection.


1. CDC. Long COVID. Household Pulse Survey.

2. Robertson et al. The epidemiology of long COVID in US adults two years after the start of the US SARS-CoV-2 pandemic. 2022. medRxiv. doi:


Strengths of the study include the ability to include persons with confirmed COVID diagnoses, as well as leveraging of information on potential risk factors that was documented in the EHR prior to the COVID diagnosis (which helps establish temporality).


Because the study is EHR-based, the universe of potential risk factors examined is limited, and because the study was health care system-based, a substantial proportion of the study population had severe COVID, including 37% of those classified as having long COVID. Because of this, and because many people with SARS-CoV-2 infection don’t test for SARS-CoV-2 or access health care, the study’s generalizability to all people with COVID may be limited. The validity of the long COVID outcome definition (an ICD-10 code for long COVID or being seen at a long COVID clinic) needs to be established/clarified. There may be non-ignorable bias introduced by the control selection process, despite substantive efforts by the authors to limit it. The predictive modelling was under-informed by theory, hypothesized causal associations, basic science evidence, or other epidemiological evidence to date on likely causes and risk factors for long COVID. Rather than identifying long COVID risk factors, I think the study has largely identified EHR-based predictors of a long COVID diagnosis in the health system. This does not reduce the significance/importance of the work, but the distinction is important. Predictive modelling is useful for identifying people who are more likely to develop long COVID based on what is happening with patients prior to and during the acute phase, which is a useful aim. However, predictive models are developed mechanically, and they ignore and often obscure true causal relationships/associations. This makes the term ‘risk factor’, as used in the paper to describe associations, problematic.

Study population:

Specifically, the study examined EHR-derived predictors of long COVID among 8,325 individuals classified as having long COVID (based on long COVID ICD-10 codes or long COVID clinic attendance) and three different control groups (presumed to not have long COVID) selected from the EHR. The universe of those in the EHR with a documented SARS-CoV-2 diagnosis or positive PCR test comprised 1,062,661 individuals, of whom 8,325 were classified as long COVID cases and 1,054,336 as possible controls without long COVID.

Outcome definition:

To my knowledge, the utility of the U09.9 ICD-10 code, in terms of its sensitivity and specificity for classifying individuals with a history of SARS-CoV-2 as having long COVID, is not yet established and deserves comment from the team. It would be helpful to know how it compares to a symptom-based definition (e.g., the WHO working case definition), and why specific symptoms known to be hallmarks of long COVID were not used by the team. There is a risk of introducing bias through misclassification of cases as controls (noted by the authors) and vice versa (outcome misclassification). Since there is no gold standard definition for classifying individuals as having long COVID, it would be important to know more about the clinical characteristics (number and types of long COVID symptoms) among the cases, and to have a sense of how specific the case definition might be (important for assessing bias in case-control studies).

Control selection:

In case-control studies, there is a high risk of introducing bias through the control selection process, which the authors rightly acknowledge. Based on the numbers provided, about 8,325/1,054,336 (0.79%) of those with SARS-CoV-2 infection developed long COVID according to the outcome classifications used. Given that 20-30% of people with SARS-CoV-2 would be expected to go on to develop long COVID, it seems likely that a substantial proportion of people in the control group had long COVID but were not captured by the long COVID case definition. Accordingly, the authors examined three different control groups, selected with increasingly specific algorithms, to assess the potential for bias. However, the validity of the algorithms for this purpose (i.e., excluding undiagnosed long COVID) is unclear, given the challenges with long COVID diagnosis and lack of a gold standard. Knowing something about the specificity of the algorithm for excluding possible cases of long COVID from the control group is important in order to assess the potential for bias.
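The arithmetic behind this concern can be checked with a short sketch. The counts are the ones quoted in this review; the 20-30% prevalence range is the expectation cited above, and applying it to this EHR population is an assumption made only for illustration (the function name `implied_uncaptured_share` is hypothetical).

```python
# Counts quoted in the review (taken from the manuscript under review).
cases = 8_325
possible_controls = 1_054_336
infected_total = cases + possible_controls  # 1,062,661 with documented SARS-CoV-2

# Share of the infected EHR population captured by the case definition.
captured_pct = 100 * cases / infected_total

# Ratio as written in the review text (cases relative to possible controls).
ratio_pct = 100 * cases / possible_controls


def implied_uncaptured_share(assumed_prevalence):
    """Share of possible controls who would have long COVID if the assumed
    prevalence applied to this EHR population (an assumption, not a finding)."""
    expected_cases = assumed_prevalence * infected_total
    return (expected_cases - cases) / possible_controls


low = implied_uncaptured_share(0.20)
high = implied_uncaptured_share(0.30)

print(f"captured: {captured_pct:.2f}% of infected; {ratio_pct:.2f}% relative to possible controls")
print(f"implied undiagnosed share of possible controls: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions, the case definition captures under 1% of documented infections, and roughly one in five to three in ten possible controls could have unrecognized long COVID. The true figure depends on how closely the EHR population resembles the populations from which the 20-30% estimate was derived.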

It would be important to clarify whether the control group had adequate follow-up time such that they had the opportunity to be diagnosed with long COVID if they had it (i.e., equivalent opportunity as cases to meet the long COVID case definition in the study).

In the two restricted control groups, why were only a subset ‘eligible’ for becoming controls? This should be explained/described.

If the third control group classification has the most specificity, why wasn’t that group used for the primary analysis?

Predictors and predictive models:

In predictive models, the study identified several EHR-based predictors of long COVID that differentiated cases from controls. While both severe and less severe COVID cases were included in the main analyses, the authors present analyses stratified by hospitalization status in the appendix which, importantly, showed some differences in predictors by severity. I think that this stratified analysis should be presented in the main paper, not as an appendix. Moreover, these findings suggest that there is effect modification by severity of acute COVID, which is unaccounted for in the main predictive model. If the main model with non-hospitalized and hospitalized patients is retained, there should be a formal assessment of effect modification by severity of acute SARS-CoV-2. It was not clear to me whether this assessment was done in the original analysis, but it seems highly indicated based on what is known about long COVID and severe COVID.
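One simple way to formalize such an assessment is to test whether a predictor's odds ratio differs between hospitalized and non-hospitalized strata. The sketch below is illustrative only: the 2x2 counts are hypothetical (not from the manuscript), the function names are made up for this example, and it uses Woolf standard errors with a Wald z-test on the difference in log odds ratios. For one binary predictor and one binary stratum, this is equivalent to testing the interaction term in a saturated logistic model; the authors' matched design and multivariable models would instead call for an interaction term in their own regression.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio and Woolf standard error of its log for a 2x2 table.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    or_value = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_value, se_log_or


def homogeneity_test(table_1, table_2):
    """Wald z-test that two stratum-specific odds ratios are equal."""
    or_1, se_1 = odds_ratio(*table_1)
    or_2, se_2 = odds_ratio(*table_2)
    z = (math.log(or_1) - math.log(or_2)) / math.sqrt(se_1**2 + se_2**2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return or_1, or_2, z, p_value


# Hypothetical counts for one predictor, split by acute-phase severity.
hospitalized = (120, 200, 80, 400)
non_hospitalized = (300, 1500, 700, 3500)

or_hosp, or_non_hosp, z, p_value = homogeneity_test(hospitalized, non_hospitalized)
print(f"OR hospitalized: {or_hosp:.2f}, OR non-hospitalized: {or_non_hosp:.2f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.3g}")
```

With these made-up counts, the predictor is associated with the outcome only among hospitalized patients, and the test rejects homogeneity; a pooled model without an interaction term would report a single averaged odds ratio and hide that difference.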

The multivariable models presented are predictive models, not causal models. It can be problematic to refer to associations identified through predictive models as ‘risk factors’, because some important and potentially causal associations can be adjusted away, or their magnitude diminished, when every single measured variable is included in the model. In predictive models, it may be misleading or incorrect to conclude that a variable (e.g., diabetes) which is not associated with the outcome in the final model is not a ‘risk factor’ for long COVID that may be in need of additional investigation. Associations from predictive models such as these can simply be referred to as ‘independent predictors’ of the study outcome, not ‘risk factors’. These predictors can be used by clinicians and health systems to help identify those who may be more likely to develop long COVID after the acute phase of infection.
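A toy example shows how an association can be adjusted away. The counts below are hypothetical and chosen so that a comorbidity raises the chance of hospitalization, and hospitalization raises the odds of a long COVID diagnosis. The crude odds ratio is compared with a Mantel-Haenszel odds ratio stratified by hospitalization; this is a simplified stand-in for multivariable adjustment, not the authors' model, and the function names are made up for this sketch.

```python
def crude_odds_ratio(tables):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a = sum(t[0] for t in tables)  # exposed cases
    b = sum(t[1] for t in tables)  # exposed controls
    c = sum(t[2] for t in tables)  # unexposed cases
    d = sum(t[3] for t in tables)  # unexposed controls
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel summary odds ratio across strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical strata: (exposed cases, exposed controls,
#                       unexposed cases, unexposed controls)
strata = [
    (60, 40, 60, 40),    # hospitalized
    (20, 60, 80, 240),   # not hospitalized
]

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"crude OR: {crude:.2f}, hospitalization-adjusted OR: {adjusted:.2f}")
```

In this constructed example the comorbidity has a crude odds ratio above 1 but an adjusted odds ratio of 1, because its entire association runs through hospitalization. Reporting only the adjusted estimate would suggest the comorbidity is unrelated to the outcome, which is why crude and adjusted associations are best read side by side.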

In light of these concerns, the authors should clarify in the limitations that many potentially important predictors, including many that are supported by research or hypothesized to be causally related to long COVID risk, were not examined in the study. These include vaccination/boosting status, prior infection, the use of antivirals, variant era, the peak viral load in the acute phase of infection, EBV reactivation/viremia, autoantibodies, and diabetes. The lack of framing of the analysis plan around what is known and hypothesized about causes of long COVID from the literature seems like a missed opportunity.

Some specific issues that the authors could address to improve the quality of the manuscript include:

  1. Outcome and model validity should be addressed, given the lack of a gold standard diagnostic classification system for long COVID.

  2. Speak to potential control selection concerns, including whether controls had adequate follow-up time to be able to meet the case definition. E.g., if they were not seen again in the health system at all after the acute phase of infection, they likely cannot be accurately classified as a case or a control (long COVID or not). Along these lines, it would be helpful to describe the distribution of the length of follow-up and number of visits among cases vs. controls.

  3. I would strongly suggest referring to associations in the manuscript as ‘predictors of long COVID diagnosis’ as opposed to ‘risk factors for PASC’, which could be misleading.

  4. If the main model with both non-hospitalized and hospitalized patients is retained, there should be a formal assessment of effect modification by severity of acute SARS-CoV-2.

  5. Include more descriptive epidemiology of the 8,325 individuals classified as having long COVID, including by calendar time (an epidemic curve), reporting the distribution of time from infection to diagnosis/first long COVID clinic visit, estimated duration, severity, and symptomatology.

  6. The authors should present the crude associations alongside the adjusted ones.

  7. Include some individual-level social determinants of health that are likely available in the EHR, such as insurance status and Medicaid status.

  8. There should be some mention in the introduction and/or discussion of the need for strategies to prevent long COVID via more targeted vaccination and boosting, given its severity and lack of treatment. Vaccines reduce the risk of infection, and emerging evidence strongly suggests that vaccination/boosters also reduce the risk of long COVID given a breakthrough infection. Yet, this wasn’t mentioned as a potential implication of the findings.
