RR:C19 Evidence Scale rating by reviewer:
Not informative. The flaws in the data and methods in this study are sufficiently serious that they do not substantially justify the claims made. It is not possible to say whether the results and conclusions would match those of the hypothetical ideal study. The study should not be considered as evidence by decision-makers.
***************************************
Review:
The authors mention that they classify CXRs into COVID-19-positive and COVID-19-negative categories. How is the negative class defined? What proportions of healthy controls and other disease manifestations does it include? The histogram of prediction confidence shows that the model separates the positive and negative classes with high confidence, which suggests that the negative class may consist largely of healthy controls. However, this discussion is missing at present.
It is not mentioned whether the datasets will be made publicly available.
The authors need to justify their selection of pretrained models. It is not clear why they performed multiple iterations with and without preprocessing for the ResNet-50 model alone. What performance do the other pretrained models achieve under the same conditions? There is also no discussion of a baseline model in this study. The authors need to perform a statistical analysis to demonstrate whether the observed performance differences are significant.
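For instance, when two models are evaluated on the same test set, McNemar's test on their paired predictions is one standard option. A minimal sketch follows; all arrays and accuracy figures are synthetic placeholders, not values from the paper.

```python
# Minimal sketch: McNemar's test comparing two classifiers on one test set.
# All data here are synthetic placeholders; nothing is taken from the paper.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                          # placeholder labels
pred_a = np.where(rng.random(500) < 0.90, y_true, 1 - y_true)  # "model A", ~90% acc
pred_b = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)  # "model B", ~85% acc

correct_a, correct_b = pred_a == y_true, pred_b == y_true
# 2x2 agreement table: [both right, A right/B wrong; A wrong/B right, both wrong]
table = [
    [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p={result.pvalue:.4f}")
```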
The rationale behind the study is unclear. Do the authors intend to conclude that ResNet-50 is the best model for separating COVID-19 cases from the other classes? How would the trained model generalize to real-world applications?
It is not clear why the authors created a three-channel image with one channel filled with zeroes. Instead, the authors could have segmented the lung regions and used them to train the models.
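For context, a minimal sketch of the two options; the zero-filled channel layout is an assumption, since the exact arrangement used by the authors is not specified.

```python
# Minimal sketch: feeding a grayscale CXR to a CNN that expects 3 channels.
# The image is a random placeholder; the zero-filled layout is an assumption
# about what the paper did, since the exact channel arrangement is not stated.
import numpy as np

img = np.random.rand(1024, 1024).astype(np.float32)  # placeholder grayscale CXR

# Arrangement the review questions: one of the three channels is all zeros,
# so one set of pretrained first-layer filters receives no signal at all.
zero_filled = np.stack([img, img, np.zeros_like(img)], axis=-1)

# Common alternative: replicate the grayscale image across all three channels,
# so every pretrained filter receives the full image signal.
replicated = np.stack([img, img, img], axis=-1)
```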
Which limitations in the existing literature do the authors propose to overcome? To demonstrate generalization, the authors should evaluate performance on a separate collection that was not part of the training or validation process.
COVID-19 pneumonia is a form of viral pneumonia; however, pneumonia is also caused by bacterial, non-COVID viral, and other pathogens. The authors need to demonstrate how these models behave on these distinct types of pneumonia.
What do the authors mean by "one of the last layers"? How did the authors optimize feature selection? It is not clear whether the authors performed online or offline augmentation; if offline, what is the percentage increase in the training data? What is the performance with and without augmentation, and are the differences statistically significant?
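To make the distinction concrete, here is a minimal sketch of online augmentation in torchvision; the transform choices are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch: online vs. offline augmentation (illustrative only).
# Online: random transforms are applied at load time, so every epoch sees
# different variants and no fixed "percentage increase" applies.
# Offline: augmented copies are written to disk once, enlarging the stored
# training set by a fixed, reportable factor.
import torchvision.transforms as T

online_tf = T.Compose([
    T.RandomRotation(degrees=5),                        # small in-plane rotations
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # small shifts
    T.ToTensor(),
])
# Passing online_tf to a torchvision Dataset/DataLoader re-augments each image
# every epoch; an offline pipeline would instead save N transformed copies per
# image and report an N-fold, i.e. (N-1)*100%, increase in training data.
```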
How did the authors choose the best layer from which to extract features? What performance differences are observed when extracting features from different intermediate layers, and are these differences statistically significant? If the authors believe one choice is better than another, they need to explain why.
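One way to run such an ablation is to extract features from several candidate layers and train the same downstream classifier on each set. A minimal sketch using torchvision's feature-extraction utility; the layer names assume ResNet-50's standard module naming, and the weights and input are placeholders.

```python
# Minimal sketch: extracting features from several intermediate layers of a
# ResNet-50 so that downstream classifiers trained on each can be compared.
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

model = resnet50(weights=None).eval()   # the fine-tuned weights would go here
extractor = create_feature_extractor(
    model,
    return_nodes={"layer3": "mid", "layer4": "late", "avgpool": "pooled"},
)
x = torch.randn(1, 3, 224, 224)         # placeholder preprocessed batch
with torch.no_grad():
    feats = extractor(x)
for name, tensor in feats.items():
    print(name, tuple(tensor.shape))    # candidate feature sets to compare
```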
The authors state that they used the highest possible resolution for this study; however, they also mention cropping the images to 1024 x 1024 pixels, which contradicts that claim. What is the effect of increasing or decreasing the input spatial resolution?
How did the authors ensure that the model arrives at the right decision for the right reason? It is important to perform saliency tests and compute class activation maps to localize the regions of interest learned by the pretrained models. The issues with reporting such high performance in detecting image-level COVID-19 labels on publicly available COVID-19 collections have been discussed at length in the literature. Fundamentally, these CNN models learn the characteristics not of the disease but of the datasets, owing to severe class imbalance and other constraints.
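As one concrete option, a minimal Grad-CAM sketch for a ResNet-50 follows; the target layer, preprocessing, and weights are assumptions, and the authors would substitute their fine-tuned model and real CXRs.

```python
# Minimal Grad-CAM sketch (assumptions: layer4 as target layer; random weights
# and input stand in for the authors' fine-tuned model and real CXRs).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
acts, grads = {}, {}

model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)          # placeholder preprocessed CXR batch
scores = model(x)
cls = scores.argmax(dim=1).item()
scores[0, cls].backward()                # gradient of the predicted class score

w = grads["v"].mean(dim=(2, 3), keepdim=True)           # GAP over the gradients
cam = F.relu((w * acts["v"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
# Overlaying `cam` on the input CXR shows whether the model attends to the lung
# fields or to dataset artifacts such as text markers and image borders.
```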