
Review 2: "COVID-19 Classification of X-ray Images Using Deep Neural Networks"

The study employs deep learning techniques to classify chest X-ray images as positive or negative for COVID-19. While the techniques were generally accurate, reviewers expressed concern over missing elements that would strengthen the conclusions drawn by the study.

Published on Oct 22, 2020
This Pub is a Review of
COVID-19 Classification of X-ray Images Using Deep Neural Networks

In the midst of the coronavirus disease 2019 (COVID-19) outbreak, chest X-ray (CXR) imaging is playing an important role in the diagnosis and monitoring of patients with COVID-19. Machine learning solutions have been shown to be useful for X-ray analysis and classification in a range of medical contexts. The purpose of this study is to create and evaluate a machine learning model for diagnosis of COVID-19, and to provide a tool for searching for similar patients according to their X-ray scans. In this retrospective study, a classifier was built using a pre-trained deep learning model (ResNet50) and enhanced by data augmentation and lung segmentation to detect COVID-19 in frontal CXR images collected between January 2018 and July 2020 in four hospitals in Israel. A nearest-neighbors algorithm, based on the network's outputs, was implemented to identify the images most similar to a given image. The model was evaluated using accuracy, sensitivity, and the area under the curve (AUC) of the receiver operating characteristic (ROC) and precision-recall (P-R) curves. The dataset sourced for this study includes 2362 CXRs, balanced for positive and negative COVID-19, from 1384 patients (63 +/- 18 years, 552 men). Our model achieved 89.7% (314/350) accuracy and 87.1% (156/179) sensitivity in classification of COVID-19 on a test dataset comprising 15% (350 of 2326) of the original data, with an AUC of 0.95 for the ROC curve and 0.94 for the P-R curve. For each image, we retrieve the images with the most similar DNN-based embeddings; these can be used for comparison with previous cases.

RR:C19 Evidence Scale rating by reviewer:

Not informative. The flaws in the data and methods in this study are sufficiently serious that they do not substantially justify the claims made. It is not possible to say whether the results and conclusions would match those of the hypothetical ideal study. The study should not be considered as evidence by decision-makers.



The authors mention that they classify CXRs into COVID-19 positive and negative categories, but how is the negative class composed? What are the proportions of healthy controls and other disease manifestations in the negative class? The histogram of prediction confidence shows that the model is highly confident in separating the positive and negative classes, which suggests the negative class may contain a large proportion of healthy controls, making the classification task artificially easy. At present, this discussion is missing.

It is not mentioned whether the dataset will be made publicly available.

The authors need to justify their selection of pretrained models. It is not clear why multiple iterations with and without preprocessing were performed only for the ResNet-50 model: what performance do other pretrained models achieve under the same conditions? No baseline model is discussed in this study. The authors also need to perform a statistical significance analysis to demonstrate whether observed performance differences are statistically significant.
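For paired model comparisons on a shared test set, McNemar's test is a standard choice of significance analysis. The sketch below is illustrative only (the reviewed paper reports no such comparison, and the per-image outcome lists are hypothetical); it uses only the Python standard library.

```python
import math

def mcnemar_test(correct_a, correct_b):
    """McNemar's test on paired per-image outcomes of two classifiers.

    correct_a / correct_b: sequences of booleans, one entry per test image,
    indicating whether model A (resp. model B) classified it correctly.
    Returns the continuity-corrected chi-square statistic and its p-value.
    """
    # b: images model A got right and model B got wrong; c: the reverse.
    b = sum(1 for ok_a, ok_b in zip(correct_a, correct_b) if ok_a and not ok_b)
    c = sum(1 for ok_a, ok_b in zip(correct_a, correct_b) if not ok_a and ok_b)
    if b + c == 0:
        return 0.0, 1.0  # the two models disagree on no image
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    # Survival function of chi-square with 1 df: P = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

For example, if two models disagree on 20 test images, with model A correct on 15 of those and model B on 5, the statistic is (|15 − 5| − 1)² / 20 = 4.05, giving p ≈ 0.044.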

The rationale behind the study is not clear. Do the authors mean to suggest that ResNet-50 is the best model for distinguishing COVID-19 cases from other classes? How would the trained model generalize to real-world applications?

It is not clear why the authors created a three-channel image with one channel filled with zeroes. Instead, the authors could have segmented the lung regions and used those to train the models.
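For context on the channel question: ImageNet-pretrained backbones such as ResNet-50 expect three input channels, and the conventional workaround for grayscale CXRs is to replicate the single channel rather than zero-fill. A small numpy sketch of the two options (function and mode names are illustrative, not taken from the paper):

```python
import numpy as np

def to_three_channel(gray, mode="replicate"):
    """Convert a single-channel CXR of shape (H, W) to shape (H, W, 3).

    mode="replicate": copy the grayscale image into all three channels,
    the usual choice for ImageNet-pretrained backbones.
    mode="zero_pad": place the image in one channel and fill the other
    two with zeros, as the reviewed paper appears to do.
    """
    if mode == "replicate":
        return np.stack([gray, gray, gray], axis=-1)
    if mode == "zero_pad":
        out = np.zeros((*gray.shape, 3), dtype=gray.dtype)
        out[..., 0] = gray
        return out
    raise ValueError(f"unknown mode: {mode}")
```

With zero-padding, two-thirds of the first convolution's input weights see only zeros, which is one reason the choice deserves justification.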

What limitations in the existing literature are the authors proposing to overcome? To demonstrate generalization, the authors should also test performance on a held-out collection that played no part in training or validation.

COVID-19 pneumonia is a viral pneumonia; however, pneumonia is also caused by bacterial, non-COVID viral, and other pathogens. The authors need to demonstrate how the models behave across these distinct types of pneumonia.

What do the authors mean by "one of the last layers"? How did the authors optimize feature selection? It is not clear whether the authors performed online or offline augmentation. If offline, what is the percentage increase in the training data? What is the performance with and without augmentation, and are the performance differences statistically significant?
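To make the online/offline distinction concrete, here is a toy sketch (the flip "transform" and function names are illustrative, not the authors' pipeline): offline augmentation expands the stored training set once before training, so its size increase can be reported directly, while online augmentation leaves the dataset size unchanged and re-randomizes transforms every time an image is drawn.

```python
import random

def augment(image, rng):
    """Toy transform: randomly flip a 1-D 'image' (a list of pixels)."""
    return image[::-1] if rng.random() < 0.5 else list(image)

def offline_augment(dataset, copies, seed=0):
    """Offline: expand the dataset once, before training.
    The training set grows by copies * 100 percent."""
    rng = random.Random(seed)
    return list(dataset) + [augment(x, rng) for x in dataset for _ in range(copies)]

def online_batches(dataset, epochs, seed=0):
    """Online: the stored dataset is unchanged; a fresh random transform
    is applied each time an image is drawn, so every epoch sees
    potentially different variants."""
    rng = random.Random(seed)
    for _ in range(epochs):
        yield [augment(x, rng) for x in dataset]
```

Reporting which of these two regimes was used (and, if offline, the resulting size increase) would answer the question raised above.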

How did the authors decide which layer to extract features from? What difference in performance is observed when extracting features from different intermediate layers, and are those differences statistically significant? The authors need to explain the reasoning behind these choices if they consider one option superior to another.
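The comparison being asked for can be sketched with a toy stand-in for the network: run one forward pass, retain every intermediate activation, and evaluate retrieval or classification from each depth. The dense toy network below is purely illustrative; in an actual ResNet-50 one would register forward hooks on the chosen modules instead.

```python
import numpy as np

def forward_with_activations(x, weights):
    """Run x through a toy stack of dense ReLU layers, keeping every
    intermediate activation so features from any depth can be compared."""
    activations = []
    h = x
    for W in weights:
        h = np.maximum(h @ W, 0.0)  # linear layer + ReLU
        activations.append(h)
    return activations

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) for _ in range(4)]  # hypothetical 4-layer net
x = rng.standard_normal((1, 8))
feats = forward_with_activations(x, weights)
# Candidate embeddings from different depths, e.g. for nearest-neighbour
# retrieval: a mid-level layer versus the last layer.
f_mid, f_last = feats[1], feats[-1]
```

Evaluating retrieval quality for each candidate depth, with a significance test on the differences, would directly answer the layer-choice question.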

The authors state that they used the highest possible resolution for this study; however, they also mention cropping the images to 1024 x 1024 pixels, which contradicts that claim. What is the effect of increasing or decreasing the input spatial resolution?

How did the authors make sure that the model arrives at the right decision for the right reason? It is important to perform saliency tests and compute class activation maps to localize the regions of interest learned by the pretrained models. The issues with reporting such high performance in detecting image-level COVID-19 labels with publicly available COVID-19 collections have been discussed extensively in the literature: owing to severe class imbalance and other constraints, these CNN models often learn the characteristics of the datasets rather than of the disease.
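As one concrete option, the class activation map (CAM) of Zhou et al. applies when a network ends in global average pooling followed by a linear classifier, as ResNet-50 does: the heatmap for a class is the classifier-weighted sum of the final convolutional feature maps. A minimal numpy sketch (array shapes are illustrative):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Class activation map for a net ending in global average pooling
    plus a linear classifier.

    feature_maps: (K, H, W) output of the last conv layer for one image.
    fc_weights:   (num_classes, K) weights of the final linear layer.
    Returns an (H, W) heatmap, normalised to [0, 1], highlighting the
    regions that drove the prediction for class_idx.
    """
    w = fc_weights[class_idx]                    # (K,)
    cam = np.tensordot(w, feature_maps, axes=1)  # sum_k w_k * f_k(x, y)
    cam = np.maximum(cam, 0.0)                   # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                    # scale to [0, 1]
    return cam
```

Upsampled to the input size and overlaid on the CXR, such maps would show whether the model attends to lung fields or to dataset artifacts such as markers and borders.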
