RR:C19 Evidence Scale rating by reviewer:
Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.
***************************************
Review Summary:
The study employs five deep learning techniques, with and without late fusion, to classify CXR images as COVID-19 or non-COVID-19. The images in the dataset were acquired with the same X-ray machines and the results are reliable, but they could be investigated and discussed in more depth.
Review:
The manuscript by Goldstein et al. presents an experimental study on the use of different deep learning classification techniques (ResNet34, ResNet50, ResNet152, CheXpert and VGG16) to identify COVID-19 in X-ray images. The authors also tested an ensemble (also known as late fusion) of these techniques using a majority voting combination scheme. Before the training and testing steps, they apply a preprocessing phase composed of Data Augmentation and Lung Segmentation techniques. Furthermore, their approach includes a nearest neighbor mechanism at the end of the classification process that returns four similar images to the user, giving physicians references to previous patients whose lung findings resemble those of the analyzed image.
The proposed approach is valid and the results can be considered reliable. However, the paper offers limited novelty, since studies of this kind have already been published in the literature (Apostolopoulos and Mpesiana, 2020), (Altan and Karasu, 2020), (Brunese et al., 2020), (Civit-Masot et al., 2020), (Makris et al., 2020), and so on. Moreover, the authors did not conduct a thorough literature review and do not discuss sufficient related work.
The dataset used in the experiments is the strong point of the paper. The 2427 CXR images were collected from 1384 patients at four hospitals in Israel, all using the same portable X-ray machines. Since the CXR images were acquired with similar machines, the dataset bias in the classification results (which is a main drawback in this type of study, as shown in Maguolo and Nanni (2020)) is minimized. Nevertheless, the authors could make the dataset freely available for download, since it would be very interesting and helpful to other machine learning researchers.
The best results were achieved with the ensemble scheme, which is plausible, since the predictions of the individual models may complement each other. However, the authors could run further experiments on the possible combinations of the deep learning techniques within the ensemble. As they have only tested the combination of all the classifiers at once, it is not clear whether combining only a few of them would improve the results (perhaps even more than using all of them).
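As a minimal sketch of the subset experiment suggested above, the following Python code scores every odd-sized subset of classifiers under majority voting. The function names, the toy predictions, and the restriction to odd-sized subsets (chosen here only to avoid ties) are illustrative assumptions and are not taken from the manuscript.

```python
from itertools import combinations


def majority_vote(votes):
    """Return 1 if more than half of the 0/1 votes are 1, else 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0


def accuracy(pred, truth):
    """Fraction of predictions that match the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def evaluate_subsets(model_preds, truth):
    """Score every odd-sized subset of models under majority voting.

    model_preds: dict mapping a model name to its list of 0/1 predictions.
    Returns a list of (subset, accuracy), sorted by descending accuracy.
    """
    names = sorted(model_preds)
    results = []
    # Odd subset sizes only (an assumption of this sketch), so votes never tie.
    for k in range(1, len(names) + 1, 2):
        for subset in combinations(names, k):
            per_image_votes = zip(*(model_preds[n] for n in subset))
            fused = [majority_vote(v) for v in per_image_votes]
            results.append((subset, accuracy(fused, truth)))
    return sorted(results, key=lambda r: -r[1])


if __name__ == "__main__":
    # Hypothetical toy predictions, not results from the paper.
    truth = [1, 0, 1, 0]
    preds = {
        "model_a": [1, 0, 1, 1],
        "model_b": [1, 0, 0, 0],
        "model_c": [0, 0, 1, 0],
    }
    for subset, acc in evaluate_subsets(preds, truth):
        print(subset, acc)
```

With five classifiers this enumerates 16 odd-sized subsets, so the full comparison is cheap once the per-model predictions on the test set are stored.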
The results table is confusing, since the results in bold are not the best results, contrary to what the authors claim. The authors have also named the evaluation metrics section “statistical analysis,” which is conceptually misleading.
The Data Augmentation and Lung Segmentation techniques could be investigated in more depth. For instance, the authors could report the classification results for all the deep learning models without these preprocessing techniques, which would show how they affect the models’ learning process (the authors provide results without any preprocessing only for ResNet50). The use of Explainable AI (XAI) techniques in the lung segmentation process, such as in Teixeira et al. (2020), could help confirm that the segmentation technique is in fact contributing to the identification of pneumonia spots in the lungs.
In general, the authors did a good job with the qualitative analysis of the model. In this analysis, they illustrate the confidence of the model by computing a histogram of the classification scores and a t-distributed Stochastic Neighbor Embedding (t-SNE) visualization. This analysis makes their experimental results more reliable.
Even though the discussion is fair and the insights are valid, the paper is missing a deeper discussion that includes a non-parametric statistical analysis of the results. In the current version of the manuscript, the discussion section serves as the conclusion, since the paper has no conclusion section.
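One possible form of such a non-parametric analysis is an exact McNemar test, which compares two classifiers evaluated on the same test images using only the cases where they disagree. The sketch below is an illustrative assumption about how the comparison could be run; the review does not prescribe a specific test, and the function name and inputs are hypothetical.

```python
from math import comb


def mcnemar_exact(pred_a, pred_b, truth):
    """Exact two-sided McNemar test for two classifiers on the same samples.

    Returns (b, c, p_value), where b counts samples that only classifier A
    got right and c counts samples that only classifier B got right.
    """
    b = sum(pa == t and pb != t for pa, pb, t in zip(pred_a, pred_b, truth))
    c = sum(pa != t and pb == t for pa, pb, t in zip(pred_a, pred_b, truth))
    n = b + c
    if n == 0:
        # The classifiers never disagree on correctness: no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Two-sided binomial tail probability under H0: P(A-only correct) = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Applied to the stored test-set predictions of, for example, the ensemble and the best single model, a small p-value would indicate that the difference in their error patterns is unlikely to be due to chance alone.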
Recommendation:
Minor Revise
References:
1. Apostolopoulos, I.D. and Mpesiana, T.A., 2020. Covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks. Physical and Engineering Sciences in Medicine, p.1.
2. Altan, A. and Karasu, S., 2020. Recognition of COVID-19 disease from X-ray images by hybrid model consisting of 2D curvelet transform, chaotic salp swarm algorithm and deep learning technique. Chaos, Solitons & Fractals, 140, p.110071.
3. Brunese, L., Mercaldo, F., Reginelli, A. and Santone, A., 2020. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Computer Methods and Programs in Biomedicine, 196, p.105608.
4. Civit-Masot, J., Luna-Perejón, F., Domínguez Morales, M. and Civit, A., 2020. Deep Learning system for COVID-19 diagnosis aid using X-ray pulmonary images. Applied Sciences, 10(13), p.4640.