RR:C19 Evidence Scale rating by reviewer:
Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.
The authors propose a machine learning-based method for combining quantitative readings from pairs of COVID-19 lateral flow assays and show that, using their method, it is possible to achieve higher performance than with a single assay alone. From a machine learning perspective, this resembles an ensemble, albeit a biochemical one. It is illuminating to see that combining diverse biochemical assays can improve performance just as combining diverse weak learners can produce a strong learner in machine learning ensembles.
For assays, as in machine learning, the innovation comes in the method of combination. The authors observe that when single IgG and IgM tests are combined with an OR operation, sensitivity improves relative to IgG or IgM alone, but specificity suffers. The authors use a machine learning classifier to combine IgG and IgM semi-quantitative readings in a way that enhances both sensitivity and specificity. This is a reasonable proposition: given two quantitative features, one can train a classifier to predict the target outcome from known ground truth. With just two features as input, though, the feature space is directly visualizable, and it would have been informative to see scatter plots of IgG vs. IgM features marked with the positive and negative labels. Such scatter plots would permit selection of the classifier architecture most suited to the application: if a linear decision boundary sufficed, one could use, for example, logistic regression; if a non-linear boundary were needed, an RBF-SVM, for example, would be indicated. The authors instead chose a specific implementation of gradient-boosting machines, XGBoost. This is a good general-purpose classifier, but one better suited to problems in high-dimensional feature spaces. A radial-basis-function SVM has only two hyperparameters to optimize versus roughly two dozen for XGBoost, and with the limited amount of data available, fewer is better. The SVM classifier is also eminently interpretable, by inspection of the sample points chosen as support vectors during optimization.
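To make the classifier-selection point concrete, the following sketch contrasts a linear model with an RBF-SVM on a two-feature (IgG, IgM) space. The data here are synthetic stand-ins, not the authors' measurements, and the cluster locations are illustrative assumptions:

```python
# Hypothetical sketch: choosing a classifier for a 2-D (IgG, IgM) space.
# All data below are synthetic stand-ins, not the study's measurements.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
# Synthetic semi-quantitative line intensities: positives tend to have
# higher IgG and/or IgM signal than negatives.
neg = rng.normal(loc=[0.2, 0.2], scale=0.15, size=(n, 2))
pos = rng.normal(loc=[0.8, 0.7], scale=0.25, size=(n, 2))
X = np.vstack([neg, pos]).clip(0, None)
y = np.array([0] * n + [1] * n)

# A linear boundary (logistic regression) vs. a non-linear one (RBF-SVM):
lin = LogisticRegression().fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("logistic accuracy:", lin.score(X, y))
print("rbf-svm accuracy:", rbf.score(X, y))
print("support vectors:", len(rbf.support_), "of", len(X), "samples")
```

A scatter plot of `X` colored by `y` would reveal whether a linear boundary suffices before committing to a heavier model; the SVM's support vectors (`rbf.support_`) identify exactly which samples shape the decision boundary.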
The authors next consider all possible LFA pairs, applying an AND operation between the single-LFA outcomes, whose IgG and IgM lines are combined with the OR operation as described above. While this AND-OR-based LFA pairing recovers the specificity, it comes at the cost of sensitivity. As a remedy, the authors again turn to machine learning to realize a better combination of IgG and IgM readings across LFA pairs. It is assumed (though not explicitly stated) that when combining LFA pairs, four features form the input: an IgG and an IgM line from each of the constituent LFAs. A four-dimensional feature space is also readily visualizable, using a 4 x 4 lower-triangular grid whose off-diagonal panels are pairwise feature scatter plots and whose diagonal panels are per-feature histograms. These plots would go a long way toward understanding the feature space and the LFA pairing problem in a fundamental way. Reaching straight for XGBoost, the Swiss Army knife of classifiers, does nothing to reveal the structure of the problem.
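The suggested diagnostic grid is straightforward to produce. The sketch below builds it with matplotlib on synthetic stand-in data; the feature names are illustrative assumptions about how the four inputs would be labeled:

```python
# Hypothetical sketch of the 4 x 4 diagnostic grid suggested above:
# lower-triangle panels are pairwise scatter plots of the four line
# intensities colored by label, diagonal panels are per-feature
# histograms. Data and feature names are illustrative stand-ins.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 150
labels = rng.integers(0, 2, size=n)      # 0 = negative, 1 = positive
shift = labels[:, None] * 0.6            # positives carry higher signal
X = rng.normal(0.3, 0.2, size=(n, 4)) + shift
names = ["LFA-A IgG", "LFA-A IgM", "LFA-B IgG", "LFA-B IgM"]

fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for i in range(4):
    for j in range(4):
        ax = axes[i, j]
        if i == j:
            ax.hist(X[:, i], bins=20)
        elif i > j:  # lower triangle: pairwise scatter
            ax.scatter(X[:, j], X[:, i], c=labels, s=8, cmap="coolwarm")
        else:
            ax.axis("off")
        if i == 3:
            ax.set_xlabel(names[j], fontsize=7)
        if j == 0:
            ax.set_ylabel(names[i], fontsize=7)
fig.savefig("lfa_pair_features.png", dpi=100)
```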
Another issue is the choice of which XGBoost hyperparameters were optimized. One of the main levers of the bias-variance tradeoff is the number of boosting iterations, num_round; this hyperparameter was omitted from consideration. The dataset also had a fairly high level of imbalance between positive and negative samples (79 vs. 139), and yet the hyperparameter that compensates for imbalance, scale_pos_weight, went unused. Finally, the hyperparameter colsample_bytree controls the fraction of input features used in each decision tree. With just 2 or 4 input features, this hyperparameter can have a draconian impact on which features are used, and without the optimized value or a report on "feature importances", it is impossible to know how the IgG and IgM features are used to reach a combined decision.
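For reference, these concerns map onto real XGBoost parameters as follows. The parameter names below are genuine XGBoost settings; the chosen values are illustrative, not the authors':

```python
# Sketch of the XGBoost settings this review argues should have been
# tuned or reported. Parameter names are real XGBoost parameters;
# the values chosen here are illustrative, not the authors'.
n_pos, n_neg = 79, 139  # class counts reported in the study

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    # The number of boosting rounds -- a main bias-variance lever that
    # was omitted from the authors' search -- is passed separately as
    # num_boost_round to xgboost.train (n_estimators in the sklearn API).
    "max_depth": 3,
    # Compensate for the 79-vs-139 class imbalance; the conventional
    # setting is (negative count) / (positive count):
    "scale_pos_weight": n_neg / n_pos,
    # With only 2-4 input features, subsampling columns per tree can
    # drop an entire assay line from a given tree:
    "colsample_bytree": 1.0,
}
print("scale_pos_weight =", round(params["scale_pos_weight"], 2))
```

After training, `Booster.get_score(importance_type="gain")` would provide the feature-importance report the review asks for.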
Notwithstanding these detailed questions about the use of machine learning, the authors do establish some important results. With the non-machine learning combination methods, single LFA and paired LFA sensitivities cannot be directly compared because their specificities are different and fixed. Since machine learning methods generally output a (continuously valued) positivity score, it becomes possible to threshold that score to achieve any desired false positive rate (FPR), thereby providing a direct way to compare single LFA to paired LFA results at equal FPR.
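The equal-FPR comparison can be sketched in a few lines: sort the negatives' scores and place the threshold so that at most the allowed fraction of negatives falls above it. The scores and class sizes below are synthetic stand-ins (the 79/139 split echoes the study's class counts):

```python
# Minimal sketch of comparing tests at equal FPR: given continuous
# positivity scores, pick the threshold whose false positive rate on
# the negatives does not exceed a target. Scores are synthetic.
import numpy as np

def threshold_at_fpr(scores, y, target_fpr):
    """Lowest threshold t such that FPR of 'score > t' is <= target_fpr."""
    neg = np.sort(scores[y == 0])[::-1]       # negatives' scores, descending
    k = int(np.floor(target_fpr * len(neg)))  # allowed false positives
    if k >= len(neg):
        return -np.inf
    return neg[k]  # the (k+1)-th highest negative score

rng = np.random.default_rng(2)
y = np.array([0] * 139 + [1] * 79)            # class sizes from the study
scores = np.where(y == 1,
                  rng.normal(0.7, 0.2, size=y.size),
                  rng.normal(0.3, 0.2, size=y.size))

t = threshold_at_fpr(scores, y, target_fpr=0.02)
fpr = np.mean(scores[y == 0] > t)
sens = np.mean(scores[y == 1] > t)
print(f"threshold={t:.3f}  FPR={fpr:.3f}  sensitivity={sens:.3f}")
```

Applying the same target FPR to a single-LFA model and a paired-LFA model then yields the directly comparable sensitivities the review describes.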
The more significant result is that combining certain biochemical assays using machine learning enables a more powerful test. The authors show an interesting and sometimes counter-intuitive pattern in which pairing different LFAs sometimes yields significant sensitivity improvements and at other times does not. For example, individually, LFAs from vendors 6, 9, and 10 achieve sensitivities of 77%, 80%, and 88%, respectively. Pairing LFAs from vendors 9 and 10 results in a sensitivity of 89%, whereas pairing LFAs from vendors 6 and 10 gives 91% sensitivity. The methodology proposed by the authors can thus be used to make informed combinations of LFAs to achieve optimal performance for a given test. It would be interesting to explore whether further gains are possible from combining more than 2 LFAs, say 3, 4, or even all 9 of the vendor LFAs that were tested. This is a regime where the XGBoost classifier would come into its own, selecting discriminating features from among a fairly large set of input features, up to 20. There is likely a tradeoff between the cost of adding more LFAs to an "ensemble" test and the resulting performance gains; the authors have established that the optimal number is 2 or greater. The authors have reported initial results with antibody tests. It would be very useful to see this work expanded to other test types, to see whether this principle still generally holds.
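The suggested extension beyond pairs amounts to a subset search. A minimal sketch, assuming a placeholder scoring function (in practice, a cross-validated sensitivity at fixed FPR for a model trained on that subset's IgG/IgM features, minus the per-strip cost):

```python
# Hypothetical sketch of searching over LFA subsets of every size.
# evaluate() is a placeholder: a toy diminishing-returns gain minus a
# toy per-strip cost, standing in for cross-validated sensitivity at
# a fixed FPR net of the cost of a larger ensemble.
from itertools import combinations

vendors = list(range(1, 10))  # the 9 tested vendor LFAs (generic IDs)

def evaluate(subset):
    gain = 1 - 0.5 ** len(subset)  # toy diminishing returns
    cost = 0.03 * len(subset)      # toy per-strip cost
    return gain - cost

# Best achievable score at each ensemble size 1..9:
scores = {k: max(evaluate(s) for s in combinations(vendors, k))
          for k in range(1, len(vendors) + 1)}
best_k = max(scores, key=scores.get)
print("best ensemble size under this toy tradeoff:", best_k)
```

With only 9 LFAs there are just 511 non-empty subsets, so exhaustive search is feasible; the cost model, not the search, is the real open question.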
While the authors have focused on combining existing commercially available tests, this may not be the best or most practical implementation of the approach for an actual diagnostic product. Lot-to-lot variability or assay changes, which are likely outside the control of the machine learning algorithm's developer, may render the trained model invalid or heavily skewed. Implementing a product exactly as described here therefore seems a risky way to address the limitations of commercially available assays. More broadly, however, an approach that takes multiple similar assays, perhaps two strips with different characteristics controlled by the same manufacturer, and validates them together with the app may be a promising route to low-cost tests with greater accuracy.