RR\ID Evidence Scale rating by reviewer:
Reliable. The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.
***************************************
Review: Overall, the paper seems statistically sound to me. The main message is that compositional data analysis can somewhat, but not drastically, improve upon naive forecasts of influenza subtype composition one year ahead. This is not a very strong claim (the authors speak of a "proof-of-concept") and despite some limitations I'd expect similar results in an "ideal study" of the presented methods. An ideal study might in addition identify stronger prediction approaches, possibly leveraging biological knowledge on virus competition and evolution.
In the following I provide specific points the authors may want to consider.
Major:
The evaluation period (2017–2019) is rather short. 81 countries are considered, which increases the number of forecasts to be evaluated. However, as there are 2 to 3 clusters of countries with synchronous patterns, the effective number of forecast/observation pairs is much lower (less independent information is available). As the authors rightly argue, it is hard to extend the evaluation period because of the surrounding influenza and COVID-19 pandemic periods. Still, some discussion might be added on how this difficulty affects the robustness of the results.
In the descriptive part of the study I’d recommend presenting things on the natural percentage scale rather than the log-ratio transformation, which I find hard to interpret. For instance, Figure 1D is pretty, but I struggle to grasp the presented information. A simple series of stacked barplots showing the proportions would be easier to read. Figure 2 is even harder to make sense of. A “matrix” of small stacked barplots (countries in rows, years in columns) for selected countries ordered by cluster may be an alternative.
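To make the suggestion concrete, a minimal sketch of such a stacked barplot with matplotlib follows. All counts are made-up placeholder values, not taken from the paper, and the subtype labels simply mirror the three categories discussed there.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

seasons = ["2015", "2016", "2017", "2018", "2019"]
# hypothetical specimen counts per subtype (rows: B, H1, H3), illustrative only
counts = np.array([
    [120,  80, 200,  60, 150],   # B
    [300, 150,  90, 400, 100],   # H1
    [180, 500, 310, 140, 350],   # H3
], dtype=float)
# convert to proportions: each season (column) sums to 1
props = counts / counts.sum(axis=0)

fig, ax = plt.subplots()
bottom = np.zeros(len(seasons))
for label, row in zip(["B", "H1", "H3"], props):
    ax.bar(seasons, row, bottom=bottom, label=label)
    bottom += row  # stack the next subtype on top
ax.set_ylabel("proportion of typed specimens")
ax.legend()
# fig.savefig("subtype_composition.png") would export the figure
```

The same loop, placed inside a grid of subplots (countries in rows, years in columns), would give the suggested small-multiples “matrix”.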
The presented methods are purely statistical, while in the literature on vaccine composition people seem to think more in terms of evolution and biological dynamics. It would be great to read some discussion about what the described methods from compositional data analysis can add and what would be needed to move beyond the observed predictive performance.
I feel there is somewhat more existing related work than is discussed in the introduction. The authors mention the question of vaccine composition, but the link between this task and the corresponding literature could be developed more clearly. The following references may be relevant:
This series of preprints may be an interesting point of comparison. The focus seems to be a bit different, but maybe some relevant statements can be extracted.
Strain-specific within-season forecasts:
The authors split years in April, which means that the Southern Hemisphere influenza season and the following Northern Hemisphere season are lumped together. Would results look different if the split were made in, say, October, lumping seasons together the other way around? More generally, I wonder whether a half-yearly perspective would be more helpful.
The chosen transformation seems somewhat arbitrary, and the authors could strengthen their argument by stating some of its properties. Are the results invariant to the order in which subtypes are fed into the procedure? And what exactly is the “epidemiological interpretation”?
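To illustrate why the ordering question matters, here is a small sketch using a generic additive log-ratio (alr) transform as a stand-in (this is an assumption for illustration; it is not necessarily the transform the paper uses). The coordinates depend on which subtype is chosen as the reference, so any downstream quantity that is not invariant to this choice inherits an arbitrary ordering decision.

```python
import numpy as np

def alr(p, ref):
    """Additive log-ratio transform of composition p, with component
    `ref` as the reference (a stand-in for the paper's transform)."""
    p = np.asarray(p, dtype=float)
    other = np.delete(p, ref)
    return np.log(other / p[ref])

comp = np.array([0.2, 0.3, 0.5])  # (B, H1, H3), made-up proportions

coords_ref_H3 = alr(comp, ref=2)  # H3 as reference
coords_ref_B = alr(comp, ref=0)   # B as reference
# The two coordinate vectors differ, even though they encode the
# same composition.
```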
Related to the above, I am unsure about the mixing score. It is probably a useful proxy for what the authors want to describe, but I see no specific reason to define it this way, and some questions remain, e.g.:
How does the score depend on the transformation used?
Does the score depend on the composition of the non-dominant subtypes? I think it should not, but it is unclear to me if it does.
The authors try to provide some intuition by stating that “>75%, corresponds to mixing score values between -0.9 and -0.8.” Wouldn’t it be much more convenient to simply use max(B, H1, H3) - 1/2, which is 1/2 for complete dominance and -1/6 for an equal split between subtypes?
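The suggested alternative is easy to check numerically; the values 1/2 and -1/6 claimed above follow directly from its definition:

```python
def max_score(p):
    """Suggested alternative dominance measure: max(B, H1, H3) - 1/2."""
    return max(p) - 0.5

complete_dominance = max_score([1.0, 0.0, 0.0])   # one subtype only
equal_split = max_score([1/3, 1/3, 1/3])          # all three equal
```

Here `complete_dominance` equals 1/2 and `equal_split` equals -1/6, matching the stated endpoints without any reference to a transformation.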
The presented methods (especially the visualizations, and to a lesser degree the models) address the special case of 3 subtypes. From a statistical perspective it would be nice to develop things in greater generality.
Minor:
"Point predictions" seems more common terminology than "punctual predictions"
It seems uncommon to me to present error margins for measures of predictive quality, and in some instances I was unsure how these were actually obtained.
I am unsure what is meant by "M5 HVAR model of lag 2 and 1" – is it order 2 and 1 or actually lag 2 in the sense of "only lag 2, but not lag 1"?
M3 average composition: even after reading different parts of the paper, it did not become clear to me whether averaging was done on the natural scale or the transformed scale, or whether the two are equivalent.