
Review 1: "SARS-CoV-2 RECoVERY: a multi-platform open-source bioinformatic pipeline for the automatic construction and analysis of SARS-CoV-2 genomes from NGS sequencing data"

Reviewer: Martin Höelzer (Robert Koch Institute) 📒📒📒 ◻️◻️

Published on Apr 14, 2022
This Pub is a Review of
SARS-CoV-2 RECoVERY: a multi-platform open-source bioinformatic pipeline for the automatic construction and analysis of SARS-CoV-2 genomes from NGS sequencing data
Description

Abstract

Background: Since its first appearance in December 2019, the novel Severe Acute Respiratory Syndrome Coronavirus type 2 (SARS-CoV-2) has spread worldwide, causing an increasing number of cases and deaths (35,537,491 and 1,042,798, respectively, at the time of writing; https://covid19.who.int). Similarly, the number of complete viral genome sequences produced by Next Generation Sequencing (NGS) has increased exponentially. NGS enables rapid accumulation of a large number of sequences. However, bioinformatics analyses are critical and require combined approaches for data analysis, which can be challenging for non-bioinformaticians.

Results: A user-friendly, sequencing-platform-independent bioinformatics pipeline named SARS-CoV-2 RECoVERY (REconstruction of CoronaVirus gEnomes & Rapid analYsis) has been developed to build SARS-CoV-2 complete genomes from raw sequencing reads and to investigate variants. The genomes built by SARS-CoV-2 RECoVERY were compared with those obtained using other available software, revealing comparable or better performance of SARS-CoV-2 RECoVERY. Depending on the number of reads, complete genome reconstruction and variant analysis can be achieved in less than one hour. The pipeline was implemented in the multi-usage open-source Galaxy platform, allowing easy access to the software and providing computational and storage resources to the community.

Conclusions: SARS-CoV-2 RECoVERY is software intended for the scientific community working on SARS-CoV-2 phylogeny and molecular characterisation, providing a performant tool for the complete reconstruction and variant analysis of the viral genome. Additionally, the simple software interface and the ability to use it through a Galaxy instance, without the need to implement computing and storage infrastructures, make SARS-CoV-2 RECoVERY a resource also for virologists with little or no bioinformatics skills.

Availability and implementation: The pipeline SARS-CoV-2 RECoVERY (REconstruction of COronaVirus gEnomes & Rapid analYsis) is implemented in the Galaxy instance ARIES (https://aries.iss.it).

RR:C19 Evidence Scale rating by reviewer:

  • Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.

***************************************

Review:


Here, the authors aim to provide an all-in-one pipeline for the reconstruction of SARS-CoV-2 genomes from different sequencing technologies. While such easy-to-use pipelines are needed by the worldwide community to rapidly reconstruct genomes for molecular surveillance and the detection of emerging variants, they also need to be accurate to support decision-making even based on single nucleotide changes. Unfortunately, I think that the pipeline in its current state does not produce high-quality genome sequences. While the tools used seem reasonable to some extent for short-read data, they will fail for the reconstruction of accurate genomes from Nanopore data.

Thus, I highly recommend either focusing the pipeline on short reads only or including proper analysis steps and tools that also support Nanopore data. In the current state, I would not recommend the pipeline for Nanopore data at all.

Major

[1] Read quality analysis and trimming

As described, the authors use Trimmomatic for basic read QC. However, the removal of remaining 5’ and/or 3’ adapter sequences or primer sequences, in particular for Illumina protocols, is a crucial step that can also impact mapping and variant calling if not properly done. I recommend adding additional functionality for adapter trimming and primer clipping. For example, adapter trimming can be performed via fastp (which could also be a general replacement for Trimmomatic in terms of speed) while providing the adapter sequences in FASTA format. For primer clipping (e.g. with Illumina’s CleanPlex protocol), I can recommend bamclipper. This might complicate the workflow but is crucial for specific sequencing protocols such as those involving amplicons. Regarding Nanopore reads: do the authors also trim them with Trimmomatic? If so, there is no need; normally, Nanopore data is only filtered by length. For example, many labs use the well-established ARTIC amplicon protocol and select only reads between 400-700 nt (V3 protocol) for further processing.
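The length filter described for ARTIC V3 Nanopore reads is simple to express. A minimal Python sketch (illustrative only, not part of the pipeline under review; the read records are made up):

```python
def filter_fastq_by_length(records, min_len=400, max_len=700):
    """Keep reads whose sequence length falls within [min_len, max_len],
    mimicking the length filter commonly applied to ARTIC V3 Nanopore reads.
    Each record is a (header, sequence, quality) tuple."""
    return [(header, seq, qual) for header, seq, qual in records
            if min_len <= len(seq) <= max_len]

# Example: three toy reads of length 350, 500 and 800 nt
reads = [
    ("@read1", "A" * 350, "I" * 350),
    ("@read2", "A" * 500, "I" * 500),
    ("@read3", "A" * 800, "I" * 800),
]
kept = filter_fastq_by_length(reads)
# Only the 500 nt read passes the 400-700 nt window
assert [h for h, _, _ in kept] == ["@read2"]
```

In practice, tools such as the ARTIC pipeline's guppyplex step apply this same windowing directly on FASTQ files.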

[2] Subtraction of human sequences

I recommend not only mapping against the human reference genome but rather generating an index from human + SARS-CoV-2 combined. Otherwise, it could happen that (short) reads map sub-optimally against the human genome, which includes a not inconsiderable number of endogenous viral elements. Do the authors map Nanopore reads with Bowtie2 as well, or with Minimap2?
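The rationale behind a combined index can be illustrated with a toy example: when reads are scored against human and SARS-CoV-2 together, each read is assigned to its best-matching reference, instead of being discarded whenever it happens to align to human at all. A hypothetical sketch (the read names and scores are invented; a real pipeline would align against one concatenated FASTA with e.g. Bowtie2):

```python
def competitive_filter(reads):
    """Toy illustration of competitive mapping against a combined
    human + SARS-CoV-2 index: a read is kept as viral only if its best
    alignment score is against the viral reference.
    `reads` maps read id -> (score_vs_human, score_vs_sars2)."""
    return [rid for rid, (h, v) in reads.items() if v > h]

scores = {
    "read1": (10, 55),  # clearly viral
    "read2": (48, 20),  # human, e.g. an endogenous viral element look-alike
    "read3": (30, 42),  # viral, but lost by human-only subtraction
}
assert competitive_filter(scores) == ["read1", "read3"]
```

Note how read3 survives: under human-only subtraction it would be removed because it aligns to human at all, even though it matches the viral genome better.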

[3] Contig assembly

I am unsure whether the authors also use SPAdes for Nanopore data. I would recommend specialised long-read assembly tools such as Flye. Besides, it is questionable whether such a de novo step is needed at all if the authors construct the consensus reference-based.

[4] Genome reconstruction

The authors use mpileup and bcftools from SAMtools for variant calling and consensus reconstruction. While these are basic tools for such tasks, there are also more sophisticated variant callers already used by the SARS-CoV-2 community, such as LoFreq, FreeBayes, or GATK. Also, parameter settings such as allele-frequency cutoffs, …, are important to consider. For Nanopore data, the procedure used will result in many false variant calls (see below).

[5] Nanopore

The pipeline is lacking important steps needed for the proper analysis of Nanopore data. After mapping reads with e.g. minimap2, polishing steps (e.g. Racon, Medaka) are needed to reduce errors in Nanopore data. Also, variant calling should not be performed with default tools such as SAMtools but rather using machine-learning models, e.g. those implemented in Medaka. If I understand Tab. 1 correctly, the results clearly show that the pipeline is not working for Nanopore data: 96% of consensus sequences with different nucleotide calls. Sure, Genome Detective is not much better (90%), but it might be that this tool is also not suitable for analysing Nanopore data?

[6] Benchmark

First of all, it is unclear which genomes the authors used from their pipeline: the reference-based or the de novo reconstructed ones? The authors report that the genomes produced are generally longer than the ones produced by CLC or Genome Detective. I wonder whether these tools perform de novo assemblies of the reads or also use a reference-guided consensus strategy. Most pipelines currently available (such as https://github.com/connor-lab/ncov2019-artic-nf, https://github.com/replikation/poreCov, https://gitlab.com/RKIBioinformaticsPipelines/ncov_minipipe, …) perform reference-based reconstructions and thus rely on the length of the (Wuhan) reference genome. Thus, the length of the consensus genome is not necessarily a meaningful quality metric. I also wonder why the pipeline on average produced longer genomes than the GISAID references, which might also be assembled reference-based. Are the reconstructions extended at the 5’ and 3’ ends of the genome?
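The point about length can be made concrete: for reference-based consensus genomes, the number of unambiguously called (non-N) bases is usually more informative than total length, which simply tracks the ~29.9 kb Wuhan reference. A small illustration (toy sequences, not data from the paper):

```python
def consensus_stats(seq):
    """Return (total length, number of non-N bases) for a consensus sequence.
    For reference-based consensus genomes, the non-N count is the more
    meaningful completeness metric, since low-coverage regions are masked
    with N but still contribute to total length."""
    seq = seq.upper()
    return len(seq), len(seq) - seq.count("N")

# Two toy consensus sequences with identical total length:
a = "ACGT" * 100 + "N" * 50      # 450 bases, 50 of them masked as N
b = "ACGT" * 100 + "ACGTA" * 10  # 450 bases, fully resolved
assert consensus_stats(a) == (450, 400)
assert consensus_stats(b) == (450, 450)
```

By total length the two genomes are indistinguishable; by non-N count, b is clearly the more complete reconstruction.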

Besides, CLC seems to perform much better than the pipeline regarding a very important metric: the percentage of consensus sequences with differing nucleotide calls.

Comments
Luca De Sabato:

Dear Dr. Martin Höelzer,

We wish to thank you for your review of our pipeline SARS-CoV-2 RECoVERY. Based on your comments, we think you reviewed version 1 of the pipeline. The current release is version 4, in which we implemented different steps and changed some of the tools. Below, please find our answers point by point.

Major

[1] Read quality analysis and trimming

As described, the authors use Trimmomatic for basic read QC. However, the removal of remaining 5’ and/or 3’ adapter sequences or primer sequences, in particular for Illumina protocols, is a crucial step that can also impact mapping and variant calling if not properly done. I recommend adding additional functionality for adapter trimming and primer clipping. For example, adapter trimming can be performed via fastp (which could also be a general replacement for Trimmomatic in terms of speed) while providing the adapter sequences in FASTA format. For primer clipping (e.g. with Illumina’s CleanPlex protocol), I can recommend bamclipper. This might complicate the workflow but is crucial for specific sequencing protocols such as those involving amplicons.

Answer: We agree with the reviewer. Since version 2, we have added iVar trim for primer removal, with a BED file containing primer information for the most commonly used SARS-CoV-2 amplification kits.
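The effect of primer clipping can be illustrated with a simplified sketch of the underlying logic (this is not iVar's actual implementation, which soft-clips primer bases in the BAM; the coordinates below are invented):

```python
def clip_primer_bases(read_start, read_seq, primers):
    """Simplified sketch of primer clipping on a mapped read.

    `read_start` is the read's 0-based reference start position and
    `primers` is a list of (start, end) reference intervals taken from a
    BED file (end exclusive). Bases falling inside any primer interval
    are masked with '.'; real tools such as iVar trim soft-clip them
    in the alignment instead of rewriting the sequence.
    """
    out = []
    for offset, base in enumerate(read_seq):
        pos = read_start + offset
        in_primer = any(s <= pos < e for s, e in primers)
        out.append("." if in_primer else base)
    return "".join(out)

# A 10-base read mapped at reference position 100; a primer spans 100-104
clipped = clip_primer_bases(100, "ACGTACGTAC", [(100, 104)])
assert clipped == "....ACGTAC"
```

The key point for variant calling: without this step, primer-derived bases would be counted as sequencing evidence at the positions they cover, biasing allele frequencies toward the primer sequence.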

Regarding Nanopore reads: do the authors also trim them with Trimmomatic? If so, there is no need; normally, Nanopore data is only filtered by length. For example, many labs use the well-established ARTIC amplicon protocol and select only reads between 400-700 nt (V3 protocol) for further processing.

Answer: Since RECoVERY version 2.5, we have removed the option to perform complete genome reconstruction from Nanopore data, as in the Italian surveillance of SARS-CoV-2 variants genomes are produced by the Illumina and Ion Torrent platforms only.

[2] Subtraction of human sequences 

I recommend not only mapping against the human reference genome but rather generating an index from human + SARS-CoV-2 combined. Otherwise, it could happen that (short) reads map sub-optimally against the human genome, which includes a not inconsiderable number of endogenous viral elements. Do the authors map Nanopore reads with Bowtie2 as well, or with Minimap2?

Answer: As the NGS data used in the Italian surveillance of SARS-CoV-2 variants come from amplicon-based amplification methods, the percentage of host reads is usually lower than 10%. In addition, since timely reconstruction of complete viral genomes is crucial for variant surveillance, we decided to apply only this step; it no longer applies to Nanopore reads, however (see the previous point).

[3] Contig assembly

I am unsure whether the authors also use SPAdes for Nanopore data. I would recommend specialised long-read assembly tools such as Flye. Besides, it is questionable whether such a de novo step is needed at all if the authors construct the consensus reference-based.

Answer: See the answer to the previous point.

[4] Genome reconstruction

The authors use mpileup and bcftools from SAMtools for variant calling and consensus reconstruction. While these are basic tools for such tasks, there are also more sophisticated variant callers already used by the SARS-CoV-2 community, such as LoFreq, FreeBayes, or GATK. Also, parameter settings such as allele-frequency cutoffs, …, are important to consider. For Nanopore data, the procedure used will result in many false variant calls (see below).

Answer: SARS-CoV-2 RECoVERY version 1 used mpileup, bcftools, and SAMtools, indeed producing many false variant calls and incomplete genomes. RECoVERY version 4 implements iVar consensus and iVar variants with the following options to reduce false nucleotide calls:

  • Minimum quality score threshold to count a base: 20

  • Minimum frequency threshold: 0.2

  • Minimum depth to call consensus: 30
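To show how these three thresholds interact, here is a simplified sketch of threshold-based consensus calling (this illustrates the general logic only, not iVar's actual implementation, which additionally emits IUPAC ambiguity codes):

```python
def call_consensus_base(pileup, min_qual=20, min_freq=0.2, min_depth=30):
    """Simplified sketch of threshold-based consensus calling.

    `pileup` is a list of (base, quality) tuples for one genome position.
    Bases below `min_qual` are ignored; if fewer than `min_depth` bases
    remain, 'N' is returned; otherwise the most frequent base is called
    only if its frequency reaches `min_freq`.
    """
    bases = [b for b, q in pileup if q >= min_qual]
    if len(bases) < min_depth:
        return "N"  # insufficient depth: mask the position
    top = max(set(bases), key=bases.count)
    if bases.count(top) / len(bases) >= min_freq:
        return top
    return "N"

# 40x coverage: 35 high-quality 'A' calls and 5 low-quality 'G' calls.
# The G's are discarded by the quality filter, leaving 35 A's (>= 30 depth).
position = [("A", 30)] * 35 + [("G", 10)] * 5
assert call_consensus_base(position) == "A"

# Only 10 bases survive the quality filter: below the depth threshold.
shallow = [("A", 30)] * 10
assert call_consensus_base(shallow) == "N"
```

The depth cutoff is what masks low-coverage amplicon dropouts as N rather than propagating unreliable calls into the consensus.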

[5] Nanopore

The pipeline is lacking important steps needed for the proper analysis of Nanopore data. After mapping reads with e.g. minimap2, polishing steps (e.g. Racon, Medaka) are needed to reduce errors in Nanopore data. Also, variant calling should not be performed with default tools such as SAMtools but rather using machine-learning models, e.g. those implemented in Medaka. If I understand Tab. 1 correctly, the results clearly show that the pipeline is not working for Nanopore data: 96% of consensus sequences with different nucleotide calls. Sure, Genome Detective is not much better (90%), but it might be that this tool is also not suitable for analysing Nanopore data?

Answer: As reported in the answers to the points above, we removed the option to analyse Nanopore data. This is dynamic work, though: should we need to re-implement Nanopore data analysis in the future, we will certainly take your suggestions and advice into consideration (and perhaps ask for more advice, given your experience with this long-read strategy). Regarding the online platform Genome Detective, we have not used it in a while and do not know whether its authors have improved the analysis of Nanopore data.

[6] Benchmark

First of all, it is unclear which genomes the authors used from their pipeline: the reference-based or the de novo reconstructed ones?

Answer: The NGS data used for SARS-CoV-2 genome reconstruction were all based on amplicon amplification methods. Most submitters do not report the software used, the options used for nucleotide calling, or the algorithm, making it difficult to assess the outcome of the comparison with other software.

The authors report that the genomes produced are generally longer than the ones produced by CLC or Genome Detective. I wonder whether these tools perform de novo assemblies of the reads or also use a reference-guided consensus strategy. Most pipelines currently available (such as https://github.com/connor-lab/ncov2019-artic-nf, https://github.com/replikation/poreCov, https://gitlab.com/RKIBioinformaticsPipelines/ncov_minipipe, …) perform reference-based reconstructions and thus rely on the length of the (Wuhan) reference genome. Thus, the length of the consensus genome is not necessarily a meaningful quality metric. I also wonder why the pipeline on average produced longer genomes than the GISAID references, which might also be assembled reference-based. Are the reconstructions extended at the 5’ and 3’ ends of the genome?

Besides, CLC seems to perform much better than the pipeline regarding a very important metric: the percentage of consensus sequences with differing nucleotide calls.

Answer: We compared only the results of the software reported in the study, due to the lack of information on the tools and pipelines used for reconstruction of the genomes downloaded from GISAID.

CLC and Genome Detective use different strategies: with CLC we used a reference-based method, while Genome Detective is based on a de novo strategy.

RECoVERY does not extend the 5’ and 3’ ends of the genome. The nucleotide calls depend on the options reported above. The difference in length between the genomes produced with RECoVERY and those produced with Genome Detective may depend on the different strategies used by the software. Similarly, the differences with CLC may depend on its algorithm; however, since these programs are black boxes, it is not possible to change the algorithm, and we only tried to set the same options used in iVar consensus.