key: cord-1026204-709ogssx
authors: Tseng, Elizabeth; Zeng, Qiandong; Iyer, Lax
title: VCFCons: a versatile VCF-based consensus sequence generator for small genomes
date: 2021-02-27
journal: bioRxiv
DOI: 10.1101/2021.02.26.433111
sha: c2e91be198a1e2da3d532e1e34444cd5453c1284
doc_id: 1026204
cord_uid: 709ogssx

We had developed VCFCons to address urgent need for a robust consensus sequence generator for SARS-CoV-2 viral surveillance, which presented several unique requirements, including: (a) low coverage areas should be noted with ‘N’s, (b) low frequency or suspicious variant calls need to be filtered. We have found that, while some existing tools such as bcftools can generate the desired consensus sequence, it required multiple filtering steps and additional scripting. VCFCons can generate consensus sequences based on variant calls in a VCF format with versatile filtering criteria based on coverage and estimated variant frequency. We applied VCFCons to the Labcorp SARS-CoV-2 sequencing data and showed that it generated correct consensus sequences that were successfully submitted to GISAID and NCBI. We hope the community will find value in this tool and aim to continue developing VCFCons to handle more complex viral data in the future.

Generating consensus sequences plays an important role in sequencing data analysis. Consensus sequences are generally created with reference-based approaches. The first approach uses a VCF file (Danecek, et al., 2011) as the input to build the consensus sequences by comparison with the reference, as is done by bcftools (Li, 2011) The second approach, used by CLCBio, iVar (Grubaugh, et al., 2019) and Racon (Vaser, et al., 2017) use read alignment (BAM) files or read pileup as the input and create the consensuses directly. A third approach, employed by the pbaa (PacBio amplicon analysis, https://github.com/PacificBiosciences/pbAA) tool, uses a reference-guided (but not reference-aligned) clustering approach to generate consensus sequence, which can then be converted into a VCF file to populate additional quality information for variant filtering.

In our efforts to generate consensus sequences for SARS-CoV-2 from sequencing data reflecting the variant calls, we found existing tools to be inadequate to address our needs. Though many of the tools produced satisfactory consensus sequences, we found the following issues with the consensus sequence generation process. Racon does not accept a VCF as an input or a threshold below which an 'N' base should be used to indicate inadequately covered bases. CLCBio consensus generated an incorrect number of 'N's, resulting in frameshift errors in the consensus sequence. iVar can only generate consensus sequences from an aligned BAM file, not a VCF. bcftools can accept a VCF file but ignores allele depth (AD) information and requires a separate mask file to produce 'N's in low-coverage regions.

As a result, we have developed VCFCons to generate consensus sequences from a VCF file with an accompanying BAM read depth file. VCFCons uses VCF variant calls as the input, but also takes into account the base-level reference coverage info, plus variant call quality and ALT allele frequency.

VCFCons generates a consensus sequence based on the following criteria: VCFCons has been tested with VCF files produced by the CLC variant caller, bcftools, DeepVariant (Poplin, et al., 2018) , and pbaa (PacBio amplicon analysis). As shown in Fig 1a, different variant callers produce variable VCF information.

VCFCons is available at https://github.com/Magdoll/CoSA/

We compare the results of VCFCons with bcftools and iVar. We did not compare against Racon because it cannot generate 'N's, nor can it accept VCF as input. Given the same VCF input file (which is the filtered VCF produced by VCFCons), bcftools (`bcftools consensus`) and VCFCons both generate the correct consensus sequence. However, bcftools requires an additional step of creating filtered VCFs since it does not read in AD/DP information and also needs masked regions information to produce 'N' bases. iVar (`ivar consensus`) can only accept a pileup file, not a VCF, and further produces an incorrect consensus sequence if there are large indel variants. For example, one of the Labcorp SARS-CoV-2 samples contained a 320bp deletion and the final consensus sequence should be 29583bp based on the Wuhan reference. Both bcftools and VCFCons generated 29583bp consensus sequence, while iVar generated 29903bp, since the pileup is unable to properly detect the large deletion. In summary, VCFCons requires fewer steps to generate the same consensus sequence as bcftools, though both generate accurate results, and can handle large indels and produce 'N' bases based on a variety of VCF formats, while other tools fail to produce correct consensus sequences.

We've developed a Python script that can generate consensus sequences for SARS-CoV-2 data based on variant calls in a VCF format with versatile filtering criteria based on coverage and estimated variant frequency. Currently, VCFCons assumes each VCF output contains one major species and will only output one consensus sequence. Future updates will take into account minor species in viral sequencing data as well as potential phased variant information. 

The variant call format and VCFtools

An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

A universal SNP and small-indel variant caller using deep neural networks

Fast and accurate de novo genome assembly from long uncorrected reads

Brian Krueger for review and inputs to improve the manuscript and Labcorp and CDC COVID sequencing teams for providing the genomes which are also available in GISAID and NCBI