key: cord-0908103-whmawr4q
authors: Song, Miranda; Moshiri, Niema
title: An Analysis of SARS-CoV-2 Using ViReport
date: 2020-06-21
journal: bioRxiv
DOI: 10.1101/2020.06.20.163162
sha: 6b10d7298368950df56ceaf52bd4cf7dd5dc0e94
doc_id: 908103
cord_uid: whmawr4q

The ongoing outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has resulted in millions of cases and hundreds of thousands of deaths. Given the current lack of treatments or vaccines available, it may be useful to trace the evolu-tion and spread of the virus to better develop methods of preventative intervention. In this study, we analyzed over 4,000 full genome sequences of human SARS-CoV-2 using novel tool ViReport [13], an automated workflow for performing phylogenetic analyses on viral sequences and generating comprehensive molecular epidemiologi-cal reports. The complete ViReport output can be found at https://github.com/mirandajsong/ViReport-SARS-CoV-2.

Coronaviruses are a group of RNA viruses that cause various diseases in humans, other mammals, and birds. In the last two decades, there have been three documented strains of coronavirus that have caused major epidemics in humans. These include severe acute respiratory syndrome coronavirus (SARS-CoV) in 2002, Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012, and most recently, a novel coronavirus referred to as COVID-19 or SARS-CoV-2 that emerged in Wuhan, China in December 2019. As of June 2020, there have been over 8 million confirmed cases of SARS-CoV-2 globally [3] .

The rapid spread of SARS-CoV-2 has demonstrated the need to understand how the virus spreads and for measures of epidemic intervention such as stay-at-home orders and social distancing. To gain insight into disease transmission and origins, phylogenetic analyses are often performed using bioinformatics methods to quantify and visualize the relatedness of viral sequences. A typical workflow for phylogenetic analysis and reconstruction involves four main steps: (1) Data Acquisition, (2) Sequence Alignment, (3) Phylogenetic Inference, and (4) Tree Visualization [6] .

Acquiring data involves identifying homologous nucleotide or protein sequences that are representative of the point of interest, whether that be a specific gene or species or something else. This step may involve additional preprocessing (e.g. filtering by length or sample type) to ensure that the sequences are valid and that the inferred tree will provide meaningful results. There are many public databases that provide viral (and other) sequences and accompanying metadata, including NCBI (ncbi.nlm.nih.gov), GISAID (gisaid.org), and ViPR (viprbrc.org).

Multiple sequence alignment (MSA) refers to the alignment of related biological sequences of similar length, the output of which may be used to infer more specific evolutionary relationships. Given the classification of MSA computation as an NP-complete problem, there are many existing algorithms that solve this problem [1] . Examples include MAFFT [7] , MUSCLE [5] , and ViralMSA [15] , all of which are implemented in ViReport, along with others.

Phylogenetic inference is the reconstruction of evolutionary trees by grouping individual elements based on shared ancestry. Estimation of phylogenetic trees can be performed through various methods such as Bayesian Inference and Maximum Likelihood [6] , which have spawned tools such as BEAST [4] and RAxML [17] , respectively. The resulting phylogenies can be input into online tree visualization tools for further examination.

While there exist various tools that seek to automate this process such as MEGA5 [18] and Phylogeny.fr [2] , ViReport is robust in its integration of modules that allows for the user to select their desired tool for each stage in the workflow. Furthermore, ViReport includes options for rooting and dating the inferred phylogeny, calculating pairwise distances, ancestral sequence reconstruction, and transmission clustering, all of which provide additional information beyond the tree itself. 

ViReport was used to preprocess the data, perform the MSA and phylogenetic inference steps, and produce the input tree files for tree visualization.

For this study, sequences were retrieved from the NCBI Virus database, which hosts data from GenBank. Nucleotide sequences were selected for using the SARS-CoV-2 Virus specification and by filtering out sequences under 26,000 bases as to only include full genome sequences [22] . Sequence metadata was also compiled, specifically that regarding the coun-try of origin and the collection date.

Multiple Sequence Alignment was performed using ViralMSA [15] utilizing Minimap2 [10] and the reference sequence with accession ID NC 045512, defined as "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1" and dated December 2019. Pairwise sequence distances were calculated using the tn93 tool from HIV-TRACE [8].

FastTree 2 [16] was used to infer a maximum-likelihood phylogeny under the General Time-Reversible model [19] using a Gamma20-based likelihood. The phylogeny was MinVarrooted using FastRoot [11] and dated using LSD2 [20] . Pairwise phylogenetic distances were calculated using TreeSwift [14] .

The phylogenies were visualized using iTOL [9] and IcyTree [21] . Comparison of the visualization of the ViReport output and the figure from this study [23] shows that the results are very similar, despite the two analyses having used different tools (ViralMSA and IQ-TREE for ViReport, MUSCLE and RAxML for the article [23] ), as the overall clades are nearly identical. Both analyses indicate that, while SARS-CoV, MERS, and SARS-CoV-2 fall into distinct clades, SARS-CoV-2 is more closely related to SARS-CoV, as both fall under the subgenus Sarbecovirus, whereas MERS falls under a different subgenus Merbecovirus. Additionally, SARS-CoV-2 is most closely related to a SARS-like coronavirus found in bats. We can see from the histogram above that the peak of the sample distribution falls around late March 2020. Sequences were also grouped by geographical location, as shown below. We can see from the distribution that nearly 40 countries were represented in the dataset, with the five highest sequence counts from the U.S., Australia, Hong Kong, Germany, and China. Incidentally, as of June 2020, the top five countries in terms of total number of confirmed cases are the U.S., Brazil, Russia, India, and the United Kingdom [3] . While there does not seem to be significant correlation between sequence counts and confirmed cases, which could be the result of sampling bias, it is notable that the U.S. has the highest counts for both metrics. The following maximum-likelihood phylogeny was inferred, with an average pairwise phylogenetic distance of 0.000381 and a standard deviation of 0.000208. The maximum pairwise phylogenetic distance (i.e. tree diameter) was 0.00505, which describes the distance between the two least closely related sequences. Given the large number of sequences included, it is evidently difficult to visualize the specific relationships between sequences in the tree. These individual relationships can be visualized by inputting the rooted tree file to iTOL, but the tree is too large to include in this report.

The pairwise phylogenetic distances were also used for transmission clustering using TreeN93 [12] , which resulted in 736 clusters and 2535 singletons. The average cluster size was 2.853 with a standard deviation of 2.0284, and cluster sizes ranged from 2-30. 

Some considerations of the limitations of this analysis include the dataset used; given that there are well over 8 million confirmed cases of SARS-CoV-2 in the world (as of June 2020) but the sample only included around 4500 sequences, the sample is evidently a very small subset that may not be representative of current events both in geographical distribution and time. For example, the UK has the fifth most cases in the world [3] but is not listed as one of the categories represented by the sample. As the viral sequence database updates with new sequences or sequences from other sources are pooled together, the analysis may become more representative.

Future directions for this work include additional analysis of the generated statistics; a topic of interest is how the sequences are distributed by date for each country, as this may be indicative of how successful policies in place are in terms of managing the ongoing epidemic. Another topic is exploring the distribution of pairwise distances over time, which may lead to conclusions regarding the mutation rate of the SARS-CoV-2 virus. With regard to ViReport, additional work is being done to analyze potential differences in results based on which tools are selected at each step.

All genome sequences are available at NCBI Virus. The complete ViReport output can be found at https://github.com/mirandajsong/ViReport-SARS-CoV-2.

Multiple sequence alignment modeling: methods and applications

Phylogeny.fr: robust phylogenetic analysis for the non-specialist

An interactive web-based dashboard to track COVID-19 in real time

BEAST: Bayesian evolutionary analysis by sampling trees

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Building Phylogenetic Trees from Molecular Data with MEGA. Molecular Biology and Evolution

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens

Interactive Tree Of Life (iTOL) v4: recent updates and new developments

Minimap2: pairwise alignment for nucleotide sequences

Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction

TreeN93: a non-parametric distance-based method for inferring viral transmission clusters. bioRxiv, page 383190

TreeSwift: A massively scalable Python tree package. Soft-wareX

ViralMSA: Massively scalable reference-guided multiple sequence alignment of viral genomes. bioRxiv

FastTree 2 -Approximately Maximum-Likelihood Trees for Large Alignments

RAxML version 8: a tool for phylogenetic analysis and postanalysis of large phylogenies

MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences

Fast Dating Using Least-Squares Criteria and Algorithms

IcyTree: rapid browser-based visualization for phylogenetic trees and networks

Coronavirus genomics and bioinformatics analysis. Viruses

A Novel Coronavirus from Patients with Pneumonia in China