key: cord-0714885-ni57qv8m
authors: Bello, Xabier; Pardo-Seco, Jacobo; Gómez-Carballa, Alberto; Weissensteiner, Hansi; Martinón-Torres, Federico; Salas, Antonio
title: CovidPhy: A tool for phylogeographic analysis of SARS-CoV-2 variation
date: 2021-08-20
journal: Environ Res
DOI: 10.1016/j.envres.2021.111909
sha: 8a68c0f03bfbf17c4aa85fd3268cf8aee3bb20ba
doc_id: 714885
cord_uid: ni57qv8m

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the pathogen responsible for the coronavirus disease 2019 (COVID-19) pandemic. SARS-CoV-2 genomes have been sequenced on a massive scale worldwide and are now available in different public genome repositories. There is much interest in generating bioinformatic tools capable of analyzing and interpreting SARS-CoV-2 variation. We have designed CovidPhy (http://covidphy.eu), a web interface that can process SARS-CoV-2 genome sequences in plain fasta text format or provided through identity codes from the Global Initiative on Sharing Avian Influenza Data (GISAID) or GenBank. CovidPhy aggregates information available in the large GISAID database (>1.49M genomes). Sequences are first aligned against the reference sequence, and the interface then provides different sources of information, including automatic classification of genomes into a pre-computed phylogeny, phylogeographic information, haplogroup/lineage frequencies, and sequence variation, indicating also whether the genome contains known variants of concern (VOC). Additionally, CovidPhy allows searching for variants and haplotypes entered by the user, and it includes a list of genomes that are good candidates for being responsible for large outbreaks worldwide, most likely mediated by important superspreading events, indicating their possible geographic epicenters and their relative impact as recorded in the GISAID database.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded RNA virus responsible for the coronavirus disease 2019 pandemic. There has been massive interest in sequencing genomes from coronavirus circulating in COVID-19 patients worldwide since it was first sequenced in December 2019. Genomes are stored in public repositories such as GenBank (https://www.ncbi.nlm.nih.gov/sars-cov-2/) or, more specifically, in The Global Initiative on Sharing Avian Influenza Data (GISAID; https://www.gisaid.org) (Shu and McCauley 2017). During the last few months, several software applications and web tools have been developed that aim at understanding SARS-CoV-2 variation as well as its dissemination on a worldwide scale. One of the most popular tools is Nextstrain (https://nextstrain.org; Hadfield et al. 2018), which provides a maximum likelihood phylogeny built on a massive number of SARS-CoV-2 genomes and allows investigation of the phylodynamics of the virus since the beginning of the pandemic. However, the nomenclature of the Nextstrain phylogeny is limited to a few nodes among hundreds, it does not follow systematic criteria for naming clades, and it is not stable, since it has undergone several changes over the last few months. Moreover, mutational pathways along branches from the root to a given node can only be reconstructed partially in most of the tree; therefore, although very informative from the phylodynamics point of view, the phylogeny does not allow a genome to be classified into a phylogenetic node, with the exception of a few.

We have developed CovidPhy (www.covidphy.eu), a web tool for processing and analyzing complete SARS-CoV-2 genomes. CovidPhy implements a pipeline that accepts newly generated sequencing fasta files, as well as the identification codes of genomes stored in the GISAID or GenBank repositories. It classifies genomes into main phylogenetic nodes and offers information on viral variants and clade frequencies worldwide.

Inspecting the large GISAID database makes it possible to identify specific SARS-CoV-2 sequences as strong candidates for being responsible for notable COVID-19 outbreaks. The first attempt was carried out by Gómez-Carballa et al. (2020a); in this early article, we explored the database available at that time (containing >4.7K SARS-CoV-2 genomes) for identical genomes that

CovidPhy uses data from GISAID to compute variant and clade (haplogroup) frequencies and infer SARS-CoV-2 candidates responsible for outbreaks.
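As a rough illustration of what computing clade frequencies from a genome database involves, the following Python sketch tallies the relative frequency of each clade within each region. It is not the CovidPhy implementation (which is written in Nim); the function name `clade_frequencies`, the `(region, clade)` record layout, and the toy labels are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def clade_frequencies(records):
    """Return {region: {clade: relative frequency}}.

    `records` is an iterable of (region, clade) pairs, one per genome.
    This record layout is a hypothetical simplification.
    """
    counts = defaultdict(Counter)
    for region, clade in records:
        counts[region][clade] += 1

    frequencies = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        frequencies[region] = {clade: n / total for clade, n in counter.items()}
    return frequencies


# Toy data; region names and clade labels are illustrative only.
records = [
    ("Europe", "B.1"), ("Europe", "B.1"), ("Europe", "A"), ("Europe", "B.1.1.7"),
    ("Asia", "A"), ("Asia", "B"),
]
frequencies = clade_frequencies(records)
for region, table in frequencies.items():
    print(region, table)
```

Because such tables depend only on the database snapshot, they can be computed once and stored, so that queries become simple lookups.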

Genomes are aligned against the reference genome with GenBank accession 

The whole stack is written in the Nim programming language. Nim is one of the best performing languages (https://github.com/kostya/benchmarks; https://github.com/def-/nim-benchmarksgame), usually on par with C, without sacrificing readability and expressiveness. The graphics for the webpage are created using Plotly (https://plotly.com/javascript/), interfaced using Nim both in the frontend and in the backend. The web stack is a single binary, but the core (the aligner and the classifier) is decoupled, and it can therefore also be reused to build a graphical user interface (GUI) and a command line interface (CLI) (Figure 1). We provide three main programs in the repository:

 covidphy, the web server that can be reached at www.covidphy.eu;

 covidphy_cli, a command line interface that is used to classify sequences in our internal pipelines;

 covidphy_gui, a simple graphical interface for those unfamiliar with the CLI, who may prefer to select the input with buttons.
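To give an idea of what a variant-based classifier in such a decoupled core might do, the Python sketch below assigns a genome to the deepest node of a pre-computed tree whose full mutational pathway (all defining variants from the root down to that node) is present in the genome. This is an assumed, simplified strategy for illustration; the actual CovidPhy classifier is written in Nim and its rules are not described here. The tree, the node names (`N1`, `N1a`, `N2`), and the variant labels are all hypothetical.

```python
def classify(sample_variants, tree):
    """Return the deepest node whose root-to-node pathway is fully matched.

    `tree` maps node -> (parent, set of variants defining that branch);
    the root has parent None. Ties between equally deep matches are not
    resolved here (the first one found is kept).
    """
    def pathway(node):
        # Collect every defining variant from this node back to the root.
        variants = set()
        while node is not None:
            parent, defining = tree[node]
            variants |= defining
            node = parent
        return variants

    best, best_size = None, -1
    for node in tree:
        path = pathway(node)
        if path <= sample_variants and len(path) > best_size:
            best, best_size = node, len(path)
    return best


# Hypothetical toy phylogeny; variant labels are made up for the example.
tree = {
    "root": (None, set()),
    "N1": ("root", {"C100T"}),
    "N1a": ("N1", {"G200A"}),
    "N2": ("root", {"A300G"}),
}

print(classify({"C100T", "G200A", "T999C"}, tree))
print(classify({"G200A"}, tree))
```

In the second call the genome carries a variant of `N1a` but not the upstream variant of `N1`, so the sketch falls back to the root; a production classifier would need explicit rules for reversions, missing data, and recurrent mutations.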

Outbreak candidates were investigated in GISAID by searching for identical haplotypes detected at least 20 times in a period of five consecutive days in a specific country or state. Once an outbreak is detected, we determine its length by adding consecutive days after the identified event, so long as the displaced 5-day window meets the previous condition of at least 20 identical sequences; alternatively, we reduce the interval at its extremes if the number of counts there equals 0, while still satisfying the minimum criterion of at least 20 identical haplotypes in the shortened period. These analyses were carried out using R software (R Core Team 2019). It takes about 0.1 seconds to align a single genome and carry out the clade classification; the rest of the analysis is even faster because the results displayed have been precomputed.
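The sliding-window rule described above can be sketched as follows. The original analysis was done in R; this Python version is only one possible reading of the procedure, not the authors' code. The function name `find_outbreaks`, the input layout (daily counts of one identical haplotype in one country or state), and the interpretation of "reducing the interval" as trimming zero-count days at both ends are assumptions.

```python
from datetime import date, timedelta

WINDOW_DAYS = 5   # length of the sliding window, as stated in the text
MIN_COUNT = 20    # minimum number of identical genomes per window


def find_outbreaks(daily_counts):
    """Return a list of (start_date, end_date, total_genomes).

    `daily_counts` maps a date to the number of identical genomes sampled
    that day for ONE haplotype in ONE location (hypothetical layout).
    """
    if not daily_counts:
        return []
    first, last = min(daily_counts), max(daily_counts)
    n_days = (last - first).days + 1
    series = [daily_counts.get(first + timedelta(days=k), 0) for k in range(n_days)]

    outbreaks = []
    i = 0
    while i < n_days:
        if sum(series[i:i + WINDOW_DAYS]) >= MIN_COUNT:
            start = i
            # Displace the window one day at a time while it still qualifies.
            j = i
            while j + 1 < n_days and sum(series[j + 1:j + 1 + WINDOW_DAYS]) >= MIN_COUNT:
                j += 1
            end = min(j + WINDOW_DAYS - 1, n_days - 1)
            # Trim zero-count days at both extremes; the total is unchanged,
            # so the shortened interval still holds at least MIN_COUNT genomes.
            while series[start] == 0:
                start += 1
            while series[end] == 0:
                end -= 1
            total = sum(series[start:end + 1])
            outbreaks.append((first + timedelta(days=start),
                              first + timedelta(days=end), total))
            i = j + WINDOW_DAYS  # continue after the last qualifying window
        else:
            i += 1
    return outbreaks


# Toy daily counts for one haplotype in one location.
counts = {
    date(2020, 3, 1): 2, date(2020, 3, 2): 8, date(2020, 3, 3): 7,
    date(2020, 3, 4): 6, date(2020, 3, 10): 1,
}
outbreaks = find_outbreaks(counts)
print(outbreaks)
```

For the toy data, the windows starting on 1 March and 2 March both reach the threshold, the window starting on 3 March does not, and the trailing empty days are trimmed, giving a single candidate outbreak from 1 to 4 March with 23 genomes.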

While sequence codes from NCBI can be retrieved automatically from the website, GISAID does not allow sharing genomes stored in their database; therefore, to investigate these sequences, the user must first register on the GISAID platform and download them directly. When a GISAID code is entered, CovidPhy only provides information on its haplogroup assignment and frequency, but not on the variants involved. The lineage assignment of the candidate genome is provided, and the continental frequency of this haplogroup can also be displayed graphically on a map.

The COVID-19 pandemic has impacted every region of the world. There is much interest in investigating and tracking the evolutionary characteristics of SARS-CoV-2 variants and lineages. Apart from the interest of CovidPhy for research, there is also demand from, e.g., microbiological units in hospitals that lack the bioinformatic tools needed to process the genome sequences they generate on a daily basis. There are similar tools available that can process SARS-CoV-2 genome sequences and carry out different kinds of analyses. Compared to previous developments, CovidPhy offers additional features. For instance, it automatically classifies a given genome into a clade of a SARS-CoV-2 phylogeny and provides phylogeographic information for this genome. Additionally, it allows variant searches in the large GISAID database, providing information on the frequencies of haplotypes containing the queried variation. It also provides information on lineages that have played a critical role in the dispersal of the SARS-CoV-2 pathogen by initiating rapid and sudden outbreaks across the world.

CovidPhy has been specifically designed for handling information on SARS-CoV-2 sequences, but it can be easily extended to other microorganisms of interest for which large datasets are available (e.g., influenza in GISAID).

We gratefully acknowledge GISAID and contributing laboratories for giving us access to the SARS-CoV-2 genome database. This study received support from projects: GePEM (Instituto de Salud Carlos III (ISCIII)/PI16/01478/Cofinanciado

Clustering and superspreading potential of SARS-CoV-2 infections in Hong Kong

SARS-CoV-2: Opportunities for interventions and control

'A bloody mess': Confusion reigns over naming of new COVID variants

Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England

Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders

Phylogeography of SARS-CoV-2 pandemic in Spain: a story of multiple introductions, micro-geographic stratification, founder effects, and super-spreaders

An online coronavirus analysis platform from the National Genomics Data Center

Nextstrain: real-time tracking of pathogen evolution

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

Phylogenetic analysis of SARS-CoV-2 in Boston highlights the impact of superspreading events

CoV-Seq, a New Tool for SARS-CoV-2 Genome Analysis and Visualization: Development and Usability Study

Secondary attack rate and superspreading events for SARS-CoV-2

Pitfalls of barcodes in the study of worldwide SARS-CoV-2 variation and phylodynamics

R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Addendum: A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Superspreading: The engine of the SARS-CoV-2 pandemic

VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank

GISAID: Global initiative on sharing all influenza data - from vision to reality

Genetic structure of SARS-CoV-2 reflects clonal superspreading and multiple independent introduction events

VIGOR, an annotation program for small viral genomes

HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing

A new coronavirus associated with human respiratory disease in China

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.