key: cord-1039019-6cbeamjv authors: Ling, Yunchao; Cao, Ruifang; Qian, Jiaqiang; Li, Jiefu; Zhou, Haokui; Yuan, Liyun; Wang, Zhen; Zheng, Guangyong; Zhao, Guoping; Li, Yixue; Wang, Zefeng; Zhang, Guoqing title: An interactive viral genome evolution network analysis system enabling rapid large-scale molecular tracing of SARS-CoV-2 date: 2020-12-10 journal: bioRxiv DOI: 10.1101/2020.12.09.417121 sha: 8de3ec4f67226930f2db9aee05810769fc7d517f doc_id: 1039019 cord_uid: 6cbeamjv Comprehensive analyses of viral genomes can provide a global picture on SARS-CoV-2 transmission and help to predict the oncoming trends of pandemic. This molecular tracing is mainly conducted through extensive phylogenetic network analyses. However, the rapid accumulation of SARS-CoV-2 genomes presents an unprecedented data size and complexity that has exceeded the capacity of existing methods in constructing evolution network through virus genotyping. Here we report a Viral genome Evolution Network Analysis System (VENAS), which uses Hamming distances adjusted by the minor allele frequency to construct viral genome evolution network. The resulting network was topologically clustered and divided using community detection algorithm, and potential evolution paths were further inferred with a network disassortativity trimming algorithm. We also employed parallel computing technology to achieve rapid processing and interactive visualization of >10,000 viral genomes, enabling accurate detection and subtyping of the viral mutations through different stages of Covid-19 pandemic. In particular, several core viral mutations can be independently identified and linked to early transmission events in Covid-19 pandemic. As a general platform for comprehensive viral genome analysis, VENAS serves as a useful computational tool in the current and future pandemics. The World Health Organization (WHO) characterized Coronavirus disease 2019 (COVID- 19) as a global pandemic (pandemic) on March 11, 2020 . By the end of October, more than 45 million COVID-19 cases had been diagnosed with > 1 million deaths (https://covid19.who.int/). More than 190,000 SARS-CoV-2 genomes have been sequenced and published (https://www.epicov.org/epi3/) since the end of 2019. The comprehensive analyses of these genomes could provide a global picture of how the virus transmitted among different populations, which may help predict the oncoming trends of pandemic. The main approach for molecular tracing of viral transmission is to thoroughly compare the genomes of different viral strains, leading to a series of phylogenetic trees or evolution networks that can also help to interpret the genomic mutations along transmission [1] [2] [3] . Previously Lu and colleagues used linkage disequilibrium and haplotype map to analyze the genomes of 103 SARS-CoV-2 samples, and classified the viral genomes into type L and S based on mutations at positions 8782 and 28144 [4] . In another report, Corlett et al. used phylogenomic analyses to classify 93 SARS-CoV-2 genomes into 58 haplotypes, and further clustered these virus strains into five clades [5] . Similarly phylogenetic network was also constructed by Forster et al. using 160 viral genomes, which classified the viruses into three types (A/B/C) based on the nucleotide variants at loci 8782, 26144, 28144, and 29095 and the As mentioned earlier, evolution networks have an advantage over phylogenetic trees in terms of the spatial freedom of connectivity. Comparing with previous software (e.g., POPART and Pegas) that construct phylogenetic network using haplotypes, VENAS generates evolution network based on the MAF-adjusted Hamming distances between different genome types, allowing automatic identification of major transmission paths and core genome types without manual clustering. As a validation, we first examined if the VENAS can recapitulate the SARS-CoV-2 haplotype network identified by an early study using a small number of genomes before March 2020 [4] . Two clusters of SARS-CoV-2 bearing core variations in loci 8782 and 28144 were previously identified as the types L (8782C/28144T) and S (8782T/28144C), and the community detection method in VENAS can reliably separate the L/S type into two clades with the same core mutations (comparing left and right panels in Fig. 2a ). We further expanded the VENAS to analyze the viral genomes at the beginning of global transmission using a dataset containing all entries publicly released by March 25, 2020. The evolution network containing 1,050 viral genomes were constructed by VENAS in three seconds, producing a "backbone network" with 16 major viral clades in the topological space that reflect the possible transmission paths (Fig. 2b) . Each pair of viral clades was separated by a group of core variants that also were identified by comparing PIS sites in these clades (Table 2) . Some clades and associated variants were also reported by other researchers independently. For example, the core variants C8782T/T28144C recognized as the PIS between clades 1 and 2 (i.e., separating the clades 2, 3 and 4 from the other clades), which was also reported to distinguish the "type L/S" [4, 6] ; The variants G11083T between clades 1 and 5 were first identified from the patients in Diamond Princess Cruise and were also reported by Sekizuka et al [14] . Some of the clades also provided new information on virus transmission. For example, the clades 1 and 12 were connected by a set of tightly-linked variants (C241T/C3037T/A23403G), which can cause single amino acid substitution in S protein. This mutational event coincided with the early transmission to Europe as it was first identified in a patient traveled from Shanghai to Bavaria for a business meeting of Webasto [15] . Identification of the previously reported key mutations/variants suggests that VENAS can serve as a reliable analysis pipeline for the research community. In addition, VENAS also identify novel mutations that function as key evolution markers to separate the viral clades. To compare the mutational features of viral genome during different periods of COVID-19 pandemics, we applied VENAS to analyze the cases collected in different times, resulting viral evolution networks at different pandemic stages (Fig. 3) . We found that, during the early stage of global pandemic (before April, 2020), the viral genome types showed a limited diversity and can be clustered into 16 main clades (Fig. 3a) . These clades were centered around a clade containing the earliest strain detected mainly in Wuhan (clade 1). Importantly, the branching pattern of this evolution network reflected a clear path of transmission. For example, the "North American branch" (clade 2 -> 3 -> 4) containing variation C17747T/A17858G was directed branded out from clade 2, which was also named as type S previously [4, 17] . The viruses in the "Diamond Princess branch" (clade 5 to 9), carried by most patients in the community transmission event of the cruise ship, contain a key variation (G11083T) that caused an amino acid substitution in NSP6 protein. Finally, the "European branch" containing variation A23403G/C14408T was further divided into two branches characterized by variations G25563T and G28881A/G28882A/G28883C (clade 14), which were wide spread in Europe and North/South America. In the mid-stage of the pandemic (pre-June, 2020), the viral genome types expanded into additional branches with increasing diversity, and the cases in some of the original branches showed a slower growth compared to the new branches ( Fig. 3b and Supplementary table S3) . With the global transmission of the SARS-CoV-2, we found that the original "European branch" mutated into additional genome variants, resulting in many novel types that were clustered into new clades and sub-branches. As a comparison, the original "Wuhan clade" branched slowly with a much smaller genome type diversity, probably due to successful confinement of the To meet the challenges of the massive amount of sequencing data accumulated during COVID-19 pandemic, we developed a novel method for genomic tracing of virus transmission through an integrative analysis of viral genome and epidemic information. Several new algorithms were implemented in a high-performance parallel computing environment, resulting a rapid construction of high-resolution evolution networks of SARS-Cov-2 with interactive interface. Previously, the common method to study viral genome variations is through construction of phylogenetic trees, often using a neighbor joining algorithm [22] . We made It should also be noted that the VENAS network is a non-directional acyclic graph, and thus people should combine the epidemiological information (like time of sampling) when inferring transmission path from the network. Another caveat during network interpretation is that the "variant reversion" could be mistakenly called in some areas because the neighbor-joining method tend to force a fully connected network despite missing samples. In such case, a base at the certain locus may appear to be mutated back and forth on two adjacent or nearby edges. The insufficient sampling of some specific genome types can lead to a missing node that is required to construct a coherent path of nodes reflecting the real transmission events. In such situation, the algorithm will "enforce" the network construction through a neighbor node that is the closest to the real node. The enormous amount of mutational data accumulated during viral transmission was manifested in VENAS as large community-based clades through the modularity-based community detection algorithm such as Louvain. In each topological clade identified, the central node represents the predominant genome type in the given viral community, and the genome diversity of the viral community was reflected by the numbers of the inward and outward edges. We employed the network disassortativity trimming algorithm to merge the small clades at the edge of the network with the major clades, and the resulting backbone network of the major clades can effectively reflect the major evolutionary paths and associated core variations. In conclusion, the genomic tracking is critical to understand the global transmission of the SARS-CoV-2. We developed a software platform that can handle the massive amount of viral genomic data and rapidly generate the evolution network using a highly paralleled computation protocol. The topology-based community detection and the network disassortativity trimming algorithm of this method also enabled the identification of critical viral groups that formed a backbone network of virus transmission. Using the interactive user interface of VENAS, we were able to identify several known branches in the early stage of COVID-19 pandemic. We also detected additional new branches and sub-branches that help to interpret the transmission trends of later stages. As a general platform, we expect that VENAS can serve as a valuable tool for data analysis and visualization that support virus tracing and drug development for further pandemics. Full-length viral genome sequences were collected for genomic comparison and network construction. In this study, we selected four datasets from different stages of the COVID-19 pandemic. The first dataset cited 103 SARS-CoV-2 genome sequences used in the research The calculation of Hamming distance matrices was based on the number of differences in the effective Parsimony-Informative Sites (ePISs). First, all SARS-CoV-2 sequences should be multiple-aligned using MAFFT [23] to obtain a consensus alignment in ma format. Next, the number of unambiguous bases for each variation loci in all genomes was summed up, filtered and finally obtained the ePIS list. The filtering strategy was as following: 1) The site was a parsimony-informative which contained at least two types of nucleotides, and at least two of them occur with a minimum frequency of two; 2) the PIS was effective if the number of unambiguous bases ≥ 80% of the total genomes. MAF (Minor Allele Frequency, ν ) a quantitative indicator to evaluate the rareness of variations. We calculated each ePIS and used the maximum MAF of differential ePISs between two viral genomes to adjust the Hamming distance. Since ambiguous bases such as N, -, and degenerate bases increased the uncertainty in the later computation of Hamming distances, VENAS users could use MAF to filter viral genomes with ambiguous bases in high MAF ePISs. Firstly, we gave a set of ePISs to each viral genome and used this combination of ePIS (genome type) to represent it. Therefore, the number of genomes was reduced by eliminating redundancy genome types. The MAF(ν) of each ePIS was also calculated and indexed for later queries. Then we calculated the Hamming distance (d) matrix [12] , which defined the viral phylodynamic relationship between each two genome types [24, 25] , to evaluate the evolutionary distances. The Hamming distance matrix M(d ij , ν max ij ) was formed by combining the distance values with the maximum MAF of differential ePISs (Fig. S2a ). If the pair of genome types contained ambiguous bases, the distances of those ePISs were calculated as 0. ν max ij ) (Fig. S2b) . Then pairs of genome types were connected in sequence according to the list (Fig. S2c) . At each step of the connection, we abandon the link if the two genome types (P i and P j ) associated with the d ij were already indirectly connected by other genome types (Fig. S2d) . Finally, the fully-connected evolution network G(P,d) was constructed, with the nodes representing the genome types and the edges representing the differential ePISs (Fig. S2e) . Two third-party Python libraries, networkx (https://networkx.github.io/) and CDlib420(https://github.com/GiulioRossetti/CDlib), were used in the topological clustering and backbone network extracting process once the viral genome evolution network was constructed. Firstly, we used the Louvain modularity community detection algorithm to cluster the evolution network to a set of topological clades [26] . The Louvain algorithm [27] calculated the gain in modularity [28] as an optimization target and used a heuristic method to find the community clusters. Compared to other clustering algorithms, Louvain was faster to compute on large graph networks with guaranteed accuracy. Secondly, we applied a network disassortativity trimming algorithm to extract the core viral genome types of the evolution network. A highly diverse viral genome type tended to have a higher inward/outward degree as the edges represented the possible evolutionary paths of the virus. We analyzed the statistical degree distribution of all nodes from the network. The vast majority of the nodes had a degree of 1 (Fig. S3) . Therefore, we removed the nodes with degree 1 from the network and then filtered the remaining candidate nodes using the network disassortativity trimming algorithm [29] to obtain the core viral genome types. Finally, all core viral genome types were selected to complete the identification of the backbone network. We employed the Dijkstra algorithm to find the shortest path between each pair of genome types. All nodes on the shortest paths, no matter the core node or not, were added to the list of core viral genome. Thus the complete evolutionary paths between the core viral genome types and corresponding core variations were discovered. We integrate all the major evolutionary paths to complete the identification and construction of the backbone network. Not applicable. Not applicable. VENAS is cross-platform open-source software available at https://github.com/qianjiaqiang/VENAS. It can be easily installed from source code or using pre-build binary. The authors declare that they have no competing interests. Metagenomic sequencing at the epicenter of the Nigeria 2018 Lassa fever outbreak Consortium TCSME: Molecular Evolution of the SARS Coronavirus During the Course of the SARS Epidemic in China On the origin and continuing evolution of SARS-CoV-2 Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2 / HCoV-19) using whole genomic data Phylogenetic network analysis of SARS-CoV-2 genomes pegas: an R package for population genetics with an integrated-modular approach DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets The neighbor-joining method: a new method for reconstructing phylogenetic trees Prospects for inferring very large phylogenies by using the neighbor-joining method Error Detecting and Error Correcting Codes A note on two problems in connexion with graphs Haplotype networks of SARS-CoV-2 infections in the Diamond Princess cruise ship outbreak Investigation of a COVID-19 outbreak in Germany resulting from a single travel-associated primary case: a case series Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California Comprehensive evolution and molecular characteristics of a large number of SARS-CoV-2 genomes reveal its epidemic trends Comprehensive Analyses of SARS-CoV-2 Transmission in a Public Health Virology Laboratory Making Sense of Mutation: What D614G Means for the COVID-19 Pandemic Remains Unclear An updated analysis of variations in SARS-CoV-2 genome Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility Phylogenetic classification and the universal tree MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability Inferring HIV Transmission Dynamics from Phylogenetic Sequence Relationships Unifying the Epidemiological and Evolutionary Dynamics of Pathogens Community detection algorithms: a comparative analysis Fast unfolding of communities in large networks Analysis of weighted networks Mixing patterns in networks The author thanks the GISAID EpiFlu™ Database, the NCBI Genbank Database, and all contributors for sharing SARS-CoV-2 genome sequences and corresponding metadata.