key: cord-0706298-nc1sl2mv
authors: Xi, Binbin; Jiang, Dawei; Li, Shuhua; Lon, Jerome R; Bai, Yunmeng; Lin, Shudai; Hu, Meiling; Meng, Yuhuan; Qu, Yimo; Huang, Yuting; Liu, Wei; Huang, Lizhen; Du, Hongli
title: AutoVEM: an automated tool to real-time monitor epidemic trends and key mutations in SARS-CoV-2 evolution
date: 2021-04-05
journal: Comput Struct Biotechnol J
DOI: 10.1016/j.csbj.2021.04.002
sha: 2d1f20d2937be6ec02b6129ff4e5da3064934c8c
doc_id: 706298
cord_uid: nc1sl2mv

With the global epidemic of SARS-CoV-2, it is important to effectively monitor the variation, haplotype subgroup epidemic trends and key mutations of SARS-CoV-2 over time. This is of great significance to the development of new vaccines, the update of therapeutic drugs, and the improvement of detection methods. The AutoVEM tool developed in the present study could complete all mutations detections, haplotypes classification, haplotype subgroup epidemic trends and candidate key mutations analysis for 131,576 SARS-CoV-2 genome sequences in 18 hours on a 1 core CPU and 2GB RAM computer. Through haplotype subgroup epidemic trends analysis of 131,576 genome sequences, the great significance of the previous 4 specific sites (C241T, C3037T, C14408T and A23403G) was further revealed, and 6 new mutation sites of highly linked (T445C, C6286T, C22227T, G25563T, C26801G and G29645T) were discovered for the first time that might be related to the infectivity, pathogenicity or host adaptability of SARS-CoV-2. In brief, we proposed an integrative method and developed an efficient automated tool to monitor haplotype subgroup epidemic trends and screen for the candidate key mutations in the evolution of SARS-CoV-2 over time for the first time, and all data could be updated quickly to track the prevalence of previous key mutations and new candidate key mutations because of high efficiency of the tool. In addition, the idea of combinatorial analysis in the present study can also provide a reference for the mutation monitoring of other viruses.

(T445C, C6286T, C22227T, G25563T, C26801G and G29645T) were discovered for the 23 first time that might be related to the infectivity, pathogenicity or host adaptability of 24 SARS-CoV-2. In brief, we proposed an integrative method and developed an efficient 25 automated tool to monitor haplotype subgroup epidemic trends and screen for the 26 candidate key mutations in the evolution of SARS-CoV-2 over time for the first time, and 27 all data could be updated quickly to track the prevalence of previous key mutations and 28 new candidate key mutations because of high efficiency of the tool. In addition, the idea of 29 9, 16, 17] , but a reliable phylogenetic tree is relatively time-consuming and requires a 59 huge amount of computer resources because of bootstrapping, especially in the case of a 60 large number of genome sequence analyses [18] . According to the characteristics of virus 61 transmission and epidemic, if the frequency of mutant allele at a certain locus gradually 62 increases over time, it indicates that the mutant locus is likely to be related to viral 63 infectivity, pathogenicity or host adaptability [9] . With the global epidemic of SARS-CoV-2, 64

it is important to monitor the variation, haplotype subgroup epidemic trend and candidate 65 key mutations of SARS-CoV-2 effectively in real-time, which may help in the development 66 of new vaccines, update therapeutic drugs, and improve detection methods. Here we 67

presented an innovative and integrative method and an automated tool to monitor 68 haplotype subgroup epidemic trends and screen for the candidate key mutations in the 69 evolution of SARS-CoV-2 over time efficiently. This tool skips the process of using a large 70 number of genome sequences to construct the phylogenetic tree, and it can complete all 71 mutations detection, haplotypes classification, haplotype subgroup epidemic trends and 72

candidate key mutations analysis for 131,576 SARS-CoV-2 genome sequences in 18 73 hours on a computer with a single core CPU and 2GB RAM, which will play an important 74 role in monitoring the epidemic trend of SARS-CoV-2 and finding candidate key mutation 75 sites over time. AutoVEM is a highly specialized pipeline for quick monitoring the mutations and 80 haplotype subgroup epidemic trends of SARS-CoV-2 by using virus genome sequences 81 from GISAID. AutoVEM is written in Python language (Python 3.8.6) and runs on Linux 82 machines with centos. Bowtie 2 [19] , SAMtools [20] , BCFtools [21] , VCFtools [22] and 83

Haploview [23] were applied in this automated tool. 84

The genome sequences of SARS-CoV-2 were downloaded from GISAID 86 (https://www.epicov.org/) by 30 November 2020. All genome sequences should be placed 87 in a folder, which would be as input of AutoVEM. 88

The software was tested on a computer with a single core CPU and 2GB RAM running 90 CentOS 7. For faster analysis, we recommend that you do not use computers with lower 91 hardware resources. 92

Step1: Quality Control of Genome Sequences 94

The genome sequences were filtered out according to the following criteria: (1) 95 sequences less than 29,000 in length; (2) low-quality sequences with >15 unknown bases 96 and >50 degenerate bases; (3) sequences with unclear collection time information or 97 country information. 98

Each genome sequence passed the quality control was aligned to the reference 100 [19] . 101

SNVs and INDELs were called by SAMtools v1.10 (samtools sort) [20] and BCFtools 102 v1.10.2 (bcftools mpileup --fasta-ref, bcftools call --ploidy-file -vm -o) [21] , resulting in a 103

Step3: Further Quality Control of Genome Sequences. 105

INDELs of each sequence were extracted using VCFtools v0.1.16 (vcftools --vcf 106 --recode --keep-only-indels -stdout, vcftools --vcf --recode --remove-indels --stdout) [22] 107 and the sequences would be kept with the counts of INDELs ≤2. The SNVs were 108 extracted from the kept sequences in VCF file. 109

SNVs of each sequence were merged and the mutation allele frequency of each SNV 111 was calculated. The SNVs with mutation allele frequency < 5% (or defined frequency) 112 were filtered out. 113

Sequence 115

Nucleotides at specific sites with mutation allele frequency ≥ 5% (or defined 116 frequency) or given SNVs were extracted and organized in the order of genome pos ition. 117

Linkage analysis was performed and haplotypes with a frequency ≥ 1% were 119 obtained using Haploview v4.2 (java -jar Haploview.jar -n -skipcheck -pedfile -info -blocks 120 -png -out) [23] . The haplotype of each genome sequence was defined according to the 121 haplotype sequence, and it was defined as 'other' if it had a frequency < 1%. 122

Step7: Space-time Distribution Statistics of Haplotypes and Visualization 123

Sample information was captured from the annotation line in fasta format file of each 124 genome sequence, then the haplotype subgroups were organized according to the 125 country and collection date and the final results were visualized. 126

The specific mutation sites were annotated by an online tool of China National Center for 128 Bioinformation (https://bigd.big.ac.cn/ncov/online/tool/annotation?lang=en). 

The developed AutoVEM software has been shared on the website 132 

In total, 169,207 genome sequences of SARS-CoV-2 were downloaded from GISAID 142 (https://www.epicov.org/) by November 30, 2020, thereinto a total of 131,576 genome 143 sequences were passed at two steps of quality control, which were used for all 144 subsequent analysis. 145

Linkage and haplotype analysis of the previous 9 specific sites showed that the 9 147 sites were still highly linked (Fig 2) , and 9 haplotypes with a frequency ≥ 1% were found 148 and accounted for 99.16% of the total population (Table 1) . Thereinto, 6 of them were 149 found before January 23, 2020 and all of them were found before February 23, 2020 in 150 different countries (Fig 3) , which indicated the complexity of SARS-CoV-2 evolution and 151 spread at the early stage. The frequency and epidemic trend of haplotype subgroups of 152 the 9 specific sites showed that H1, H5, H7, H9, H10 and H11 were prevalent in the world 153 before November 30, 2020, and H1 with the frequency of 0.7184 is the most epidemic 154 haplotype subgroup (Table 1 , Fig 3) . However, H2 with a larger proportion and H3 and H4 155 with a smaller proportion at the early stage have almost disappeared at the present stage 156 (Fig 3) . By carefully comparing the base composition of the 9 specific sites of H1, H5, H7, 157 H9, H10 and H11 with those of H2, H3 and H4 (Table 1, Fig 3) , we still find that the 4 158 specific sites (C241T、C3037T、C14408T and A23403G) in Europe have an important 159 influence on the viral infectivity, pathogenicity or host adaptability. Among the prevalent 160 haplotype subgroups, H5, H7, H9 and H11 all had A23403G mutations, which indicated 161 that the single A23403G mutation was related to infectivity, pathogenicity or host 162 adaptability of SARS-CoV-2, while H10 had the other three specific mutations, including 163 C241T, C3037T and C14408T, indicating that the combined mutation s of these 3 sites 164 also had a certain impact on infectivity, pathogenicity or host adaptability of SARS-CoV-2. 165

However, H1 had these four mutations at the same time and showed an absolute 166 epidemic advantage, which indicated that the simultaneous mutation of these four sites 167 had a cumulative effect on infectivity, pathogenicity or host adaptability of SARS-CoV-2. 168

A total of 23 SNVs with a frequency ≥ 5% were filtered from 131,576 SARS-CoV-2 171 genomes ( Table 2 ). According to the linkage analysis of the 23 sites, it showed that not all 172 of them were highly linked (Fig S1) . The 23 haplotypes with a frequency ≥ 1% were found 173 and accounted for only 87.07% of the total population (Table S1 ). The frequency 174 distribution of 23 haplotypes was relatively dispersed and the haplotype subgroup 175 epidemic trends were complex (Table S1 , Fig S2) , which was difficult to find the regular 176

Among the 23 sites, we found that 6 mutation sites (T445C, C6286T, C22227T, 178

G25563T, C26801G and G29645T) with a frequency greater than 20% might be highly 179 linked ( Table 2 , Fig S1) . Therefore, we performed linkage analysis of the 6 sites separately, 180 and only 4 haplotypes with a frequency greater than 1% were found and accounted for 181 99.50% of the total population (Fig 4) , which suggested that the 6 sites were indeed highly 182 linked. Among these 23 sites, except for the 4 specific sites (C241T 、C3037T、C14408T 183 and A23403G) of the previous H1 haplotype with mutation frequency greater than 0.8, 184

only the above 6 mutation sites of highly linked had higher frequencies, which indicated 185 that the mutations of these 10 sites were more significant at the present stage. Therefore, 186

we constructed haplotypes by combining the previous 4 specific sites and the 6 new sites 187

to reveal the landscape of virus continuous evolution (from the early to th e present stage). 188

According to the linkage analysis and haplotype frequencies of the 10 sites (including 190 the 4 specific sites of previous haplotype H1 and the 6 new sites), 11 haplotypes with a 191 frequency ≥ 1% were found and accounted for 95.14% of the total population (Fig 5, 192 Table 3 ). Among them, the three haplotypes (H1-1, H1-2 and H1-3) derived from the 193 previous H1 accounted for 71.23% of the population (Table 3) Table 3 , Fig 6) . Besides, the G25563T mutation was also found in 199 several other haplotypes H9-2, H5-2 and H7-2 ( Table 3) In general, the haplotype subgroups at the later stage are more complex and diverse 204 than those at the earlier stage (Fig 3, Fig 6, Fig S2) , which may be related to the larger 205 population and the more complex genome diversity. In addition, we observe d that there 206 are more complex haplotypes in the United States. 207 208

According to linkage analysis and haplotype subgroup epidemic trends of the 210 previous 9 specific sites for 131,576 genome sequences, we found that the frequency of 211 H1 increased from 0.2880 (March 22, 2020), 0.4540 (April 6, 2020) and 0.6083 (May 10, 212 2020) at the early stage [9] to 0.7184 (November 30, 2020) at the present stage (Table 1) . 213

Moreover, both the single mutation of A23403G or the combined mutation s of C241T, 214

C3037T and C14408T could influence the infectivity, pathogenicity or host adaptability of 215 SARS-CoV-2, which further confirmed the 4 specific sites (C241T, C3037T, C14408T and 216 A23403G) were important in the previous H1 haplotype [9] . The mutation of A23403G 217 located in S genes resulted in amino acid change of 614D>G (Table 2) , which has been 218 proved to be related to infectivity by several in vitro experiments [10] [11] [12] [13] [14] [15] . Therefore, it is 219 strongly recommended that the A23403G mutation should be taken into account in the 220 development of SARS-CoV-2 vaccines, especially RNA vaccines. 221

Since mutations in the virus genome occur randomly, the frequencies of most 222 mutations in the population could not reach a certain level, and they should have no 223 impact on infectivity, pathogenicity or host adaptability of virus. In the present study, we 224 screened the SNVs with the mutation frequency of more than 5% and only 23 SNVs were 225 screened in 131,576 SARS-CoV-2 genomes, which indicated the random mutation 226 phenomena of SARS-CoV-2 genome. In the linkage and haplotype analysis of the 23 sites, 227

we found the haplotypes of the 23 sites were complicated and could not find a specific 228 haplotype subgroup trend (Table S1 , Fig S2) , which suggested that the selection of 5% 229 mutation frequency might not be appropriate at the present stage. Then we focused on 230 the 6 mutations (T445C, C6286T, C22227T, G25563T, C26801G and G29645T) with a 231 frequency greater than 20% among 23 sites. It was found that the 6 sites were highly 232 linked and only 4 haplotypes with a frequency greater than 1% were found and accounted 233 for 99.50% of the total population (Fig 4) . A few haplotype subgroups represent almost all 234 of the population, indicating that the linkage analysis of appropriate sites can reduce 235 haplotype subgroup complexity. Moreover, the frequencies of two mutated haplotypes 236 were 0.2160 and 0.2000 (Fig 4) , suggesting that these 6 sites might be valuable. 237

Therefore, in the practice of using our tool to screen the candidate key sites, we can 238 adjust the setting frequency according to the situation. and G29645T). Therefore, the use of linkage analysis in this study can provide some 263

highly linked co-mutation information, which combined with haplotype subgroup epidemic 264 trends over time, can provide a more comprehensive assessment of which mutations may 265 contribute significantly to infectivity, pathogenicity or host adaptability of SARS-CoV-2. In 266 addition, previous researchers identified several other mutations of N439K, Y453F and 267 N501Y in the S protein, but we did not [24, 25] . The possible reason is that the genome 268 data analyzed was different. Besides, these mutations had a frequency of less than 5% 269 according to their data [24, 25] , while our analysis filtered out mutation sites with a 270 frequency of less than 5%. Among them, the frequency of N501Y mutation seems to 271 reach 10% in the latest 28 days (13/11/2020 -10/12/2020), whether the frequency of the 272 mutation will increase over time should be monitored to further infer the significance of the 273 mutation. Our tool can be utilized to analyze any local or public SARS-CoV-2 genomes for 274 real-time monitoring virus mutations and epidemic trends. Table 1 Haplotypes and frequencies of the previous 9 specific sites Table 2 The information of the 23 sites with a mutation frequency greater than 5% Table 3 Haplotypes and frequencies of the 10 sites The numbers of haplotype subgroups of the previous 9 specific sites for 131,576 genomes with clear collection data detected in each country or region in chronological order.

Countries or regions with a total number of genomes ≤ 100 were not shown in the figures. 

COVID-19 Weekly Epidemi ological Update

Characteristics of SARS-CoV-2 and COVID-19

Phylogenetic network analysis of SARS-CoV-2 genomes

On the origin and continuing evolution of SARS-CoV-2

Emergence of genomic di versity and recurrent mutations in SARS-CoV-2

GESS: a database of global evaluation of SARS-CoV-2/hCoV-19 sequences

MicroGMT: A Mutation Tracker for SARS-CoV-2 and Other Microbial Genome Sequences. FRONT MICROBIOL 2020

CoV-GLUE: A Web Application for Tracking SARS -CoV

Genomic Variation

Comprehensi ve evol ution and molecular characteristics of a l arge number of SARS -CoV-2 genomes reveal its epidemic trends

San jana NE: The S pike D614G mutation increases SARS-CoV-2 infection of multiple human cell types

Bimodular effects of D614 G mutati on on the s pike glycoprotein of SARS-CoV-2 enhance protein processing, membrane fusion, and viral infecti vi ty

Structural Impact of Mutati on D614 G in SARS -CoV-2 S pike Protein: Enhanced Infecti vity and Therapeutic Opportunity

The Impact of Mutations i n SARS-CoV-2 S pike on Viral Infecti vi ty and Antigenicity

The D614G mutation in the SARS-CoV-2 s pike protein reduces S1 sheddi ng and increases infecti vity

Structural and Functional Analysis of the D614G SARS -CoV-2 S pike Protein Variant

A SARS-CoV-2 vaccine candi date woul d likely match all currently circulating variants

S patio-Temporal Mutational Profile Appearances of S wedish SARS -CoV-2 during the Early Pandemic

Stamatakis A: Parallel computation of phylogenetic consensus trees

Fast gapped-read alignment with B owtie 2

The Sequence Alignment/Map format and SAMtools

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter esti mation from sequencing data . BIOINFORMATICS 2011, 1. An automatic tool to quickly analyze the SARS-CoV-2 epidemic trend and key mutation sites 325 were developed

Another 6 key mutation sites of highly linkage were found through 131,576 SARS-CoV-2 327 genome sequences

Continuous evolution and epidemic trends of 131,576 SARS-CoV-2 genome were indicated. 329 Declarations of interest: none

Xi Binbin: Software, Validation, Data Curation, Visualization, Investigation, Writing -Original Writing -Review & Editing. Lin Shudai: Writing -Review & Editing. Hu Meiling: Data Curation

Funding acquisition, Supervision, Writing -Review & Editing 339