key: cord-0816917-trm2hw8f
authors: Belinsky, Alexandra; Kouzaev, Guennadi A.
title: Visual and Quantitative Analyses of Virus Genomic Sequences using a Metric-based Algorithm
date: 2022-01-12
journal: bioRxiv
DOI: 10.1101/2021.06.17.448868
sha: 71c44a82a096e7c7c32f1e2463d7904938875566
doc_id: 816917
cord_uid: trm2hw8f

This work aims to study virus RNAs using a novel accelerated algorithm that explores genomic fragments of any length in sequences using the Hamming distance between the binary-expressed RNA symbols and the characters of the explored pattern. The repetitive genomic sub-sequences found, of different lengths, were placed on one plot as genomic trajectories (walks) to increase the effectiveness of geometrical multi-scale genomic studies. Primary attention was paid to building and analysing the atg-triplet walks composing the schemes or skeletons of the viral RNAs. The 1-D distributions of these codon-starting atg-triplets were built together with the single-symbol walks for full-scale analyses. The visual examination was followed by calculating statistical parameters of genomic sequences, including estimates of the geometry deviation and fractal properties of inter-atg distances. This approach was applied to the SARS CoV-2, MERS CoV, Dengue and Ebola viruses, whose complete genomic sequences were taken from the GenBank and GISAID databases. Relative stability of these distributions was found for the SARS CoV-2 and MERS CoV viruses, unlike the Dengue and Ebola distributions, which showed increased deviation of the geometrical and fractal characteristics of their atg-distributions.

A virus is a tiny semi-live unit carrying genetic material (RNA or DNA, a double-helix structure) in a protein capsid covered by a lipid coat. The virus penetrates the cell wall and urges this bio-machine to 'manufacture' more viruses.

Some viruses are RNA-based and transfer the genetic information by long chains of four nucleotides, namely, Adenine (a), Cytosine (c), Guanine (g) and Uracil (u) [1]. DNA-based viruses and double-stranded genetic polymers also carry the information by four nucleotides, but one of them is Thymine (t) instead of Uracil. In genomic databases, however, even the single-stranded viral RNAs are registered as complementary chains where Thymine substitutes Uracil due to some instrumental specifics [2] that do not hinder the mathematical aspects of the virus theories. These complementary RNAs will be used further for numerical modelling in our paper.

A complete RNA is a chain of codons (exons), used to transfer genetic information, and introns. Unfortunately, the role of the latter is not well known [3]. Sequencing of RNA or DNA is the search for and identification of nucleotides by instrumental means. Codons in RNAs start with an 'aug' combination of nucleotides and end with one of the following three combinations: 'uaa', 'uag', or 'uga'.

Some DNA strands consist of billions of nucleotides, so mathematical methods are widely used in genomics [4]. For instance, the RNA symbols are substituted by number values, and this process is called DNA/RNA mapping [5]-[8]. For example, in Ref. [5], eleven methods of numerical representation of genomic sequences are listed and analysed to conclude that each of them is preferable in a particular application, and no universal mapping algorithm is equally advantageous for all genomic studies. Different retrieval algorithms can be applied to genomic sequences, including signal-processing means [7], [9]-[14]. The numerical RNAs can be shown graphically for qualitative analyses. For instance, each nucleotide is represented by a unit vector in a 4-dimensional (4-D) space, and an imaginary walker moves along an RNA sequence, making a trajectory in this space. To avoid apparent difficulties with plotting such 4-D walks, the nucleotides are combined in a certain way with each other [15]. For instance, each nucleotide is associated with one of four unit vectors in 2-D space, whose projections on the plane axes can take positive (+1) or negative (-1) values [16], [17]. A trajectory is built by moving along the consecutive numbers of the nucleotides in the studied genomic sequence. A one-dimensional walk is realised when purines (a or g) are associated with one step down and pyrimidines (t or c) with one step up. In general, DNA walks allow detecting codons and introns, discovering hidden RNA periodicity [12]-[14] and calculating phylogenetic distances between genomic sequences [18], among others. Some additional results and reviews on DNA imaging can be found, for instance, in Refs. [19]-[22], where the necessity to use specified walks for each class of genomic problems is shown.
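The one-dimensional walk described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in this work; the function name `one_d_walk` and the mapping table `STEP` are hypothetical, and the step convention (purines down, pyrimidines up) is the one stated in the text.

```python
from itertools import accumulate

# Step convention taken from the text: purines (a, g) give one step down,
# pyrimidines (t, c; u for RNA notation) give one step up.
STEP = {"a": -1, "g": -1, "t": +1, "c": +1, "u": +1}


def one_d_walk(sequence):
    """Return the cumulative 1-D walk for a nucleotide string."""
    steps = [STEP[symbol] for symbol in sequence.lower()]
    return list(accumulate(steps))


# Each entry is the walker's height after reading one more nucleotide.
walk = one_d_walk("atggcc")
print(walk)
```

Plotting `walk` against the nucleotide number gives the trajectory; symbols outside the four-letter alphabet raise a `KeyError` in this sketch.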

Genomic walk analysis can be followed by calculating fractal properties of distributions of nucleotides [23]-[32]. Fractals are self-similar or scale-invariant objects. It means that small 'subchain' geometry is repeated on larger geometry scales, although randomly distorted. A biopolymer chain in a solution is bent in a fractal manner [26]. This fractality influences the chemical reaction rate, diffusion and surface absorption of long-chain and globular molecules, among others.

Although many achievements are known in the numerical mapping of RNAs, some questions have not been resolved. The existing genomic codes are designed to track single nucleotides, which overloads the plots for visual analysis.

Meanwhile, the complete RNAs of viruses are composed mostly of codons, and one repetitive pattern therein is the atg-triplet. We concluded that these triplets build the viral RNA scheme or skeletons. A pattern search algorithm calculates the triplet distributions along an RNA sequence.

Additionally, the same algorithm can make the walks of each of the four nucleotides. These trajectories, being imaged by different graphical means on the same figure and equipped with interactive links to the names of genes, will make visual analyses much more effective, and the results on the creation of such a tool are given here. These codes and graphic means can be applied for research and practical applications. One of them is the study of the stability of the mentioned atg-schemes towards mutations, variation of codon fillings and the fractality of atg-distributions, among others.

In Section 2, the developed calculation algorithms and plotting techniques are considered in detail. The results of applying these techniques to the SARS CoV-2, MERS CoV, Dengue and Ebola viruses are in Section 3. They are discussed in Section 4, and conclusions are given in Section 5. The text is followed by a list of more than 70 references. In Appendix 1, all necessary data for the analysed virus RNAs taken from GenBank and GISAID are given in tabular form, including the parameters calculated in this contribution.

In this paper, complete genomic sequences taken from GenBank [39] and GISAID [40] are studied. Among them are 21 SARS CoV-2 genomic sequences from GISAID and one from GenBank, ten genomic sequences for the MERS coronavirus (GenBank), 25 genomic sequences for the Dengue virus (GenBank) and 15 Ebola virus genomic sequences (GenBank). Data from the GISAID database are available after registration. All names of genomic sequences are given in the figure legends and in Tables 1-4.

As it has been stated above, for both DNA and RNA descriptions by characters, their alphabet consists of four nucleotides. These designations are used to study RNA and DNA if their physicochemical properties are outside the research scope.

In many cases, the RNAs/DNAs have repeated patterns of nucleotide sequences, and these regions are better conserved in mutations [9] . Systems of these repeating fragments are considered as skeletons or schemes of these chains [41], [42] . As a rule, pattern discovery relates to nondeterministic polynomial time problems (NP-problems), i.e. solution time increases exponentially with the sequence length.

A typical algorithm compares a query character pattern with a length of n nucleotides with a fragment of the chain and then shifts the query by one symbol along the chain. In our code, we use this technique. However, we represent all RNA symbols by binaries before comparisons to decrease the number of processor operations. This speeds up computation 1.43-2.37 times according to the general evaluation of arithmetic operations in computers [43], [44].

In computers, for instance, the UTF-8 format allows encoding all 1,112,064 valid character code points, and it is widely used for the World Wide Web [45]. The first 128 characters (US-ASCII) require only one byte (eight bits) in this format. If the DNA sequences are represented by binary units from the start, then calculating the numerical properties of a DNA sequence in this form reduces the computation time.

Because the DNA/RNA chains are now written as binary sequences, they can be characterised quantitatively using a suitable technique. One calculates a metric distance between the binary-represented symbols and a query (base) 'moving' along a chain. Many metric types are used in codes and big data [46]-[54]. The advantage of using metric estimates is that they can be applied in cluster analysis for grouping similar nucleotide or protein distribution patterns. For instance, it can help to classify the virus RNAs [53]. In particular, this distance can be the Hamming one [46], [47], which is used further in this paper.

Consider a flow chart of our code for exploring patterns of arbitrary length (Fig. 1). [...] an operation on A and B is used, and the total number of '1's in the resultant string C is counted. If the compared binary-represented symbols are the same, the distance value is zero. Only n characters are compared on each count; then, the query is moved by one symbol, and the calculations are repeated. A string C of numerical estimates of length $M = nN$ is thus the product of step 3 of this code. Not all registered genomic sequences have a length divisible by n. In this case, the needed number of characters 'a' is added to the end of the sequence A, or the Hamming metric operation can be performed using Levenshtein's distance formula from Ref. [49], workable for compared sub-strings of arbitrary length [52].
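The comparison step can be illustrated by a minimal Python sketch, given here under stated assumptions rather than as the Matlab code of this work. The 2-bit table `CODE` and the function name `hamming_profile` are hypothetical; the exact binary representation of the symbols is not specified in the text. The bitwise XOR followed by counting the '1's is one common way to obtain a Hamming distance between binary words. Unlike the string C described above, which keeps n per-symbol estimates for every shift, this sketch sums them into one distance per query position and does not pad the end of the sequence.

```python
# Hypothetical 2-bit codes; the paper's exact binary representation
# of the nucleotides is not given in the text.
CODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}


def hamming_profile(sequence, query):
    """Hamming distance (in bits) between the query and every window."""
    seq = [CODE[s] for s in sequence.lower()]
    qry = [CODE[s] for s in query.lower()]
    n = len(qry)
    profile = []
    for start in range(len(seq) - n + 1):
        # XOR marks the differing bits; counting the '1's gives the distance.
        distance = sum(
            bin(seq[start + k] ^ qry[k]).count("1") for k in range(n)
        )
        profile.append(distance)
    return profile


profile = hamming_profile("ccatgatgc", "atg")
# Zero distance means the query coincides with the window (0-based index).
matches = [i for i, d in enumerate(profile) if d == 0]
print(profile, matches)
```

A zero in `profile` marks an exact occurrence of the query; small non-zero values mark near matches, which is what makes a metric convenient for later clustering.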

The following two parts of our algorithm calculate the numbered ($y_i$) query positions $x_i$ in the RNA sequence according to the Hamming-distance data. First, all zeros in the string C are found (step 4, Fig. 1). Then, only n neighbouring zeros corresponding to a query are selected (step 5, Fig. 1), and the queries are numbered sequentially starting with the first one found in the RNA. The positions $x_i$ of these numbered queries $y_i$ in a complete RNA sequence are calculated analytically (step 6, Fig. 1). Taking the coordinate of the first symbol of each numbered query, a set of points can be built along the studied sequence. These points, being connected, make a curve called a query walk.

In this paper, the start atg-triplet is used below as a query. We define the positions $x_i$ of the first symbols of the sequentially numbered atg-triplets in an RNA sequence A, and the atg-walks are plotted. Additionally, we calculate the word length $l^{atg}_{i,i+1} = x_{i+1} - x_i$. In our algorithm, a 'word' is a nucleotide sequence starting with an atg-triplet and including all symbols up to the next atg-triplet (Fig. 1, right part).
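As a small worked illustration of the atg-walk and the word lengths, consider the following Python sketch. It is not the authors' implementation: the function names `query_positions` and `word_lengths` are hypothetical, positions are assumed to be counted from 1, and a direct string comparison replaces the Hamming-distance step for brevity.

```python
def query_positions(sequence, query="atg"):
    """1-based positions x_i of the first symbol of every query found."""
    seq = sequence.lower()
    n = len(query)
    return [i + 1 for i in range(len(seq) - n + 1) if seq[i:i + n] == query]


def word_lengths(positions):
    """Word lengths x_{i+1} - x_i between consecutive query positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


x = query_positions("ccatgaaatgcatgtt")   # positions of the atg-triplets
y = list(range(1, len(x) + 1))            # sequential numbers of the triplets
lengths = word_lengths(x)                 # inter-atg distances (word lengths)
print(x, y, lengths)
```

The pairs `(x[i], y[i])`, connected by lines, form the atg-walk, and `lengths` is the sequence later treated as a 'signal sample' in the fractal analysis.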

The proposed algorithm was realised in the Matlab environment [56], and the code is a few tens of lines long. Matlab library functions were used, among them plot.

The developed algorithm was applied to many available virus complete genomes to validate it, and the calculated atg-positions were compared with those available from databases. 

The viral RNAs, consisting of thousands of nucleotides, are challenging to analyse, and many visualising methods are used. Among them are plotting the DNA walks projected on the spaces of 

It is necessary to see full-scale virus RNA maps and analyse all types of mutations. Previously, the most attention was paid to mapping atg-triplets, thinking that they constructed a skeleton of RNA, a relatively stable structure. Besides the structural mutations changing the atg-distributions, the nucleotides vary their positions inside codons. Our algorithm considers even a single symbol as a pattern, and it allows the calculation of distribution curves for each nucleotide similarly to atg-ones.
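The single-symbol distributions mentioned above can be sketched as follows. This Python fragment is an assumption-laden illustration, not the code of this work: the function name `symbol_walks` is hypothetical, and positions are assumed to be 1-based. Each nucleotide is treated as a one-symbol pattern, and its occurrences are numbered sequentially in the same way as the atg-triplets.

```python
def symbol_walks(sequence, alphabet="acgt"):
    """For each nucleotide, return (x, y): positions and running counts."""
    walks = {symbol: ([], []) for symbol in alphabet}
    for position, symbol in enumerate(sequence.lower(), start=1):
        if symbol in walks:
            xs, ys = walks[symbol]
            xs.append(position)   # coordinate of the symbol in the sequence
            ys.append(len(xs))    # sequential number of this occurrence
    return walks


walks = symbol_walks("atgca")
print(walks["a"])
```

Plotting the four `(x, y)` curves on the same axes as the atg-walk gives the combined, two-level picture discussed in the text.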

These curves can be considered as the first level of spatial detailing of RNAs. The words in our definitions (see Fig. 1 ) compose the second level. Some words compose a gene responsible for synthesising several proteins, and the genes belong to the third level of spatial detailisation of RNAs.

A combined plot of the elements of the hierarchical RNA organisation will be helpful in the visual analyses of RNAs and DNAs. One of the ways is shown in Fig. 3. Here, positions of the a-symbols of atg-triplets in an RNA sequence are given by vertical lines (second level of detailisation). Words occupy the spaces between these vertical lines (see Fig. 1). They are filled by numbered nucleotides (points of different colours), which are the first level of RNA detailisation. This allows distinguishing nucleotides even at the beginning of coordinates, where crowdedness is seen (inset in Fig. 3).

The next level of the hierarchical RNA organisation consists of genes. For instance, in GenBank [39], the list of symbols of an RNA sequence in FASTA format is followed by a diagram where the genes are given by horizontal bars with the genes' literal designations. In our case, this diagram can be attached to the two-scale plot considered above. Another solution is to equip our figures with gene hyperlinks, an interactive means highlighting which gene a nucleotide or codon shown by a pointer belongs to. This application for third-level visualisation is currently under development.

In many previous studies, the fractality of distributions of nucleotides along DNA/RNA sequences has been studied [6], [60]. The motifs of small-size patterns are repeated on large-scale levels. Thus, the nucleotide distribution along a genome is not entirely random due to this long-range fractal correlation, as is mentioned in many papers. The measure of self-similarity is the fractal dimension $d_F$, which can be calculated using different approaches.

The large-size genomic data are often patterned, and each pattern can have its fractal dimension, i.e. the sequences can be multifractals [31]. This effect is typical in genomics, but it is also common in the theory of nonlinear dynamical systems, signal processing and brain tissue morphology, among others [50]-[63].

Discovering the fractality of genomic sequences is preceded by their numerical representation. The resulting walk is considered a sample of a continuous function, and the methods of signal processing theory are applied [12], [13].

In our case, a 'signal sample' is a word length $l^{atg}_{i,i+1}$ (see Table 1, Appendix 1).

In this paper, the fractal dimension was calculated using the software package FracLab 2.2 [64] to compute the parameters of sequences of samples. Although many researchers have tested this code, it was verified again here on the Weierstrass function, which is synthesised according to a given value of the fractal dimension [65]. The code provides results with reasonable accuracy if the default parameters of FracLab are used.

Strictly speaking, the fractal dimension is defined for infinite sequences. In our case, the sequences have only from 266 to 730 atg-triplets depending on the virus, so the fractal dimension values were estimated approximately. Judging by the Weierstrass function calculations, the error in the worst case of the smallest atg-number can be on the order of several percent. This is acceptable for our analysis even of viruses with short RNA sequences, like Dengue (Table 3).
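A test signal of the kind used for such a verification can be synthesised as in the sketch below. It is an illustration under assumptions, not the FracLab routine: the function name `weierstrass_cosine`, the base `lam`, the number of terms and the sampling step are all hypothetical choices, and the parametrisation used in FracLab may differ. The sum $W(t) = \sum_k \lambda^{(D-2)k} \cos(\lambda^k t)$ with $1 < D < 2$ is a commonly quoted form of the Weierstrass cosine function whose graph has box-counting dimension D in the limit of infinitely many terms.

```python
import math


def weierstrass_cosine(t, fractal_dim, lam=2.0, n_terms=30):
    """Truncated sum of lam**((D - 2) * k) * cos(lam**k * t), k = 0..n_terms-1."""
    if not 1.0 < fractal_dim < 2.0:
        raise ValueError("fractal_dim must lie strictly between 1 and 2")
    return sum(
        lam ** ((fractal_dim - 2.0) * k) * math.cos(lam ** k * t)
        for k in range(n_terms)
    )


# 512 samples on [0, 1); a sample count of the same order as the
# number of atg-triplets in the studied sequences.
samples = [weierstrass_cosine(i / 512, fractal_dim=1.5) for i in range(512)]
```

Feeding `samples` to a fractal-dimension estimator and comparing the result with the prescribed value (1.5 here) gives a rough idea of the estimation error for a finite, truncated sequence.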

In this research, essential attention was paid to the SARS CoV-2 complete RNA genome sequences. A recent comprehensive review on the genomics of this virus can be found in Refs. [66] , [67] , for instance. The data used here and throughout this whole paper are from two genetic databases: GenBank [39] and GISAID [40] . A part of the studied genome sequences for this and other viruses is provided in Appendix 1.

Here, the main unit, called a 'word', is a nucleotide sequence starting with 'atg' and including all symbols up to the next starting triplet (Fig. 1, right part). The number of atgs was calculated by our code (see Table 1, Appendix 1). Although the difference between these curves is not significant, the mutations may have complicated consequences for the contagiousness of viruses.

[Figure: atg-distributions of SARS CoV-2 sequences (Table 1, Appendix 1). The insets show these atg-distributions at the beginning and end of the genome sequences.]

There are different techniques for numerically comparing sequences known from data analytics, including, for instance, calculation of correlation coefficients of unstructured data sequences, data distance values, and clustering of data, among others [10], [69]. In researching RNA sequences, we suppose that the error of nucleotide detection is substantially less than one percent; otherwise, the results of comparisons would be instrumentally noisy. If a compared sequence has several atg-triplets fewer than the reference sequence, the surplus atgs of the reference RNA are excluded from comparisons. Still, this allows obtaining some information on mutations of viruses in a straightforward and productive way, as will be seen below.

Our approach supposes choosing a reference nucleotide sequence against which the genomic virus data of other samples are compared, and it is the complete genomic sequence MN988668.1 from GenBank (row 1, Table 1). Several virus samples from GenBank and GISAID have been studied in this way [42], and some results of comparisons are given in Fig. 7. The ordinate axis $\Delta x$ in these plots shows the deviation of the coordinates $x_i$ of atg-triplets from the atg-coordinates of the reference sequence. As a rule, due to the different number of noncoding nucleotides at the beginning of complete RNA sequences, the curves in Fig. 7 [...]. Our study shows that these difference curves (Fig. 7) are individual for the studied samples. Because mutations that do not affect the atg-distributions are possible, this individuality may, theoretically, be lost.
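The comparison with a reference sequence can be sketched as follows. This Python fragment is a hedged illustration, not the code of this work: the function name `atg_deviation` and the sample data are hypothetical, the sign convention (sample minus reference) is an assumption, and the unmatched atgs are assumed to be the trailing ones, which the text does not specify.

```python
def atg_deviation(reference_positions, sample_positions):
    """Pairwise deviation of atg coordinates, truncated to the shorter list."""
    m = min(len(reference_positions), len(sample_positions))
    return [
        s - r
        for r, s in zip(reference_positions[:m], sample_positions[:m])
    ]


# Hypothetical atg coordinates, for illustration only.
reference = [3, 8, 12, 40]
sample = [5, 10, 15]
delta_x = atg_deviation(reference, sample)
print(delta_x)
```

A constant offset in `delta_x` corresponds to a different number of leading nucleotides, while a change of the offset along the sequence points to insertions or deletions between consecutive atg-triplets.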

There are repeating motifs of comparison curves (Fig. 7 and Refs. [42], [68]). The origin of this is unknown, but it was not coupled with the lineages of viruses and their clades. Other viruses can be studied similarly.

Middle East Respiratory Syndrome (MERS) is a viral respiratory illness. The virus' origin is unknown, but it initially spread through camels and was first registered in Saudi Arabia [70] and later spread around the world [71]. Most people infected with the MERS CoV virus developed a severe respiratory disease, which resulted in multiple human deaths.

Our simulation of atg-distributions of this virus shows compactness of the calculated curves (Fig. 8 and Table 2, Appendix 1), like the SARS CoV-2 characteristics. It follows that both viruses demonstrate relatively stable features towards the strong mutations connected with the recombination of the virus's parts. For instance, the divergence of these curves is estimated at around 1% only. On average, the MERS RNAs have fewer atg-triplets and longer nucleotide words than the studied SARS CoV-2 sequences.

[Fig. 8: atg-distributions of MERS CoV sequences (Table 2, Appendix 1). The insets show these atg-distributions at the beginning and end of the genome sequences. The numbers of the virus atg-curves correspond to Table 2, Appendix 1.]

In general, the two studied coronaviruses (MERS CoV and SARS CoV-2) demonstrate relatively strong stability of their atg-distributions towards severe mutations, leading to the variation of codon positions, word lengths and word numbers. This follows the conclusions of many scientists working in virology and virus genomics [72] .

The Dengue virus is spread through mosquito bites. A recent comprehensive review on the genomics of this virus can be found in Refs. [73], [74]. Unlike the coronaviruses, the Dengue virus [...] (Table 3, Appendix 1) shows the atg-distributions of five sequences of Dengue virus-1 found in China. A rather large dispersion of these sequences is seen from these graphs.

In Fig. 10A (rows 11-15, Table 3 , Appendix 1), five data sets for different strains of Dengue virus-3 registered in many countries are shown. They have about the same number of nucleotides and comparable averaged lengths of words.

In Fig. 10B (rows 16-18, Table 3, Appendix 1), three atg-distributions of a Gabon strain [74] of Dengue virus-3 are given. It is supposed that this strain mutated from the earlier registered Gabon Dengue virus lines (Fig. 10A). However, they are different in the length of complete genome sequences and their statistical characteristics, which are considered in Section 3.5 below.

[Figure: atg-distributions of Dengue sequences [...] (Table 3, Appendix 1) and Dengue-4 (B) (rows 21-25, Table 4, Appendix 1).]

A consolidated plot of all atg-curves of the Dengue RNAs studied here is shown in Fig. 12 

There are four strains of Ebola virus known in the world, although many other mutations of this virus can be found. Like the Dengue virus, the Ebola virus shows instability and an increased rate of mutations. Initially, the infection was registered in South Sudan and the Democratic Republic of the Congo, and it spreads due to contact with the body fluids of primates and humans. This fever is distinguished by a high death rate (from 25% to 90% of the infected individuals). A recent comprehensive review on the genomics of this virus can be found in Ref. [75].

The Ebola virus RNA consists of 19 000 nucleotides and more than three hundred atg-triplets. Fig. 13A shows four sequences of this virus belonging to the EBOV strain registered from Zaire and Gabon. Three of them are very close to each other, but the mutant Zaire virus (in red) has some differences from the three others. The samples collected in Sudan (SUDV) are closer to each other (Fig. 13B), but they have an increased number of atg-triplets and shorter words.

[Fig. 13: atg-distributions of Ebola virus EBOV (A) (Table 4, Appendix 1) and Ebola virus SUDV from Sudan (B) (rows 5-7, Table 4, Appendix 1).]

[...] Africa. The atg-distributions of the five RNA sequences studied here are different even visually from the two reviewed above, as seen in Fig. 14A. Another Ebola virus strain that can be compared with the one studied above is the Bundibugyo (BDBV) virus, whose three atg-distributions are shown in Fig. 14B (see Table 4, Appendix 1).

The calculated distributions are consolidated in Fig. 15 (see Table 4, Appendix 1) to compare all four strains, where, instead of points, the results are represented by thin curves to make these distributions more visible. Here, the tendency of the atg-curves to diverge and form clusters is seen. For instance, the deviation of strains estimated at $x_i = 18000$ is around 9%.

Reviewing all the results obtained above, the atg-walk is an effective visualisation tool sensitive to the viral RNA mutations connected with variation of the number of codons, the word width, and the atg-coordinates. It allows detecting the viruses with essentially unstable genomes, distinguished by the increased deviation of their atg-walks and fractal properties.

In this research, after applying the above-mentioned tool FracLab (see Section 2) [...]. The Dengue virus has five families and 47 strains; they have different atg-distributions and fractal dimensions. Some strains are close to each other according to the fractal calculations (Fig. 17). This gives a reason to conclude that the RNAs of the considered strains have similarities in the atg-distributions. The same conclusion is evident in Fig. 18, where the fractal dimensions of several strains of the Ebola virus are given. [...] $l^{atg}_{i,i+1}$ is coupled in a certain way with the fractal dimension. As a rule, a reduction of the word length increases the fractal dimension, which means a more complicated distribution of atg-triplets.

A comparative analysis of Fig. 16 shows that the RNAs of some viruses of the same strain have a somewhat stable fractal dimension value $d_F$. This is typical for the studied samples of the SARS CoV-2 and MERS CoV viruses.

It is known that the Dengue and Ebola viruses have increased mutation rates. If we follow the contemporary classification of these viruses, they have essentially different fractal dimension values even for the samples belonging to the same strain (Fig. 17, Dengue 3 and Fig. 18, Ebola) , which points to the increased variability of these viruses.

The research on the RNAs and DNAs of viruses and cellular organisms is a highly complex problem because of the many nucleotides of these organic polymers, unclear mechanisms of their synthesis and pathological mutation consequences for host organisms. Although many mathematical tools have been developed, new studies are exciting and can be fruitful.

In this paper, the viral RNAs were studied using a novel algorithm based on exploring RNA patterns of arbitrary length. One of the operations of this algorithm is the numerical mapping of RNA characters, which is performed by calculating the Hamming distance between the binary-expressed queries and RNA symbols. This allows these steps to be performed approximately twice as fast as operations with real numbers [43].

In this paper, the visual and quantitative analyses of viral RNAs were performed using a novel algorithm to calculate the RNA pattern positions in the studied sequences. A part of this code uses 

RNA sequencing and analysis

An expanding universe of noncoding RNAs

Introduction to Computational Genomics: A Case Studies Approach

Numerical representation of DNA sequences

Complex representation of DNA sequences

Conversation of nucleotides sequences into genomic signals

Vector representation and its application of DNA sequences based on nucleotide triplet codons

Introduction to Bioinformatics

Visualization for Information Retrieval

Milestones in graphical bioinformatics

Genomics and proteomics: A signal processing tour

Digital signal processing in the analysis of genomic sequences

An integrated approach for identification of exon locations using recursive Gauss-Newton tuned adaptive Kaiser window

Geometrical study of virus RNA sequences

Hardware-software co-design for decimal multiplication

Comparison between binary and decimal floatingpoint numbers

General Structure. The Unicode Standard (6.0 ed.)

Error detecting and error-correcting codes

Pulse Code Modulation Techniques

Flexible Pattern Matching in Strings: Practical Online Search Algorithms for Texts and Biological Sequences

Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady

Theory of codes with maximum rank distance

Genetic cluster analysis of SARS-CoV-2 and the identification of those responsible for the major outbreaks in various countries

Multidimensional scaling for large genomic data sets

Online Text Tools

Measuring the strangeness of strange attractors

Chaotic Dynamics of Nonlinear Systems

Nonlinear Dynamics Time Series Analyses, In: Nonlinear Biomedical Signal Processing: Dynamic Analysis and Modeling

A regularization approach to fractional dimension estimation

Signal and image processing with Fraclab

Application of Advanced Electromagnetics. Components and Systems

Does a self-similarity logic shape the organization of the nervous system? In: The Fractal Geometry of the Brain

FracLab 2.2. A fractal analysis toolbox for signal and image processing

Weierstrass cosine function (WCF)

Genetics and genomics of SARS-CoV.2: A review of the literature with the special focus on genetic diversity and SARS-CoV-2 genome detection

Phylogenic network analysis of SARS-CoV-2 genomes

The geometry of ATG-walks of the Omicron SARS CoV-2 Virus RNAs, bioArxiv preprint: bioRxiv doi

Transforming unstructured data into useful information

Enzootic patterns of Middle East respiratory syndrome coronavirus in imported African and local Arabian dromedary camels: a prospective genomic study. The Lancet Planetary Health

An infectious cDNA clone of a growth attenuated Korean isolate of MERS coronavirus KNIH002 in clade B

The coronavirus variants don't seem to be highly variable so far

Genomics, proteomics and evolution of dengue virus

Reemergence of Dengue virus serotype 3 infections in Gabon in 2016-2017, and evidence for the risk of repeated Dengue virus infections

Viral genomics in Ebola virus research

S, A.1, Illumina NovaSeq4000 Cambodia/RShSTT182/2010, A.1, (bat virus

/Austria/CeMM3224/2021, GR, B.1.1.244, Illumina NovaSeq

GV, B.1.221, Illumina MiSeq RS-00674HM_LMM52649/2020, GR, B.1.1.33, Illumina Miseq

MiSeq 20 hCoV-19/Canada/ON-NML-254107/2021, GR, BA.1 (Omicron)

MW386865.1, Dengue virus 1 isolate YNBN04

MN566112.1, Dengue virus 2 isolate

Sanger dideoxy 10721 274 28 52.67 2.45 10 MH069499.1, Dengue virus 2 strain DENV-2/VE/IDAMS/910105, Venezuela

Dengue virus 3 isolate 449686_Antioquia_CO_2015, Colombia

Sanger dideoxy sequencing 19 LC379196.1, Dengue virus 3 strain SYMAV-09/Gabon/2016 genomic RNA

Dengue virus 3 strain SYMAV-07/Gabon/2016 genomic RNA

Illumina 10649 273 26 53.12 2.09 22 MG272274.1, Dengue virus 4 isolate D4/IND/PUNE/IRSHA-FG-03 (S-49), complete genome

KU174137.1, Mutant Zaire ebolavirus isolate Ebola virus/H.sapiensrec/COD/1976/Yambuku-Mayinga-eGFP-BDBV_GP

Ebola virus strain Ebola virus/M.fasciculariswt/GAB/2001/untreated-CCL053D5

Ebola virus strain Ebola virus/M.fascicularis-wt/GAB/2001/100mg-CA470D5

Sudan ebolavirus isolate Ebola virus/H.sapiens-tc/Sudan/1976/Boniface-R4142L

Sudan ebolavirus isolate Ebola virus/H.sapiens-wt/SSD/1976/Maridi-BNI/DT

Sudan ebolavirus isolate Ebola virus

NC_039345.1, Bombali ebolavirus isolate Bombali ebolavirus/Mops condylurus/SLE/2016/PREDICT_SLAB000156

MW056492.1, Bombali ebolavirus isolate X030

MF319186.1, Bombali ebolavirus isolate Bombali virus/C.pumiluswt/SLE/2016/Northern Province-PREDICT_SLAB000047

MK340750.1, Bombali ebolavirus isolate B241

MW056493.1, Bombali ebolavirus isolate Z153

Bundibugyo ebolavirus isolate Ebola virus/H.sapienstc/Uganda

Bundibugyo ebolavirus isolate Ebola virus/H.sapienstc/Uganda/2007/Bundibugyo-R4386L

Bundibugyo ebolavirus isolate Ebola virus/H.sapienstc/Uganda

The authors thank the GenBank® [39] and GISAID [40] genetic data banks and all researchers who placed their genomic sequences in them. The online text processing service of https://onlinetexttools.com/ is appreciated.

All authors contributed equally.

The authors declare that they have no conflicts of interest that are relevant to this research paper.