key: cord-0720645-7t1iqrd1
authors: Ekpenyong, Moses Effiong; Edoho, Mercy Ernest; Inyang, Udoinyang Godwin; Uzoka, Faith-Michael; Ekaidem, Itemobong Samuel; Moses, Anietie Effiong; Emeje, Martins Ochubiojo; Tatfeng, Youtchou Mirabeau; Udo, Ifiok James; Anwana, EnoAbasi Deborah; Etim, Oboso Edem; Geoffery, Joseph Ikim; Dan, Emmanuel Ambrose
title: A hybrid computational framework for intelligent inter-continent SARS-CoV-2 sub-strains characterization and prediction
date: 2021-07-15
journal: Sci Rep
DOI: 10.1038/s41598-021-93757-w
sha: 68b16ed148973f50c56c77e2b2dc109bbeba4cb8
doc_id: 720645
cord_uid: 7t1iqrd1

Whereas intense attention clouded the early stages of the coronavirus spread, knowledge of the actual pathogenicity and origin of possible sub-strains remained unclear. By harvesting the Global initiative on Sharing All Influenza Data (GISAID) database (https://www.gisaid.org/) between December 2019 and January 15, 2021, a total of 8864 complete human SARS-CoV-2 genome sequences, processed by gender across 6 continents (88 countries) of the world, Antarctica excluded, were analyzed. We hypothesized that the data speak for themselves and can reveal true and explainable patterns of the disease. Identical genome diversity and pattern correlates analysis, performed using a hybrid of biotechnology and machine learning methods, corroborates the emergence of inter- and intra- SARS-CoV-2 sub-strains transmission and sustains an increase in sub-strains within the various continents, with nucleotide mutations varying dynamically between individuals in close association with the virus as it adapts to its host/environment. Interestingly, some viral sub-strain patterns progressively transformed into new sub-strain clusters, indicating varying amino acid and strong nucleotide associations derived from the same lineage. A novel cognitive approach to knowledge mining helped the discovery of transmission routes and a seamless contact tracing protocol. Our classification results were better than those of state-of-the-art methods, indicating a more robust system for predicting emerging or new viral sub-strain(s). The results therefore offer explanations for the growing concerns about the virus and its next wave(s). A future direction of this work is a defuzzification of confusable pattern clusters for precise intra-country SARS-CoV-2 sub-strains analytics.

However, dissimilarities among the genome sequences of early viral samples obtained from infected individuals in European, North American, Asian, and Oceanian regions 7 prompted several studies aimed at analyzing and understanding the evolutionary history of, and relationships among, the different SARS-CoV-2 strains.

SARS-CoV-2 is a β-coronavirus, an enveloped, non-segmented, positive-sense RNA virus (subgenus Sarbecovirus, subfamily Orthocoronavirinae) 8 , whose proliferation began in December 2019 in Wuhan, China. It has since been confirmed that two strains of the new coronavirus (the L- and S-strains) are spreading around the world today 9 , and the fact that the L-type is more prevalent suggests that it is "more aggressive" than the S-type. A greater proportion of research progress on SARS-CoV-2 has taken the biotechnology dimension 10, 11 , specifically focusing on species characterization and variants analysis through feature extraction. Consequently, Artificial Intelligence (AI) and Machine Learning (ML) methods are expanding biotechnology capacity into the bioinformatics realm, through intelligent genome probing for precise classification of viral features. So far, AI/ML research on SARS-CoV-2 has permeated four key areas of medicine and healthcare, namely, screening and treatment [12] [13] [14] [15] , contact tracing 16 , prediction and forecasting 17, 18 , and drug and vaccine discovery [19] [20] [21] .

To understand the origin and structure of SARS-CoV-2, a sequence of the viral genetic material is required. Sequencing of viral genomes is performed to identify regions of similarity that may have consequences for functional, structural, or evolutionary associations 22 . Furthermore, it can reveal the possibility of future health risks and vaccine remedies. Phylogenetic trees and genomic trees (the latter also referred to as hierarchical clustering) are common means of representing the genetic diversity and evolutionary relationships of sequenced genomes. While a phylogenetic tree reflects slow evolution within the genome (point mutations), hierarchical clustering describes major genetic re-arrangement events (insertions or deletions). The difficulty of converting massive amounts of complete genome sequences into meaningful biological representations has limited progress in discovering viral sub-strains and detailed transmission routes. Although numerous algorithms/tools have evolved to target specific gene sites/locations for "on-the-fly" online phylogeny representations, incomplete representation and clustering errors abound, as different genome sites undergo different evolutionary changes, resulting in disparate multi-dimensional patterns at different sites. Attempts at estimating phylogenies by comparing entire genomes have focused mainly on gene content and gene order comparisons. While early attempts concentrated on morphological characters with the premise that direct gene comparison makes more sense, modern attempts use sequences from homologous genes 22 but are burdened by the fact that a gene's evolutionary history may differ from the evolutionary history of the organism, as some genes sufficiently conserved across the species of interest may escape detection. Alignment-free genome comparison methods are therefore becoming popular 22, 23 and have evolved to cut the heavy computational requirements of traditional alignment-based methods. Randhawa et al. 24 , for instance, proposed an alignment-free approach based on ML for fast, inexpensive, taxonomic classification of complete COVID-19 genomes in real time.
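As a minimal illustration of the alignment-free idea discussed above, two sequences can be compared through their k-mer frequency profiles rather than through an alignment. This sketch is not the method of Randhawa et al. 24 ; the function names, the choice of k = 3, the cosine distance and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=3):
    """Relative frequencies of all A/C/G/T k-mers in seq.

    k-mers containing any other symbol (e.g. 'N') are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def cosine_distance(p, q):
    """1 - cosine similarity between two frequency profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical toy sequences (not real isolates).
seq_a = "ATGCGTACGTTAGCATGCGTACGTTAGC"
seq_b = "ATGCGTACGTTAGCATGCGTACGATAGC"  # one substitution relative to seq_a
seq_c = "TTTTTTTTAAAAAAAACCCCCCCCGGGG"  # unrelated composition

dist_ab = cosine_distance(kmer_profile(seq_a), kmer_profile(seq_b))
dist_ac = cosine_distance(kmer_profile(seq_a), kmer_profile(seq_c))
print(f"a vs b: {dist_ab:.4f}")
print(f"a vs c: {dist_ac:.4f}")
```

Because the profiles have a fixed length (4^k entries) regardless of genome length, such comparisons avoid the cost of pairwise alignment, at the price of discarding positional information.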

Variants of SARS-CoV-2 have emerged with reported new peaks of infection. A variant is a strain when it has a different characteristic. Variants with few mutations belong to the same lineage. Lineages are important for showing how a virus spreads through communities or populations. Interestingly, the less virulent strains are disappearing while those showing significant mutant variations prevail. A few documented cases of the spread of the viral sub-strains, by location, are as follows. In the USA, 4 sub-strains and 11 top mutations were discovered from the analysis of 12,754 complete SARS-CoV-2 genome sequences, and 2 of the 4 discovered sub-strains were potentially more infectious 25 . These sub-strains and 5 mutants were first detected in China, Singapore, and the United Kingdom 26 . In England, a sub-strain with a replicative advantage was also discovered as a variant of SARS-CoV-2, characterized by 9 spike protein mutations consisting of 3 deletions and 6 substitutions 27 . Some of these variants were prevalent in the Netherlands, Switzerland, and France. In Southwestern Wisconsin, Southeastern Minnesota, and Northeast Iowa, the sequencing of whole viral genomes of COVID-19 positive patients showed the spread of sub-strains to individuals in 13 cities from epicenters of the infection 28 . However, no viral sub-strain was observed in China 5 .

Various vaccine types are also being circulated, while conspiracy theories and disbelief about the existence of the virus spread across the globe. There is fear that emerging sub-strain variants may confer resistance to antibody neutralization, as evolving variants of concern form rapidly growing SARS-CoV-2 lineages with highly replicable mutants that may hinder the efficacy of existing vaccines and expand in response to increasing post-infection or vaccine-induced seroprevalence 27 . Currently, most COVID-19 vaccines target the viral spike protein.

Although mutations may reduce their efficacy, they do not obliterate their effects. Inactivated virus vaccines that target the whole virus have been developed in China, as the immune responses they induce target more than a single part of the spike protein; hence, inducing several protective immune responses and instilling redundancy in the protective immune responses.

Mining additional knowledge from clinical data would assist complete feature extraction, recovery of missing information, and understanding of hidden patterns, and would facilitate the labeling of output targets. Most biotechnology/bioinformatics tools are 'black boxes' and are not open to contributions from the research community, including reproducible research. Furthermore, the extracted features are too incomplete to aid meaningful knowledge integration. To support the growing field of medical- and bio-informatics, this paper adopted a novel approach to genome sequence mining. Transitions in nucleotide (dinucleotide) and changes in gene (mutation) information were exploited as input features or predictors, as these features have a direct connection with the behavior of the virus. A hierarchical agglomerative clustering method was applied to the extracted features to detect optimal natural clusters for determining the evolutionary group of the various isolates, across countries. Using a self-organizing map (SOM), genome patterns with low similarity profiles (or highly variable genomes), including the reference genome, were discerned to visually establish which sub-strain group(s) the various genome samples or isolates belong to. By decoupling the SOM map through correlation hunting, a cognitive map that associates similar isolate clusters was obtained. The generated patterns and isolate similarity information provided details for enriching the input dataset through a supervised labelling of the classification targets. Statistical analysis validated the variability of the SARS-CoV-2 isolates. This research has therefore made substantial contributions to knowledge, as it provides the following:

(i) Useful Intermediate Results. As opposed to most biotechnology and bioinformatics tools, useful intermediate results are produced in this paper to give further insights into the prevalence and transmission of SARS-CoV-2. The research is also replicable, as the algorithms and data are available to reproduce and validate our results.

(ii) Support for the Contact Tracing of Undocumented Source of Infection. Tracing the routes of infectious diseases for efficient documentation of infected cases is crucial in emerging pandemic situations. While the excavated data hold only a few traces of transmission history, our pattern clustering and cognitive knowledge mining results group the various isolates into sub-strain clusters. This information is then used to label the output targets for classification and prediction, hence providing an understanding of which of the viral sub-strain(s) maintain(s) the reference genome pattern, is/are spreading within a particular country, or has/have been acquired from a different country. Furthermore, pattern progressions indicating emerging cluster transitions are revealed by the self-organizing map deployed in this study.

(iii) Intelligent System Framework. From labelled classification targets, accurate sub-strain classification and prediction is achieved. The proposed framework combines machine learning techniques and cognitive knowledge mining to extract dinucleotide and mutation frequencies for base variant analysis. Also, hidden sub-strain interactions between nucleotide sequences and other information not hitherto seen in the raw data are uncovered.

(iv) Gender-Specific Isolates Mining. To engage in meaningful research on SARS-CoV-2, characterization and prediction by gender is crucial. This aspect, which is often missing in the literature, was excavated from GISAID. Metadata of the excavated SARS-CoV-2 genomes by gender are available (Data S7: SupplData7.xlsx). The metadata permit the intelligent mining of SARS-CoV-2 demographic information, as ambiguities in annotation labels inherent in the Global initiative on Sharing All Influenza Data (GISAID) database (https://www.gisaid.org/) have been resolved in this paper. Input features and classification target labels of unique isolates based on SOM cluster analysis and cognitive knowledge mining are also available (Data S8: SupplData8.xlsx). These resources can be integrated into expert decision-making systems to support early contact tracing and global disease surveillance.

Several studies have dwelt on the characterization of the SARS-CoV-2 genome for tracing the evolution, strains, and diversity of the virus. In Tang et al. 9 , for instance, a population genetic analysis of 103 SARS-CoV-2 genomes was performed. Their analysis revealed two dominant types of SARS-CoV-2, namely the L type (~ 70%) and the S type (~ 30%). In another study, Stefanelli et al. 7

Application of ML in the combat of COVID-19 has inspired new discoveries as well as improved methods based on experience of previous/related epidemics. Familiar areas of application center on medical imaging, disease tracing, epidemiology modeling, medicine (analysis of protein structure and drug discovery) and the virulent nature of the virus. Whereas the processing of input data for informed decision support is necessary, the types of data exploited in the case of SARS-CoV-2 and related pandemics are mainly demographic and/or control or clinical data contributed by patients/volunteers around the world. Table 1 presents a summary of works carried out on ML/AI in related areas of application, indicating the objective, number of isolates collected and data source, methods, results/findings, and drawbacks. From the related works, we observe the following: (i) Most of the works explore hybrid tools that combine biotechnology and ML/AI methodologies, which have advanced precision in approach and solution to the pandemic. (ii) While 50% of the works rely on limited genomic evidence, others are mainly simulation studies. (iii) The fulcrum of most of the works revolves around characterization and forecasting, with comparative analysis of SARS-CoV-2 evolution and of the relationship between it and (other) related viruses. (iv) All the works are silent on the gender dimension. (v) None of these works, to the best of our knowledge, has engaged the possibility of SARS-CoV-2 sub-strains discovery.

The abundance of repetitive DNA in human genome assembly has introduced large gaps of multi-megabase heterochromatic regions that challenge standard mapping and assembly algorithms. Consequently, the sequence composition and potential functions of these regions have remained largely unexplored. Furthermore, existing genome tools cannot readily engage in complete genome analysis to predict complex details and reveal the hidden patterns that are essential for explaining the increased diversity of viral diseases. This work is therefore motivated by the existing gap between scientific knowledge and clinical application. Despite current advancement in state-of-the-art predictions, the application of personalized genomics in clinical practice is yet to flourish. By 

The general workflow describing the proposed hybrid computational framework is presented in Fig. 1, and the sequence of steps implementing the workflow is given in Supplementary Table S1. In addition to describing the steps, a visual demonstration of the implementation is incorporated.

Base variant analysis. Dinucleotide transitions and nucleotide mutations were computed for male and female isolates and averaged across the various continents, namely Africa (Data S1: SupplData1.xlsx), Asia (Data S2: SupplData2.xlsx), Europe (Data S3: SupplData3.xlsx), North America (Data S4: SupplData4.xlsx), South America (Data S5: SupplData5.xlsx), and Oceania (Data S6: SupplData6.xlsx). We analyze the average base transitions and mutations, and how they influence the overall behavior of the datasets.

Dinucleotide transitions. Averages of dinucleotide transitions of SARS-CoV-2 genomes computed across the various continents are presented in Fig. 2. These transitions are represented as quadrilaterals dissected along their diagonals. Wang et al. 45 found that the SARS-CoV-2 reference genome has 29.94% of A, 32.08% of T, 19.61% of G and 18.37% of C. Hence, the expected proportion of a dinucleotide transition is the product of the proportions of its two nucleotide bases. For instance, the expected proportion of the CG dinucleotide in the viral genome is 3.60% (i.e., 19.61% × 18.37%). From this, we arrive at the following computations for the respective dinucleotides/features identified in this study: AA = 8.96%; CC = 3.37%; GG = 3.84%; TT = 10.29%; AC = 5.50%; AG = 5.87%; AT = 9.60%; CG = 3.60%; CT = 5.87%; GT = 6.29%; TG = 6.29%; TC = 5.87%; TA = 9.60%; GC = 3.60%; GA = 5.87%; and CA = 5.50%. Our results corroborate Wang et al. 45 .

Figure 1. Workflow describing the proposed hybrid approach. The workflow begins with the excavation of FASTA files of human SARS-CoV-2 genome sequences from GISAID. These files were stripped and processed into a genome database (DB) as multiple columns of nucleotide sequence. AI/ML techniques were then applied to extract knowledge from the genome datasets as follows: Using ML techniques, compute dis(similarity) scores between the various pairs of genome sequences and obtain a genomic tree of highly dis(similar) isolates grouped in the form of a dendrogram/phylogenetic tree. Determine the optimal number of natural clusters, to provide additional knowledge for supervised learning. Separate the viral sub-strains using SOM component planes, for possible transmission pathway/pattern visualization. Perform nucleotide alignment of the entire genome sequences (owing to varying sequence lengths of the different genome isolates, a cutoff at the last nucleotide of the genome isolates or the reference genome serves as the maximum pair for comparison), and remove duplicate columns while imposing a similarity threshold, to yield unique genome sequences. Extract genome features by computing dinucleotide transitions and mutation frequencies. Generate a cognitive map, for intelligent sub-strains prediction. Label classification targets of extracted features using derived SOM clusters and the cognitive map. Learn and predict new/emerging sub-strains using an ANN with the k-fold validation method.

Nucleotide mutations. Mutations in base pairs are important for understanding the pathogenicity of SARS-CoV-2. These computations were compiled after direct pairwise comparisons with the reference genome, averaged across the various continents, to produce Fig. 3. As expected, changes in base pairs were observed after pairwise comparisons. Also, genome sequences with negligible changes (or no significant mutations) from the reference genome were noticed across the various continents for male and female isolates (see Table 3). Overall, a total of 587 insignificant mutants, representing 14.98% of the total number of isolates, was observed for male patients, while female patients showed 258 insignificant mutants, representing 9.06% of the total number of isolates.
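The expected dinucleotide proportions described under Dinucleotide transitions can be sketched as a product of base proportions. The snippet below is only an illustration of that arithmetic: it assumes the bases occur independently, takes the base composition quoted from Wang et al. 45 as input, and uses hypothetical names (`BASE_PERCENT`, `expected_dinucleotide_percent`). Computed values may differ in the last digit from a hand-rounded listing.

```python
from itertools import product

# Base composition of the reference genome as quoted in the text (percent).
BASE_PERCENT = {"A": 29.94, "T": 32.08, "G": 19.61, "C": 18.37}

def expected_dinucleotide_percent(base_percent):
    """Expected percentage of each dinucleotide, assuming independent bases.

    The product of two percentages is divided by 100 to stay in percent.
    """
    return {
        first + second: base_percent[first] * base_percent[second] / 100.0
        for first, second in product(base_percent, repeat=2)
    }

expected = expected_dinucleotide_percent(BASE_PERCENT)
for pair in sorted(expected):
    print(f"{pair}: {expected[pair]:.2f}%")
```

Comparing observed dinucleotide frequencies against these expectations is one simple way to see which transitions are over- or under-represented in a given isolate.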

Average nucleotide mutations variant. In an analysis of SARS-CoV-2 mutations in the United States, CT mutant variants were found to have strong gender dependence 22 . Observed mutation variants between male and female isolates (M-F), computed from Fig. 3 across the various continents, are shown in Table 4. Positive values indicate male frequency dominance, while negative values indicate female frequency dominance.

Hierarchical clustering analysis (agglomerative nesting: AGNES). Li et al. 46 investigated angiotensin-converting enzyme 2 (ACE2), the receptor for the SARS-CoV-2 virus and a known contributor to viral infection susceptibility and/or resistance 47 . ACE2 generates small proteins by cutting up the larger protein angiotensinogen, in turn affecting the nucleotide/protein. They compared ACE2 expression levels across 31 normal human tissues between males and females and between younger and older persons using a two-sided Student's t-test. By examining the expression patterns, they found that protein expression levels were similar between males and females and between younger and older persons in the tissues examined. Furthermore, men showed worse prognosis than women. Their findings, however, lacked experimental and clinical data validation.

Using clinical evidence, we provide results of hierarchical clustering analysis to examine the arrangement of the nucleotide (protein) sequences/clusters across the entire genome through mutant accumulation, for male and female patients. Four linkage methods were tested: ward, complete, average and single. The ward method had the highest agglomerative coefficient (male = 0.9746; female = 0.9683), indicating more compact clusters, followed closely by the complete (male = 0.9579; female = 0.9523), average (male = 0.9423; female = 0.9445), and single (male = 0.8710; female = 0.9058) methods.
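The comparison of linkage methods by agglomerative coefficient can be sketched as follows. This is a hedged illustration, not the authors' implementation: it uses SciPy's `linkage` rather than an AGNES routine, adopts one common definition of the coefficient (the mean over observations of one minus the ratio of the height of an observation's first merge to the height of the final merge), and runs on synthetic data. Ward heights in particular may be scaled differently across implementations, so values need not match those reported above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def agglomerative_coefficient(features, method):
    """Mean of 1 - (first-merge height / final-merge height) over observations."""
    n = features.shape[0]
    merges = linkage(features, method=method)  # rows: left, right, height, size
    first_merge_height = np.zeros(n)
    for left, right, height, _ in merges:
        for node in (int(left), int(right)):
            if node < n:  # indices below n are original observations
                first_merge_height[node] = height
    return float(np.mean(1.0 - first_merge_height / merges[-1, 2]))

# Synthetic stand-in for a feature matrix: two compact, well-separated groups.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 4)),
    rng.normal(5.0, 0.1, size=(20, 4)),
])

coefficients = {
    method: agglomerative_coefficient(features, method)
    for method in ("ward", "complete", "average", "single")
}
for method, value in coefficients.items():
    print(f"{method}: {value:.4f}")
```

Values close to 1 indicate that observations join small, tight groups long before the final merge, which is the sense in which a higher coefficient suggests more compact clustering structure.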

To determine whether differences exist in the genome sequences between genders, an independent t-test was performed on the AGNES dis(similarity) scores. Results showed that male patients had slightly longer genome sequences (0.9726 ± 0.0377) than female patients (0.9673 ± 0.0344), although the difference was not statistically significant, t(3280) = 1.710, p = 0.0871. Accordingly, there was no statistically significant difference in mean similarity between the nucleotide (protein) structures of the two groups at the 95% confidence level; hence, no significant genetic variations were observed. This result therefore corroborates the findings in Li et al. 43 and validates the claim that no significant genetic variation exists in human SARS-CoV-2 genomes for both groups.
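As a small check on how the reported statistic is read, the two-sided p-value can be recovered from the t statistic and its degrees of freedom. The sketch below assumes a two-sided test and uses only the values quoted above; the helper name is hypothetical, and small differences from the reported p may arise from rounding of t. With the raw per-isolate scores at hand, `scipy.stats.ttest_ind` would compute both quantities directly.

```python
from scipy import stats

def two_sided_p_value(t_statistic, degrees_of_freedom):
    """Two-sided p-value for a Student's t statistic."""
    return 2.0 * stats.t.sf(abs(t_statistic), degrees_of_freedom)

p_value = two_sided_p_value(1.710, 3280)
print(f"p = {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```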

Genome pattern analysis. Component planes reveal the distribution of single feature values on a SOM map. They permit an investigation of continents that share similar variant(s) or sub-strain(s) of SARS-CoV-2 and which variant permeates the different regions. Each component plane expresses the genome pattern of an isolate, where similar nucleotides are placed closely together in a 2D space. Hence, the patterns are established based on accumulation of nucleotides rather than individual nucleotide changes. To account for the variability in SOM neighborhood structure at every SOM run, the reference genome was included as part of the experiment datasets during each training phase. Hence, 4 reference genome pattern possibilities were generated to establish the very topology suitable for the trained datasets.
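To make the notion of a component plane concrete, the following is a minimal NumPy sketch of a self-organizing map. It is not the SOM implementation used in this study: the grid size, learning-rate and neighborhood schedules, and the random stand-in data are all assumptions. After training, the slice of the weight array for one feature is that feature's component plane.

```python
import numpy as np

def train_som(data, rows=6, cols=6, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every unit, used for neighborhood distances.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighborhood
            # Best-matching unit: the unit whose weights are closest to x.
            dist = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its neighbors towards the sample.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def component_plane(weights, feature_index):
    """Values of a single feature across the map."""
    return weights[:, :, feature_index]

# Random stand-in for 60 isolates described by 16 scaled features.
rng = np.random.default_rng(1)
data = rng.random((60, 16))
weights = train_som(data)
plane = component_plane(weights, 0)
print(plane.round(2))
```

Plotting each plane as a heat map and comparing planes visually, or by correlation, is one way to see which features or isolates share similar patterns on the map.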

Our topologies possess random (but controllable) discontinuities that permit more flexible self-organization with high-dimensional data, thus preserving, as much as possible, the map structure. The SOM training was

Asia 4 1 2 4 2 3 3 0 3 3 4 2 4 2 2 2
Europe 2 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1
North America 9 2 2 10 5 5 7 1 6 5 6 4 7 3 5 5
South 3
Oceania 7 1 2 1 1 2 2 3 2 3 1 4 2

Table 5. Cluster 1 represents the reference genome. Clusters 2, 3, 4, 5 and 6 are inter-continent pattern clusters or sub-strain(s). Cluster 7 indicates discovered intra-country pattern clusters or sub-strains. The analysis of Wang et al. 22 suggests the presence of four sub-strains in the United States. Our results therefore sustain an increase in sub-strains within the various continents and offer explanations for the growing concerns and next wave(s) of the virus.

A distribution of discovered clusters (7 in this case) by gender, across the various continents under study, is presented in Table 5. Notice that cluster 7 has the highest proportion of data points, indicating increased intra-country transmissions, save for North America, where cluster 3 has the highest proportion of data points, an indication of increased inter-country transmissions. A further analysis across the continents reveals that the African, Asian, and South American isolates clustered around sub-strains G1, G2 and G5 (where G represents a generic/general sub-strain), with number of isolates and cluster proportions for male and female patients distributed as follows: Due to paucity of data, the Oceanian isolates have data for only cluster 1: 2 (M = 24.86%, F = 18.95%). Table 6 summarizes the clusters distribution, by gender, across the various continents.

Cognitive knowledge generation. While mutations are expected, there is a need to initiate a robust surveillance mechanism for continuous monitoring of public health implications and rapid response to new strains of COVID-19. To intelligently predict the viral sub-strains for both genders, novel cognitive maps that preserve chains of similar isolates were generated from the SOM component planes using the Python programming language. The extracted clusters are necessary for supervised labeling of the classification targets. By disassembling the SOM correlation hunting matrix space, we attribute these associations to disparate classes of discovered viral sub-strains. The outcomes are cognitive maps with 7 clusters simulating the discovered SOM patterns and the countries/isolates linked to these patterns for male and female patients (Supplementary Table S3). Each sub-strain cluster holds similar isolates that belong to a related pattern bounded by a certain degree of association or correlation range, established by the SOM, and captures all isolates discovered within this range. We also captured from the SOM component planes any progression in patterns showing sub-strain(s) development leading to well separated cluster image(s). The cognitive knowledge would assist early contact tracing of cases in emerging disease situations as well as establish how the reference genome has evolved over time. This additional knowledge also permits further characterization of the viral sub-strains, as our results allow the identification of unique SARS-CoV-2 base pair sequences (which do not appear in other viral sub-strains) and could be useful as baselines for designing new primers that permit further insights and examination by experts.

where x i is a nucleotide transition or mutation feature, and min(X j ) and max(X j ) are the minimum and maximum means obtained from the means of the respective nucleotide transition or mutation feature dataset. The obtained scaling prevents zero values, hence yielding an even spread of the datasets. Next, using the k-means algorithm, via the Silhouette criterion, 7 cluster groups were assigned to the records. These groups or clusters provided information for relabeling the cluster column of both datasets and constructing the output classification targets.

k-fold | Accuracy (%) | RMSE | MAE | Precision | Recall | AUC
3 | 98.5900 ± 0.7600 | 0.0500 ± 0.0200 | 0.0100 ± 0.00 | 0.9900 ± 0.0300 | 0.9700 ± 0.0400 | 1.00 ± 0.00
5 | 98.5900 ± 0.7600 | 0.0500 ± 0.0200 | 0.0100 ± 0.00 | 0.9900 ± 0.0300 | 0.9700 ± 0.0400 | 1.00 ± 0.00
10 | 98.5900 ± 0.7600 | 0.0500 ± 0.0200 | 0.0100 ± 0.00 | 0.9900 ± 0.0300 | 0.9700 ± 0.0400 | 1.00 ± 0.00
15 | 98.5900 ± 0.7600 | 0.0500 ± 0.0200 | 0.0100 ± 0.00 | 0.9900 ± 0.0300 | 0.9700 ± 0.0400 | 1.00 ± 0.00

The performance of the NN model was evaluated on the normalized, labelled datasets, using the following metrics: Classification Accuracy, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Precision, Recall and Area Under the Curve (AUC). The metric-specific results from each dataset, compared using a paired t-test, show no statistically significant difference between the male and female features (p > 0.05) at the 0.05 level of significance. Results obtained in Tables 6 and 7 confirm the suitability of ANNs in predicting COVID-19 sub-strains for male and female patients, respectively. Furthermore, near-perfect accuracies with an AUC of 1 were obtained for k = 3, 5, 10 and 15 folds.
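The k-fold evaluation scheme referred to above can be sketched in a few lines of NumPy. To keep the example short and self-contained, a nearest-centroid classifier is used as a stand-in for the neural network; it is not the model used in this study. The synthetic data (7 well-separated classes with 16 features), the helper names and the random seeds are likewise assumptions.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (not an ANN): assign each sample to the closest class mean."""
    labels = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

def cross_validate(x, y, k):
    """Return mean and standard deviation of accuracy over k folds."""
    folds = k_fold_indices(len(x), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = nearest_centroid_predict(x[train_idx], y[train_idx], x[test_idx])
        accuracies.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accuracies)), float(np.std(accuracies))

# Synthetic stand-in: 7 labelled clusters, 30 records each, 16 features.
rng = np.random.default_rng(1)
centres = rng.random((7, 16)) * 10.0
y = np.repeat(np.arange(7), 30)
x = centres[y] + rng.normal(0.0, 0.1, size=(len(y), 16))

results = {k: cross_validate(x, y, k) for k in (3, 5, 10, 15)}
for k, (mean_acc, sd_acc) in results.items():
    print(f"k = {k}: accuracy = {mean_acc:.4f} ± {sd_acc:.4f}")
```

Reporting the mean and spread over folds, as in the tables of this section, gives a sense of how stable a classifier is across different train/test partitions of the same labelled dataset.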

Using the mean squared error (MSE) function, the NN performance plots yielded the best validation performance, with RMSE values for the different k-folds, for the male and female datasets, as follows: k = 3: M = 0.06807 (Fig. 9a), F = 0.0672 (Fig. 9b); k = 5: M = 0.0601 (Fig. 9c), F = 0.0587 (Fig. 9d); k = 10: M = 0.0573 (Fig. 9e), F = 0.0521 (Fig. 9f); k = 15: M = 0.0430 (Fig. 9g), F = 0.0310 (Fig. 9h). These results indicate lower classification errors as the number of validation folds increases.

Receiver operating characteristic (ROC) curves showing the training, validation, test, and overall ROC, with k = 3, 5, 10 and 15, for male and female patients are given in Fig. 10a and b, respectively. The deployed model is helpful for classifying new datasets and for building expert support systems for efficient SARS-CoV-2 sub-strains discrimination.

In Table 8, a summary of important performance metrics extracted from the literature for ANNs with or without cross-validation is presented to enable a comparison of our approach with the state of the art. We observe that the proposed approach performed better, with very high classification accuracy, precision, and recall rates, indicating good generalization and correct prediction.

AI-based Big Data analytics are offering promising applications through the processing of large and complex datasets. In clinical diagnostics, for instance, image processing and computer vision are revolutionizing image-based diagnosis. In the field of genetics, large-scale genomic research is poised to improve care through genotype definitions of other organisms. The increased availability of multiscale, multimodal, longitudinal patient datasets has provided exclusive opportunities for individualized medicine by permitting the visualization of different patient dimensions. Although this is widely believed to enhance the performance of predictive algorithms for near-clinical practice, these data are highly unstructured and require further refinement to enable structured access and intelligent feature combination.

The future of individualized medicine is, however, subject to limitations, challenges, and biases, as machine learning models are typically sensitive to selection biases (i.e., under- or over-representation of specific patient subgroups in the training cohort, including under-explored ethical considerations), which have contributed to stifling the successful deployment of AI in medical applications, particularly those utilizing human genetics and genome datasets. Although addressing underrepresented data in training datasets can resolve bias, and model retraining can assist in improving performance, confusable symptoms relative to the disease have posed a major bottleneck for future applications. This work has created a foundation for future studies on emerging infectious diseases by investigating the variation and functions of SARS-CoV-2 genomes for possible discovery of patterns exhibited by human isolates. A novel taxonomy was created to permit intelligent feature mining. The case of symptomatic and asymptomatic patients also presents inconsistencies and is inconclusive in this paper. This aspect of infectious disease demands further research efforts on prompt detection of asymptomatic cases. A major limitation of this research is that some SOM pattern clusters remained confused and demand defuzzification using robust neuro-fuzzification tools.

Data source and genome sequences selection. Publicly available datasets of coronavirus cases around the globe deposited between December 2019 and January 15, 2021 were excavated from GISAID (https:// gisaid. org-a database of SARS-CoV-2 partial and complete genome compilations distributed by clinicians and researchers, the world over). Using the EpiCoV query interface of GISAID, complete genome sequences with patient status information (gender and age) were filtered. We observed that not all the excavated isolates met this criterion. Hence, out of about 70,000 entries, 8864 isolates (5130 male samples, and 3734 female samples) from different countries of the world contained at least the gender information, and were collected and processed, across 6 continents, Antarctica exempt (as no deposit of SARS-CoV-2 data was found as at the time of excavation). Age range of 1 month and 107 years were collected. Complete genome lengths of above 29,000 bp with < 1% undefined or ambiguous bases ('N's) or with high coverage unambiguous bases or nucleotides, were 98.5900 ± 0.7600 0.0500 ± 0.0100 0.00 ± 0.00 0.9900 ± 0.0100 1.00 ± 0.01 1.00 ± 0.00 5 98.6100 ± 0.7000 0.0500 ± 0.0100 0.0100 ± 0.00 0.9900 ± 0.0300 1.00 ± 0.01 1.00 ± 0.00 10 98.6100 ± 0.7000 0.0500 ± 0.0100 0.00 ± 0.00 0.9900 ± 0.0100 1.00 ± 0.01 1.00 ± 0.00 15 98.6100 ± 0.7000 0.0500 ± 0.0100 0.00 ± 0.00 0.9900 ± 0.0100 1.00 ± 0.01 1.00 ± 0.00 Table 9 documents the continent, isolate distribution by country, isolate distribution by gender, and total isolates excavated. Metadata on the extracted genome sequences consisting of the following columns (Isolate Code: Country + isolate number, Country, Accession Number, Gender, Age, Status, Specimen source and Additional Fast-all (FASTA) files of the genome isolates can be located at GISAID using the Accession Number. Specimen sources include swabs (nasal, oral, throat, nasal and oral); fluids (bronchoalveolar lavage, saliva, sputum, stool) and unknown. 
We observed that the GISAID database was inconsistent in rendering the patient status, as numerous incoherent annotations introduced inherent redundancy. To assist efficient documentation and processing of data, a taxonomy re-classifying the patient status is given in Fig. 11. This taxonomy subsumes the incoherent annotations (annotations in square text boxes) into unique specifications (annotations in oval shapes), for intelligent data mining 48.

The presence of ambiguous nucleotides may potentially mask the genomic signature encoded within nucleotide frequencies. Although sequencing errors in the form of ambiguous nucleotides (e.g., strings of the letter "N") were noticed in the datasets, the affected nucleotide positions were ignored during preprocessing, such that the remaining nucleotides maintained their positions and did not shift. A total genome sequence size of between 8864 × 29,000 = 257,056,000 bp and 8864 × 30,165 = 267,382,560 bp was excavated, processed, and stored in a comma separated value (CSV) file. Table 10 documents patient status statistics for symptomatic and asymptomatic cases. As observed, there are more hospitalized cases (7580) than non-hospitalized cases (391), and more of the hospitalized patients are male (M = 4318, F = 3262). Furthermore, more males died of COVID-19 than females (M = 541, F = 248). Asymptomatic cases, however, represent 0.72% (37/5130) and 1.10% (41/3734) of the total male and female isolates, respectively.
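The figures quoted in this paragraph can be checked with a few lines of Python. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
n_isolates = 8864
min_len, max_len = 29_000, 30_165       # shortest and longest genome lengths (bp)

total_min = n_isolates * min_len        # lower bound on stored bases
total_max = n_isolates * max_len        # upper bound on stored bases

males, females = 5130, 3734
asym_male_pct = 100 * 37 / males        # asymptomatic share of male isolates
asym_female_pct = 100 * 41 / females    # asymptomatic share of female isolates

print(f"{total_min:,} to {total_max:,} bp")
print(f"asymptomatic: male {asym_male_pct:.2f}%, female {asym_female_pct:.2f}%")
```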

Configuration of computing device. An HP laptop 15-bs1xx with up to 1 TB storage running Windows 10 Pro Version 10.0.18362 Build 18362 was used for processing the excavated genome sequences, algorithms/programs, and other ancillary data. The system has an installed memory (RAM) of 16 GB with the following processor configuration: 1.60 GHz, 1801 MHz, 4 core(s) and 8 logical processors. Although our system performed satisfactorily and produced the desired results, a higher system configuration would reduce computation time.

Hierarchical agglomerative clustering (HAC). The dataset is configured with observations (nucleotides) in rows and variables (genome sequences ordered by countries) in columns. The number of columns corresponds to the selected countries, while the sequences have varying lengths. The data table is then converted to a numeric matrix using R's as.matrix, with a continuous numeric value in every cell. To make the variables comparable by eliminating arbitrary variable units, they are transformed (standardized) to have a mean of zero and a standard deviation of unity 49, using Eq. (2).

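$$x_i' = \frac{x_i - \operatorname{mean}(x)}{\operatorname{sd}(x)} \qquad (2)$$

where mean(x) represents the mean and sd(x) the standard deviation of the feature values.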
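The standardization to zero mean and unit standard deviation can be sketched as follows. This is an illustrative Python version, not the authors' R code; the helper names are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed because that is what R's sd() computes.

```python
from statistics import mean, stdev

def standardize(values):
    """Return (x - mean(x)) / sd(x) for each x, using the sample standard deviation."""
    m = mean(values)
    s = stdev(values)            # n - 1 denominator
    if s == 0:
        raise ValueError("constant feature: standard deviation is zero")
    return [(x - m) / s for x in values]

def standardize_columns(rows):
    """Standardize each column (variable) of a row-major table independently."""
    columns = [standardize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

# Two variables on very different scales become comparable after scaling.
table = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = standardize_columns(table)
print(scaled)
```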

The procedure for implementing the HAC is as follows. Compute all the pairwise similarities (distances) between observations in the dataset and represent the result as a matrix. The resultant matrix is square and symmetric, with diagonal members defined as unity, the measure of similarity between an element and itself. The matrix elements are computed by iterating over each element and calculating its (dis)similarity to every other element. Suppose A is a similarity matrix of size N × N, and B a set of N elements. A_ij is the similarity between elements B_i and B_j using a specified criterion (Euclidean distance, squared Euclidean distance, Manhattan distance, maximum distance, Mahalanobis distance, or cosine similarity). The selected criterion however depends on […]

Table 9. Distribution of excavated isolates (recoverable portion).
Oceania: Guam (2), New Zealand (2), Australia (14); isolates by gender: 12 and 6; total isolates: 18.
Number of countries excavated per continent: Africa (14), Europe (28), Asia (28), South America (7), North America (8).

After computing the distance between every pair of observation points, the result is stored in a distance matrix. Then, (i) every point is put in its own cluster (i.e., the initial number of clusters corresponds to the number of variables); (ii) the closest pair of points is merged based on the distances from the distance matrix, so that the number of clusters reduces by 1; (iii) the distance between the new cluster and the previous ones is recomputed and stored in a new distance matrix; (iv) steps (ii) and (iii) are repeated until all the clusters are merged into one single cluster. The distance separating the clusters is specified via linkage methods 49, which include complete, average, single, and ward. Complete linkage takes the maximum pairwise distance between members of two clusters as the inter-cluster distance, while single linkage takes the minimum; in both cases the pair of clusters with the smallest inter-cluster distance is merged. Average linkage calculates the average distance between groups of genome sequences before merging, while ward's method minimizes the total within-cluster variance and merges the pair of clusters with the minimum between-cluster distance. We rely on all four linkage methods and adopt the one with the highest agglomerative coefficient for cluster formation. The resultant cluster solution is finally visualized as a tree structure called a dendrogram (or phylogenomic tree). As the tree is traversed upwards, observations that are similar to each other are combined into branches, which are themselves fused at a higher height. The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are. Figure 12 shows cluster plots and genomic plots generated using the ward minimum variance criterion.
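The merging procedure and the comparison of linkage methods can be sketched in plain Python. This is a naive illustration on toy 2-D points, not the authors' R (AGNES) workflow. Two assumptions are made: the ward height is taken as sqrt(2|A||B|/(|A|+|B|)) times the centroid distance, and the agglomerative coefficient is computed as the average of 1 - m(i)/h, where m(i) is the height at which observation i is first merged and h is the height of the final merge. All function names are hypothetical.

```python
import math
from itertools import combinations

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points, members):
    return [sum(points[i][d] for i in members) / len(members)
            for d in range(len(points[0]))]

def cluster_distance(points, a, b, method):
    """Distance between clusters a and b (lists of point indices)."""
    if method == "ward":
        scale = 2.0 * len(a) * len(b) / (len(a) + len(b))
        return math.sqrt(scale) * euclid(centroid(points, a), centroid(points, b))
    pairwise = [euclid(points[i], points[j]) for i in a for j in b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, method):
    """Merge the closest pair of clusters until one remains; return the merge log."""
    clusters = [[i] for i in range(len(points))]   # step (i): one cluster per point
    merges = []                                    # (members_a, members_b, height)
    while len(clusters) > 1:
        (ia, ib), height = min(
            (((i, j), cluster_distance(points, clusters[i], clusters[j], method))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        merges.append((clusters[ia], clusters[ib], height))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return merges

def agglomerative_coefficient(merges, n):
    """Average of 1 - (first merge height of i) / (final merge height)."""
    final_height = merges[-1][2]
    first = {}
    for a, b, height in merges:
        for side in (a, b):
            if len(side) == 1:
                first[side[0]] = height
    return sum(1.0 - first[i] / final_height for i in range(n)) / n

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
coefficients = {}
for method in ("single", "complete", "average", "ward"):
    merges = agglomerate(points, method)
    coefficients[method] = agglomerative_coefficient(merges, len(points))
best_method = max(coefficients, key=coefficients.get)
print(coefficients, best_method)
```

The pairwise search is cubic in the number of observations, so the sketch is only suitable for small inputs; a library implementation would be preferable for thousands of genomes.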

Optimal natural clusters selection. While some datasets contain natural structural entities that provide information on the number of clusters or classes, others, including the dataset containing genome sequences, are structured without boundaries. Cluster validation (an unsupervised methodology aimed at unravelling the actual count of clusters that best describes a dataset without any a priori class knowledge) is therefore essential. In this paper, three widely used criteria for validating the number of clusters in the genome sequence dataset, namely silhouette, elbow 50, and gap-statistics, are discussed. The three criteria aim at minimizing the total intra-cluster variation (total within-cluster sum of squares) as given in Eq. (3).

$$\text{minimize}\left(\sum_{k=1}^{K} w(c_k)\right) \qquad (3)$$

where c_k is the kth cluster, K is the number of clusters, and w(c_k) is the within-cluster variation. The total within-cluster sum of squares (wss) measures the compactness of the clustering solution. The following steps are applied to achieve the optimal clusters: (i) run a clustering algorithm (e.g., k-means clustering) for different values of k, varying k over a range of cluster numbers; (ii) for each k, calculate wss; (iii) plot the curve of wss against the number of clusters k; (iv) the location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters. The silhouette criterion validates the clustering solution using the pairwise difference between the between-cluster and within-cluster distances, and the optimal cluster number is the one that maximizes this index 51. The elbow criterion plots the explained variation as a function of the number of clusters and picks the elbow of the curve as the number of clusters to use. Gap-statistics compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data. The reference dataset is generated using Monte Carlo simulations of the sampling process.
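The elbow and silhouette criteria can be illustrated with a small standard-library Python sketch on toy 2-D points. It is not the authors' R script: the k-means routine is a plain Lloyd's algorithm with random restarts, the silhouette of a singleton cluster is set to 0 by convention, and all names and data are hypothetical. The gap statistic is omitted for brevity.

```python
import math
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: sqdist(p, centers[c]))
            for p in points]

def kmeans(points, k, seed=0, restarts=10, iterations=50):
    """Lloyd's algorithm with restarts; returns (wss, labels) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iterations):
            labels = assign(points, centers)
            updated = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                updated.append(tuple(sum(v) / len(members) for v in zip(*members))
                               if members else centers[c])
            if updated == centers:
                break
            centers = updated
        labels = assign(points, centers)
        wss = sum(sqdist(p, centers[l]) for p, l in zip(points, labels))
        if best is None or wss < best[0]:
            best = (wss, labels)
    return best

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points (requires at least 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.sqrt(sqdist(p, q)))
        own = dists.get(labels[i])
        if not own:                      # singleton cluster
            scores.append(0.0)
            continue
        a = sum(own) / len(own)          # mean distance within own cluster
        b = min(sum(v) / len(v) for l, v in dists.items() if l != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
wss, sil = {}, {}
for k in range(1, 5):
    wss[k], labels = kmeans(points, k)
    if k > 1:
        sil[k] = silhouette(points, labels)
best_k = max(sil, key=sil.get)
print(wss, sil, best_k)
```

On this toy data the wss curve drops sharply between k = 1 and k = 2 and flattens afterwards, which is the bend described in step (iv); the silhouette index peaks at the same k.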

In this paper, the k-means algorithm 52 is implemented using an R script consisting of R functions for the silhouette, elbow, and gap-statistics. The clustering solution can be visualized using the fviz_cluster function in R, to group the extracted genome sequences and finally represent the groupings in a tree format using a dendrogram. As a preprocessing step to study the phylogeny of the genome isolates, the HCA or AGNES plots as shown in […]

Table 10. Patient status statistics. Each entry is a male/female pair of counts for one of eleven patient status categories, given in the original column order (category names are not recoverable). The column totals match the figures quoted in the text: categories 1 to 8 sum to the hospitalized cases (M = 4318, F = 3262), category 9 to the non-hospitalized cases (391), category 10 to the deaths (M = 541, F = 248), and category 11 to the asymptomatic cases (M = 37, F = 41).

Continent       1          2         3        4         5       6       7       8     9         10        11
Africa          599/1039   97/63     1/0      0/0       0/0     0/0     0/0     2/0   0/0       2/1       0/0
Asia            1737/728   623/327   29/16    37/25     0/0     0/0     5/1     5/2   0/0       182/61    0/0
Europe          441/436    34/31     35/43    122/109   32/21   35/17   4/6     1/0   25/19     37/26     32/33
North America   165/123    96/61     2/0      0/0       0/0     0/0     4/3     0/0   159/120   173/62    4/6
South America   100/109    68/66     27/29    0/0       0/0     0/0     7/1     0/0   33/33     147/98    1/2
Oceania         1/2        0/0       9/4      0/0       0/0     0/0     0/0     0/0   2/0       0/0       0/0
Total           3043/2437  918/548   103/92   159/134   32/21   35/17   20/11   8/2   219/172   541/248   37/41

The frequencies of the dinucleotide transitions are obtained by accumulating each dinucleotide along the extracted genome sequences. We ignore ambiguous nucleotides absent in the reference genome. Suppose n is the total genome length. Sliding a window along the sequence one position at a time gives n − 1 bubble counts. Hence, the frequency of dinucleotide d_i can be obtained by counting all nucleotide pairs that correspond to i.
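The sliding-window count of dinucleotides can be sketched as follows. The Python below is illustrative rather than the authors' implementation: it normalizes the counts to relative frequencies and skips any window containing a character other than A, C, G or T, both of which are assumptions; the function name is hypothetical.

```python
from itertools import product

# The 16 ordered dinucleotides: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]

def dinucleotide_frequencies(sequence):
    """Relative frequency of each dinucleotide over the n - 1 overlapping windows."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(sequence) - 1):       # n - 1 windows of width 2
        pair = sequence[i:i + 2]
        if pair in counts:                   # windows with ambiguous bases are skipped
            counts[pair] += 1
    total = sum(counts.values())
    return {d: (c / total if total else 0.0) for d, c in counts.items()}

# Windows: AC, CG, GT, (TN), (NA), AC, CG -> five valid windows.
freqs = dinucleotide_frequencies("ACGTNACG")
print({d: f for d, f in freqs.items() if f > 0})
```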

Nucleotide mutation frequency. Several techniques for biological sequence alignment (multiple or pairwise) have flourished in the literature 54 and are continually being refined, but most of these techniques suffer from a lack of accuracy and partial interpretations. A direct pairwise alignment of each nucleotide with the reference genome was achieved by computing the recurrence of mutated nucleotides down the sequence line. For this study, the sequence of the established SARS-CoV-2 reference genome (NC_045512; 29,903 bp), sequenced in December 2019, was used. Suppose n represents the total length of a genome. By permitting a single sliding iteration window, a mutation may be any of the following twelve (reference → observed) pairs:

$$\{A{\to}C,\ A{\to}G,\ A{\to}T,\ C{\to}A,\ C{\to}G,\ C{\to}T,\ G{\to}A,\ G{\to}C,\ G{\to}T,\ T{\to}A,\ T{\to}C,\ T{\to}G\}$$
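Counting substitutions against the reference can be sketched in Python as below. The sketch assumes that the sample has already been placed on the same coordinates as the reference (no insertions or deletions), and it skips positions where either base is ambiguous; the function name and the `X>Y` key format are hypothetical.

```python
from itertools import permutations

# The 12 ordered (reference > observed) substitution pairs.
SUBSTITUTIONS = [f"{a}>{b}" for a, b in permutations("ACGT", 2)]

def mutation_vector(reference, sample):
    """Count each substitution type position by position; returns 12 counts."""
    counts = dict.fromkeys(SUBSTITUTIONS, 0)
    for ref_base, obs_base in zip(reference.upper(), sample.upper()):
        key = f"{ref_base}>{obs_base}"
        if key in counts:        # matches and ambiguous characters are ignored
            counts[key] += 1
    return [counts[s] for s in SUBSTITUTIONS]

reference = "ACGTACGT"
sample = "ATGTNCGA"              # C>T at position 2, T>A at position 8
vector = mutation_vector(reference, sample)
print(dict(zip(SUBSTITUTIONS, vector)))
```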

If we denote the frequency of the ith nucleotide pair as p_i, then each genomic sequence pair can be described by a 12-dimensional feature vector in the form of Eq. (7):

$$\mathbf{p} = (p_1, p_2, \ldots, p_{12}) \qquad (7)$$

Unsupervised genome clustering. Several mathematical techniques have been deployed for identifying underlying patterns in complex data. These techniques, which cluster data points differently in multidimensional space, are important for discovering fundamental patterns of gene expression inherent in data. The clustering technique adopted in this paper is the SOM, which has been used extensively in the field of bioinformatics for visual inspection of biological processes and gene expression patterns, through analysis of maps of (input) component planes. SOM is a neural network that projects data into a low-dimensional space 55, by accepting a set of input data and then mapping the data onto neurons of a 2D grid (see Fig. 13). The input nodes have p features and the output nodes q prototypes, with each prototype connected to all features. The weight vector of the connections constitutes the prototype of each neuron and has the same dimension as the input vector. For each input, the SOM algorithm locates a winning neuron and adjusts its weights and those of its neighboring neurons. Using an unsupervised, competitive learning process, SOMs produce a low-dimensional, discretized representation of the input space of training samples, known as the feature map. SOMs differ from other artificial neural networks in that they apply competitive learning, as opposed to error correction learning such as backpropagation, and in that they preserve the topological properties of the input space using a neighborhood function. During training, weights of the winning neuron and neurons in a predefined neighborhood are adjusted towards the input vector using

$$w_i(t+1) = w_i(t) + r\, f(i, q)\,\bigl(x(t) - w_i(t)\bigr), \qquad i = 1, \ldots, L$$

where w_i is the weight vector of neuron i, x is the input vector, L is the total number of neurons in the network, r is the learning rate, and f(i, q) is the neighborhood function, which has value 1 at the winning neuron q and decreases as the distance between i and q increases. At the end, the principal features of the input data are retained, hence making SOM a dimension reduction technique. The batch unsupervised weight/bias algorithm of MATLAB (trainbu), with mean squared error (MSE) performance evaluation, was adopted to drive the proposed SOM. This algorithm trains a network with weight and bias learning rules using batch updates. The training was carried out in two phases: a rough training phase with a large (initial) neighborhood radius and a large (initial) learning rate, followed by a fine-tuned training phase with a smaller radius and learning rate. The rough training phase can span any number of iterations depending on the capacity of the processing device. In this paper, we kept the number of iterations at 200, with initial and final neighborhood radii of 5 and 2, respectively, and a learning rate decreasing from 0.5 to 0.1. The fine training phase also had a maximum of 200 epochs, and a fixed learning rate of 0.2. Selection of the best centroids of the genome features within each cluster was based on the Euclidean distance criterion. The algorithm configures output vectors into a topological presentation of the original multi-dimensional data, producing a SOM in which individuals with similar features are mapped to the same map unit or nearby units, thereby creating a smooth transition from related genome sequences to unrelated genome sequences over the entire map.
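The two-phase training described above can be sketched with a small online SOM in Python. This is not the MATLAB trainbu batch algorithm used in the paper: the sketch applies the weight update one sample at a time, uses a Gaussian neighborhood (an assumption), and sets the fine-phase radius to 1 because the text does not state it. The learning rates, the rough-phase radii and the 200 iterations per phase follow the text; the grid size, data and all names are hypothetical.

```python
import math
import random

def find_bmu(weights, x):
    """Index of the winning neuron: the prototype closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def neighborhood(grid, i, q, radius):
    """Gaussian neighborhood: 1 at the winner q, decaying with grid distance."""
    d2 = (grid[i][0] - grid[q][0]) ** 2 + (grid[i][1] - grid[q][1]) ** 2
    return math.exp(-d2 / (2.0 * radius ** 2))

def update(weights, grid, x, rate, radius):
    """One step of w_i <- w_i + r * f(i, q) * (x - w_i); returns the winner q."""
    q = find_bmu(weights, x)
    for i, w in enumerate(weights):
        step = rate * neighborhood(grid, i, q, radius)
        weights[i] = [wj + step * (xj - wj) for wj, xj in zip(w, x)]
    return q

def train(data, rows, cols, phases, seed=0):
    """Train a rows x cols map; each phase is (epochs, (rate0, rate1), (rad0, rad1))."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in data[0]] for _ in grid]
    for epochs, (rate0, rate1), (rad0, rad1) in phases:
        for epoch in range(epochs):
            t = epoch / max(epochs - 1, 1)            # goes from 0 to 1 in the phase
            rate = rate0 + t * (rate1 - rate0)
            radius = rad0 + t * (rad1 - rad0)
            for x in rng.sample(data, len(data)):     # shuffled pass over the data
                update(weights, grid, x, rate, radius)
    return grid, weights

# Rough phase: rate 0.5 -> 0.1, radius 5 -> 2. Fine phase: fixed rate 0.2,
# radius 1 (assumed).
PHASES = [(200, (0.5, 0.1), (5.0, 2.0)), (200, (0.2, 0.2), (1.0, 1.0))]

data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
grid, weights = train(data, 3, 3, PHASES)
units = [grid[find_bmu(weights, x)] for x in data]    # map unit of each sample
print(units)
```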

Genome sequence transformation and low similarity profile selection. Each genome sequence is mapped into an equivalent genomic signal (a discrete numeric sequence) using the following individual nucleotide encoding: A = 1; C = 2; G = 3; T = 4. A length of above 29,000 bp is maintained in this paper as the base input vector, indicating the approximate (maximum) length of the raw SARS-CoV-2 genome sequences. Next, repeated sequences are removed using a Microsoft Excel macro that deletes duplicate columns; the macro implementing this process is found in Supplementary Table S2.

Cognitive knowledge extraction. Knowledge mining has served huge benefits for quick learning from big data. We apply Natural Language Processing of the genome datasets to extract knowledge of similar strains of the virus. A simple iteration technique is imposed on the SOM isolates (i = 1, 2, 3, ..., n), where n is the maximum number of isolates, as follows: (i) for each isolate pattern, compile similar patterns with the rest of the isolates (i.e., i + 1, i + 2, ..., n); (ii) concatenate the compiled isolate(s) into a list (j_1, j_2, …, j_m), where j is an element of the list; (iii) dump the compiled list into CogMap(k_i ∈ j_1, j_2, …, j_m). As the distance matrix is extremely high-dimensional, suitable representative sequences of the isolate clusters are decoupled into a cognitive map for labeling of the classification targets.
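The encoding, duplicate removal, and cognitive-map compilation steps can be sketched in Python as follows. This is an illustration, not the authors' Excel macro or their knowledge-mining code. Three assumptions are made: ambiguous bases are encoded as 0 so that positions do not shift, two isolates are treated as "similar" when they share the same SOM map unit, and the isolate codes and map units shown are invented for the example.

```python
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}     # encoding given in the text

def encode(sequence):
    """Map a nucleotide string to a numeric signal; ambiguous bases become 0."""
    return [ENCODING.get(base, 0) for base in sequence.upper()]

def drop_duplicates(signals):
    """Keep the first occurrence of each distinct signal, preserving order."""
    seen, unique = set(), []
    for name, signal in signals:
        key = tuple(signal)
        if key not in seen:
            seen.add(key)
            unique.append((name, signal))
    return unique

def build_cogmap(patterns):
    """For isolate i, list the later isolates (i + 1, ..., n) sharing its pattern."""
    names = list(patterns)
    cogmap = {}
    for i, name in enumerate(names):
        similar = [other for other in names[i + 1:]
                   if patterns[other] == patterns[name]]
        if similar:
            cogmap[name] = similar
    return cogmap

isolates = [("NG-1", "ACGTN"), ("GH-1", "ACGTN"), ("IN-1", "ACGTA")]
signals = drop_duplicates([(name, encode(seq)) for name, seq in isolates])

# Hypothetical SOM map units (row, column) for three isolates.
patterns = {"NG-1": (2, 3), "IN-1": (2, 3), "BR-1": (0, 1)}
cogmap = build_cogmap(patterns)
print(signals, cogmap)
```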

Neural network design. Although five core Artificial Neural Network (ANN) areas have been explored, namely Multi-Layer Perceptron, Radial Basis Network, Recurrent Neural Networks, Generative Adversarial Networks, and Convolutional Neural Networks, this paper adopts the Multi-Layer Perceptron (MLP) model, a class of feedforward ANNs with at least three layers of nodes: an input layer, a hidden layer, and an output layer (Fig. 14). Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Our output classes or classification targets (C1-C7) are derived from pattern clusters discovered from learning the SOM. A k-fold cross-validation method is adopted to divide the data into k parts. At each iteration i, the ith fold is used for testing, while the other folds are used for training. In this paper, k takes the values 3, 5, 10 and 15, yielding 60, 100, 200 and 300 calls, respectively (k folds × 20 runs), on the training and testing mode of each dataset. The k-fold cross-validation method is known to estimate the robustness of the model on new data and is used to drive the validation phase of the NN. As the model is fit on training data, a more realistic estimate of how well the model prediction will work on new cases is obtained. In the current experimental setup, twenty (20) runs of stratified k-fold cross-validation 57 are performed on the male and female datasets using a Neural Network (NN) model developed in MATLAB 2017b.

Figure 14. ANN architecture. A 3-layered network, with one output layer and one hidden layer.

The input layer receives the knowledge-enriched genome datasets, comprising extracted patterns of SOM learning of the respective genome isolates and additional knowledge sieved from analysis of the genome sequences (i.e., number of natural clusters discovered from the genomic tree, discovered SOM sub-strain clusters, and link sequences derived from cognitive maps of the various isolates).
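The repeated stratified k-fold scheme can be sketched in Python as follows. The sketch only generates the train/test index splits and counts the calls; it does not train the MLP, and it is not the authors' MATLAB code. The class sizes and function names are hypothetical. Stratification is done by shuffling each class and dealing its members round-robin across the folds.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with near-equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:                 # deal round-robin across folds
            folds[position % k].append(index)
            position += 1
    return folds

def repeated_cv_splits(labels, k, runs, seed=0):
    """Yield (run, fold, train_indices, test_indices) for every train/test call."""
    rng = random.Random(seed)
    for run in range(runs):
        folds = stratified_folds(labels, k, rng)
        for fold, test in enumerate(folds):
            train = [i for j, other in enumerate(folds) if j != fold for i in other]
            yield run, fold, train, test

# Hypothetical targets: three classes of unequal size.
labels = ["C1"] * 30 + ["C2"] * 15 + ["C3"] * 15
calls = {k: sum(1 for _ in repeated_cv_splits(labels, k, runs=20))
         for k in (3, 5, 10, 15)}
print(calls)
```

With 20 runs, the number of calls is k × 20 for each k, which reproduces the 60, 100, 200 and 300 calls quoted in the text.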

Corona virus: Global pandemic causing world-wide shutdown

COVID-19: Towards controlling of a pandemic

No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2

Mutations strengthened SARS-CoV-2 infectivity

Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment

Factors affecting COVID-19 infected and death rates inform lockdown-related policymaking

Whole genome and phylogenetic analysis of two SARS-CoV-2 strains isolated in Italy in

A novel coronavirus from patients with pneumonia in China

On the origin and continuing evolution of SARS-CoV-2

The emergence of commercial genomics: analysis of the rise of a biotechnology subsector during the Human Genome Project

Long walk to genomics: History and current approaches to genome sequencing and assembly

Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks

Automated detection of COVID-19 cases using deep neural networks with X-ray images

Combination of four clinical indicators predicts the severe/critical symptom of patients infected COVID-19

Rapid and accurate identification of COVID-19 infection through machine learning based on clinical available blood test results

Tracing Tracker: A flood of coronavirus apps are tracking us. Now it's time to keep track of them

Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil

An interpretable mortality prediction model for COVID-19 patients

Artificial intelligence approach fighting COVID-19 with repurposing drugs

Predicting commercially available antiviral drugs that may act on the novel coronavirus (SARS-CoV-2) through a drug-target interaction deep learning model

Déjà vu: Stimulating open drug discovery for SARS-CoV-2

Alignment-free sequence comparison: Benefits, applications, and tools

Alignment-free sequence comparison-a review

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID19 case study

Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants

Characterizing SARS-CoV-2 mutations in the United States

L18F substrain of SARS-CoV-2 VOC-202012/01 is rapidly spreading in England

Interregional SARS-CoV-2 spread from a single introduction outbreak in a meat-packing plant in northeast Iowa

Genomics of Indian SARS-CoV-2: Implications in genetic diversity, possible origin and spread of virus

Machine learning based approaches for detecting COVID-19 using clinical text data

Analysis of spatial spread relationships of coronavirus (COVID-19) pandemic in the world using self organizing maps

The Humanitarian Data Exchange (HDX)

Multiple ensemble neural network models with fuzzy response aggregation for predicting COVID-19 time series: the case of Mexico

Forecasting of COVID-19 time series for countries in the world based on a hybrid approach combining the fractal dimension and fuzzy logic

Design of specific primer sets for the detection of variants of SARS-CoV-2 using artificial intelligence

Accurate identification of SARS-CoV-2 from viral genome sequences using deep learning

Analysis of SARS-CoV-2 RNA-sequences by interpretable machine learning models

Analyzing hCov genome sequences: applying machine intelligence and beyond

Modeling COVID-19 epidemic in Heilongjiang Province

Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Classification of COVID-19 and other pathogenic sequences: A dinucleotide frequency and machine learning approach

Human SARS-CoV-2 has evolved to reduce CG dinucleotide in its open reading frames

Expression of the SARS-CoV-2 cell receptor gene ACE2 in a wide variety of human tissues

Structural variations in human ACE2 may influence its binding with SARS-CoV-2 spike protein

Mining the human metabolome for precision oncology research

Visual association analytics approach to predictive modelling of students' academic performance

A hybrid machine learning approach for flood risk assessment and classification

Fuzzy clustering of students' data repository for at-risks students' identification and monitoring

Unsupervised mining of under-resourced speech corpora for tone features classification

Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations

Variants of self-organizing maps

Hunting for correlations in data using the self-organizing map

Cross-validation pitfalls when selecting and assessing regression and classification models

E. 1 , provided literature materials, performed critical review as well as data validation. U.G.I. contributed to the research methodology, framework/tools design, preparation of figures, implementation, and interpretation of results. F-M.U. structurally edited the original draft and contributed to the software design component and implementation.

The authors appreciate all the anonymous reviewers whose valuable comments have greatly improved the quality of this paper.

The authors declare no competing interests.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1038/s41598-021-93757-w.

Correspondence and requests for materials should be addressed to M.E.E.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.