key: cord-1033816-7k38w0kj
authors: El Boujnouni, Hamoucha; Rahouti, Mohamed; El Boujnouni, Mohamed
title: Identification of SARS-CoV-2 origin: using Ngrams, Principal Component Analysis and Random Forest Algorithm
date: 2021-04-20
journal: Inform Med Unlocked
DOI: 10.1016/j.imu.2021.100577
sha: e8ee47769f55a06cd43719949db5aac63a2a26a5
doc_id: 1033816
cord_uid: 7k38w0kj

COVID-19 is an infectious disease caused by the newly discovered SARS-CoV-2 virus. This virus causes a respiratory tract infection, symptoms include dry cough, fever, tiredness and in more severe cases, breathing difficulty. SARS-CoV-2 is an extremely contagious virus that is spreading rapidly all over the world and the scientific community is working tirelessly to find an effective treatment. This paper aims to determine the origin of this virus by comparing its nucleic acid sequence with all members of the coronaviridae family. This study uses a new approach based on the combination of three powerful techniques which are: Ngrams (For text categorization), Principal Component Analysis (For dimensionality reduction) and Random Forest algorithm (For supervised classification). The experimental results have shown that a large set of SARS-CoV-2 genomes, collected from different locations around the world, present significant similarities to those found in pangolins. This finding confirms some previous results obtained by other methods, which also suggest that pangolins should be considered as possible hosts in the emergence of the new coronavirus.

A novel member of human coronavirus, newly identified in Wuhan, China, in late December 2019, officially named as SARS-CoV-2 (Severe Acute Respiratory Syndrome COronaVirus 2) by the International Committee on Taxonomy of Viruses [1] . It is a new strain of RNA viruses that has not been previously identified in humans. The disease caused by this virus was named COVID-19 (COronaVIrus Disease 2019) by the World Health Organization. Its most common symptoms are fever [2] , tiredness [3] , and dry cough [4] . Some patients may have aches and pains [4] , nasal congestion, runny nose [5] , sore throat, diarrhea [4] , or loss of taste or smell [5] . The disease can spread through the respiratory droplets produced when an infected person coughs or sneezes [6] . These droplets land on objects and surfaces around the person. Other people may acquire SARS-CoV-2 by touching these contaminated objects or surfaces, then touching their eyes, nose, or mouth [6] .

Several scientists around the world have carried out researches to fight against this virus, in particular by determining its origin(s), symptoms, causes, diagnosis, treatment, etc. Understanding the origin of this virus is a very important issue not only to identify the causes of this pandemic and to avoid future ones, but it has serious consequences on our interactions with ecosystems, on breeding of wild and domestic animals, on some laboratory practices, etc. Concerning the origin of this virus, many research papers have been published since the appearance of this pandemic. These studies, which were based on theoretical and experimental approaches, have shown various origins of SARS-COV-2. For example, Paraskevis et al. [7] found that the new corona virus is J o u r n a l P r e -p r o o f most closely related with the BatCoV RaTG13 detected in bats. In their study they used various methods and software: RDP4, Simplot v3.5.1 and phylogenetic analysis with maximum likelihood and Bayesian methods. The same result was found by Zhou et al. [8] , who showed that SARS-COV-2 is 96% identical at the whole-genome level to a bat coronavirus. They used an experimental approach based on virus isolation, cell infection, electron microscopy and neutralization assay, followed by RNA extraction and PCR (Polymerase Chain Reaction), serological test, examination of ACE2 receptor for 2019-nCoV infection. Also, High-throughput sequencing, pathogen screening and genome assembly and phylogenetic analysis were used. Another study was carried out by Luan et al. [9] and suggested that Bovidae and Cricetidae should be included in the screening of intermediate hosts for SARS-CoV-2. Their working method started by sequence analysis of ACE2 followed by structure simulation of ACE2-RBD complex using SWISS-MODEL online server and Chimera software version 1.14. In another research, Qiu et al. [10] [12] applied the sequencing of pangolins genomes followed by phylogenetic and recombination analyses. Zhang et al. [13] used genome assembly and gene prediction followed by phylogeny. and Han [14] applied phylogenitic analysis of RBD. In another research paper written by Shi et al. [15] who performed an experimental study composed of four successive steps (i) isolation of SARS-CoV-2 (ii) inoculation of SARS-CoV-2 to (ferrets, cats, dogs, pigs, chickens and ducks) (iii) quantification by PCR in different organ and tissues and finally detection of antibody against SARS-CoV-2 by ELISA and neutralization assay. The authors showed that SARS-CoV-2 replicates poorly in dogs, pigs, chickens, and ducks, but ferrets and cats are permissive to infection which can act as intermediate hosts of this virus. This paper presents a new method to ascertain the origin of SARS-CoV-2 by comparing its genomes with other viruses from Coronaviridae family [16] . The analysis is performed through three powerful techniques from machine learning and data mining that work successively: The first one is Ngrams [17]; its role is to extract relevant information from a given genome sequence and to present it in an exploitable numerical form. The second one is principal component analysis [18] , it reduces the dimensionality of the extracted information by projecting them onto a lower-dimensional space, and the last one is Random Forest (RF) algorithm [19] that will be used to classify the reduced information and to find biological homology between different sequences.

In language modeling, Ngrams [17] are sequences of characters or words extracted from a text. It can be divided into two categories: Character based and Word based. The former is a set of N consecutive characters extracted from a word. The main motivation behind this approach is that similar words will have a high proportion of Ngrams in common. The second is a set of consecutive words extracted from text. Word level Ngrams models are quite robust for modeling language statistically as well as for information retrieval. Ngrams has been applied in numerous medical and biological fields such as: Protein classification [20] , Prediction of J o u r n a l P r e -p r o o f human immunodeficiency [21] , interpreting the Hidden Information of DNA Sequences [22] , etc. Table 1 shows the result of applying Ngrams on a random sequence "TGATTTGATGCAA" with different values of 

As is known, the possible number of nitrogenous bases in a given DNA or RNA is four (A, G, C, T or A, G, C, U).Statistically, Ngrams with will give possible 3-grams and when equals to 6 the total possible 6-grams grows exponentially and becomes , and if we further increase the value of to then the number of 9-grams reaches .This rapid increase will slow down the performance of RF (next process) because this classifier must separate objects with very large numbers of attributes. The solution is to transform this large set of attributes into a smaller one that still contains most of the information. This task can be performed by a well-known technique called principal component analysis abbreviated with the acronym (PCA). It's a commonly used method for dimensionality reduction invented by Karl Pearson [18] . It projects each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible . Figure 1 describes the basic steps needed to apply PCA to a given dataset that contains samples { }. 

Among the panoply of algorithms used in machine learning we chose RF classifier to identify the origin of COVID19. This machine learning technique was first introduced by Leo Breiman [19] who was inspired by the work of Amit and Geman [23] . It is a robust algorithm that consists of a large number of individual decision trees that operate as an ensemble. A decision tree is a non-parametric supervised learning method used for classification and regression. It is a flowchart similar to a tree structure, where each internal node represents a test on an attribute, each branch represents a result of the test, and each terminal node holds a class label. The advantage of RF algorithm is a prediction by committee which is more accurate and stable than that of any individual tree. Akin to a decision tree, RF can be used for classification (the output is the mode of the predictions), regression (the output is the mean of the predictions). In bioinformatics, RF is used intensively in many fields such as: Chronological age prediction based on DNA [24] , Identification of species based on DNA barcode [25] , outcome prediction in oesophageal cancer [26] . Figure 2 shows an example of a RF that consists of decision trees used for a classification task. It can be seen from this figure that each tree produces an individual decision (Class A or B), then RF merges them together to get a more accurate and stable prediction. 

This study was conducted on a large dataset that consists of two parts: The former contains the complete coronaviridae nucleic acid sequences represented with genomes belonging to species. The second includes SARS-CoV-2 genomes collected from homo-sapiens from different locations around the world.

The genomes used in this study correspond to the firsts RNA sequences of SARS-COV-2 collected before any mutations were found. Both parts have been downloaded from the National Center for Biotechnology

Information [16] . A detailed description of the dataset used to identify SARS-CoV-2 is given in Table 2 .

J o u r n a l P r e -p r o o f 

The experimental protocol was performed through five steps as shown in figure 3 . The first step is gathering which aims to collect the genomes of all Coronaviridae viruses. The second step is a preprocessing of the set of genomes previously collected. This step starts by extracting Ngrams and their frequencies from each genome using different values of .The values are chosen in the interval { }, an example of the extraction was described in table 1. The Ngrams that are shared between different genomes will form a common basis. Next, the Ngrams representation of each genome will be expressed on this common basis. The third step is a dimensionality reduction; it begins by normalizing the dataset, then it uses principal component analysis to reduce its dimension. In this experiment, the cumulative explained variance was chosen to be less than 95%. The fourth step is an automatic learning process in which Coronaviridae family (except SARS-COV-2) will be used to train RF algorithm and to tune its hyper-parameters [27] . The last step uses the best model of RF to identify the origin of 10313 of SARS-CoV-2 samples. Figure 3 describes in detail the procedure followed to identify SARS-CoV-2. The experiment was conducted using R as a programming language and five packages namely: "seqinr" to read the nucleic acid sequences of different viruses stored in fasta format, "ngram" to construct the vectors of Ngrams from different genomes, "stats" to reduce the dimension of the vectors of Ngrams using PCA, "randomForest" to learn the dataset of coronaviridae with RF and to find the possible origin(s) of SARS-CoV-2 and "graphics" to represent graphically the statistical results. The experiments are conducted on a computer with an i5-7200U CPU @ 2.50GHz (4CPUs) having 12 GB of RAM.

J o u r n a l P r e -p r o o f Table 3 shows an example of the application of Ngrams with on SARS-COV-2. It can be seen that, when , the sequence of this virus is represented with attributes that are the occurrences of all possible 2-grams that can be found in SARS-COV-2 genome (e.g "AA","TG", etc.). Also, it can be noted that, the occurrence of a given 2-grams is very high. This can be explained by the fact that it is highly probable to find a sequence consists of only two specific nucleotides in a given genome. When increases { } the size of the vectors that represent the genomes grows and the occurrences become lower. However, the description of the nucleic acid sequence becomes more accurate. Table 4 shows the result of the application of Ngrams followed by PCA on the same SARS-COV-2 genome represented with Ngrams in Table 3 . It can be seen that PCA has successfully reduces the number of attributes obtained by Ngrams which will ease the task of RF.

J o u r n a l P r e -p r o o f 

The identification of SARS-CoV-2 origin with our method requires a preliminary step in which the optimal values of the hyperparameters of RF algorithm must be determined [27] . These values correspond to the maximum of classification accuracy of coronaviridae members ( samples). In this training step, a grid search space for tuning those hyperparameters that takes into account the influence of the number of N-grams and the effect of PCA was performed. The hyperparameters concerned are:

 ntree: The number of trees in RF algorithm, it is obvious that larger number of trees produce more stable models, but require more memory and a longer run time. This hyperparameter will be varied in the interval { }.

 mtry: Number of features (predictors) randomly sampled as candidates at each split, its default value is the round of the square root of the number of features [27] . To be more careful, mtry will be varied in an interval constructed by three values that are: (square root of total number of all predictors), (half of this square root value), and (twice of the square root value). Table 5 shows a sensitivity analysis on the controlling hyperparameters of RF algorithm taking into account the effects of in N-grams and the use or not of PCA. For each combination of parameters, the table gives the accuracy of RF to learn the coronaviridae family (except SARS-COV-2). The classification accuracy is evaluated using fold cross validation with [28] . It means that the dataset was divided randomly into ten subsets. Each part was used as a testing subset for the RF trained on other nine subsets. The average of the error terms obtained will be taken. This process will be repeated times and the final result will be reaveraged, the table also shows a crucial information which is the run-time required to search for the best hyperparameters of RF in a given intervals of ntree and mtry.

From table 5, it can be seen that the best accuracy is always achieved when ntree equals which corresponds to the maximum value in its interval of variation { }. It's an expected result because when the number of trees increases the classification results become more accurate (many trees participate in the collective decision). However, this promising result is not without downsides as it is clearly visible on the run-time column level. Besides, it can be observed that the best accuracy doesn't correspond constantly to the value by default of mtry. Consequently, it was justifiable to vary mtry in an interval. Moreover, it can be highlighted that the accuracy increases when the parameter of Ngrams ( ) increases. This can be explained by the fact that Ngrams with a large will extract more information from the sequence of genomes. But, this satisfactory result isn't without cost in terms of run-time which increases with (ntree and mtry are constant). Additionally, the effect of PCA is very remarkable. It allowed the reduction of the number of the features extracted from each genome while achieving comparable results in terms of accuracy to those found with all features (without PCA).Also, the use of PCA allows to reduce the run-time needed to tune the hyperparameters of RF.

In the next step, the best hyper-parameters found using the grid search will be used to build a robust RF to identify the origin of SARS-CoV-2.

J o u r n a l P r e -p r o o f Coronavirus HKU1 (9) . When or RF gives only one origin to samples of SARS-CoV-2 which is pangolin. This can be considered as an improvement if we assume that a one virus cannot have more than a one origin. Also, because PCA reduces the dimensionality of each genome (e.g Table 3 Vs Table 4 ) it will also reduce the run-time required to predict the origin of SARS-COV-2 with RF. In conclusion, the use of Ngrams with high values of followed by PCA and RF helps to better identify the origin of SARS-CoV-2 in a limited time. 

In this paper, a new method based one three powerful techniques that operate successfully were proposed to identify SARS-CoV-2's origin. The former is Ngram, it is widely used in text categorization; it creates feature vectors from different nucleic acid sequences. The second is principal component analysis; it's used to cleverly reduce the large quantity of information generated by Ngrams while maintaining as much of the data's variation as possible. The last technique is a supervised machine learning algorithm called RF that is proposed to classify these reduced vectors and to detect similarities between the genome of SARS-CoV-2 and other coronaviridae viruses. This experiment conducted on a large dataset of genomes using our method has shown that SARS-CoV-2 was originated from pangolins. This study confirms some previous findings that indicate that pangolins were the culprits in COVID-19 transmission to humans and refutes others that suggest other zoonotic origins like bats, ferrets, cats, etc.

On the origin and continuing evolution of SARS-CoV-2

COVID-19 presenting as acute pancreatitis

COVID-19 in Latin America: Symptoms, Morbidities, and Gastrointestinal Manifestations

Early Clinical and CT Manifestations of Coronavirus Disease 2019 (COVID-19) Pneumonia

COVID-19 and anosmia: A review based on up-to-date knowledge

Coughs and Sneezes: Their Role in Transmission of Respiratory Viral Infections, Including SARS-CoV-2

Full-genome evolutionary analysis of the novel corona virus(2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event

A pneumonia outbreak associated with a new coronavirus of probable bat origin

SARS-CoV-2 spike protein favors ACE2 from Bovidae and Cricetidae

Predicting the angiotensin converting enzyme 2 (ACE2) utilizing capability as the receptor of SARS-CoV-2. Microbes and infection

Evidence of recombination in coronaviruses implicatingpangolinorigins of nCoV-2019

Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins

Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19

Pangolins Harbor SARS-CoV-2-Related Coronaviruses

Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS coronavirus 2

N-gram-based text categorization

On Lines and Planes of Closest Fit to Systems of Points in Space

Random Forests

Protein classification using modified ngrams and skip-grams

Prediction of human immunodeficiency virus type 1 drug resistance: Representation of targetsequence mutational patterns via an n-grams approach

Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous Fast Text N-Grams

Shape quantization and recognition with randomized trees

Chronological age prediction based on DNA methylation: Massive parallel sequencing and random forest regression

Identification of species based on DNA barcode using k-mer feature vectorand Random forest classifier

Feature selection for outcome prediction in oesophageal cancer using geneticalgorithm and random forest classifier

Hyperparameters and tuning strategies for random forest, WIREs Data Mining Knowl Discov

A Study of CrossValidation and Bootstrap for Accuracy Estimation and Model Selection

The authors state that there is no conflict of interest.