key: cord-0925465-pbqaqbde
authors: Zhang, Zhengjun
title: The Existence of at Least Three Genomic Signature Patterns and at Least Seven Subtypes of COVID-19 and the End of the Disease
date: 2022-05-11
journal: Vaccines (Basel)
DOI: 10.3390/vaccines10050761
sha: 890a1c11ecaca824bd490f904ce3c850cfefbf6f
doc_id: 925465
cord_uid: pbqaqbde

Hoping to find genomic clues linked to COVID-19 and end the pandemic has driven scientists’ tremendous efforts to try all kinds of research. Signs of progress have been achieved but are still limited. This paper intends to prove the existence of at least three genomic signature patterns and at least seven subtypes of COVID-19 driven by five critical genes (the smallest subset of genes) using three blood-sampled datasets. These signatures and subtypes provide crucial genomic information in COVID-19 diagnosis (including ICU patients), research focuses, and treatment methods. Unlike existing approaches focused on gene fold-changes and pathways, gene-gene nonlinear and competing interactions are the driving forces in finding the signature patterns and subtypes. Furthermore, the method leads to high accuracy with hospitalized patients, showing biological and mathematical equivalences between COVID-19 status and the signature patterns and a methodological advantage over other methods that cannot lead to high accuracy. As a result, as new biomarkers, the new findings and genomic clues can be much more informative than other findings for interpreting biological mechanisms, developing the second (third) generation of vaccines, antiviral drugs, and treatment methods, and eventually bringing new hopes of an end to the pandemic.

Since the virus SARS-CoV-2 was first reported in December 2019, numerous research results related to the virus and COVID-19 disease have been published. Scientists have put tremendous effort into trying to find genomic clues linked to COVID-19. However, knowledge of COVID-19 is still limited, and many problems have remained unanswered [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] . As a result, many published results have not guaranteed convincing accuracy. With an exception, a data science study by reported five critical genes, and their combined effects can accurately classify COVID-19 patients and COVID-19 free patients into their respective groups and further classify COVID-19 patients into seven subtypes [2] .

The analytical method used in this study has been proven a powerful approach in earlier studies where breast cancer patients, colorectal cancer patients, and lung cancer patients are again classified into their respective groups (nearly) without errors among eleven different study cohorts with thousands of patients [2, [13] [14] [15] . For example, in a breast cancer study, eight widely known genes-BRCA1, BRCA2, PALB2, BARD1, RAD51C, RAD51D, ATM-were proved to have low efficacy in terms of diagnosis [13] . In addition, a colorectal cancer study showed CXCL8 alone could predict early-stage colorectal cancer accurately [14] . It was surprising that such a unique feature has been missed in the literature using other existing analytical methods.

This paper intends to prove the existence of at least three genomic signature patterns and at least seven subtypes of COVID-19 driven by five critical genes (the smallest subset of genes). For this purpose, we are going to advance further the signature patterns found

This methodological section briefly introduces max-linear competing factor classifiers for self-contained due to different data structures used in this work compared with other research, whose details for data structures can be found in the papers [2, [13] [14] [15] .

Suppose Y i is the ith individual patient's COVID-19 status (Y i = 0 as not infected, Y i = 1 for infected) and X

ip , k = 1, . . . , K, are the gene expression values with p = 19, 472 genes in this study. Here k stands for the kth type of gene expression level drawn based on K different biological sampling methodologies. Note that most published works set K = 1, and hence, the superscript (k) can be dropped from the predictors. In this paper, K = 2 (TPM and EC). Using a logit link (or other monotone links), we can model the risk probability p (k) i of the ith person's infection status as:

alternatively, we write

i is a 1 × p observed vector, and β (k) is a p × 1 coefficient vector which characterizes the contribution of each predictor (gene in this study) to the risk.

Considering there have been several variants of SARS-CoV-2 and multiple symptoms (subtypes) of COVID-19 diseases, it is natural to assume that the genomic structures of all subtypes could be different. Suppose that all subtypes of COVID-19 diseases may be related to G groups of genes

, . . . , X (k) i,j g j , j = 1, . . . , G, g j ≥ 0, k = 1, . . . , K

where i is the ith individual in the sample, and g j is the number of genes in jth group. Note that we do not use the widely used gene pathways in our newly developed machine learning algorithm. It is possible that blind pursuit of gene pathways may lead to wrong directions and lose chances of finding the scientific truth. Instead, our methodological approach will automatically find the smallest numbers of G and g j to reach 100% accuracy, and as a result, better interpretations can be achieved. The competing (risk) factor classifier is defined as

where β

ij is a 1 × g j observed vector, β (k) j is a g j × 1 coefficient vector which characterizes the contribution of each predictor in the j group to the risk.

i.e., all components compete to take the most significant effect.

Taking β (k) 0j = −∞, j = 2, . . . , G, (3) is reduced to the classical logistic regression, i.e., the classical logistic regression is a special case of the new classifier. Compared with black-box machine learning methods (e.g., random forest, deep learning (convolutional) neural networks (DNN, CNN)) and regression tree methods, each competing risk factor in (3) forms a clear, explicit, and interpretable signature with the selected genes. The number of factors corresponds to the number of signatures, i.e., G. This model can be a bridge between linear models and more advanced machine learning methods (black box) models. However, (3) remains the desired properties of interpretability, computability, predictability, and stability. Note that this remark is similar to Remark 1 in Zhang (2021) [15] .

In practice, we have to choose a threshold probability value to decide a patient's class label. Following the general trend in the literature, we set the threshold to be 0.5. As such, if p (k) i ≤ 0.5, the ith individual is classified as disease-free; otherwise, the individual is classified to have the disease.

With the above-established notations, we introduce a new machine learning classifier, smallest subset and smallest number of signatures (S4), for K = 1 to K = 2 as follows.

is the index set of all genes, S j = j j1 , . . . , j j,g j , j = 1, . . . , G are index sets corresponding to (2), S u is the union of S j , j = 1, . . . , G , |S u | is the number of elements in S u , λ 1 ≥ 0 and λ 2 ≥ 0 are penalty parameters, andŜ = j j1 , . . . , j j,g j , j = 1, . . . ,Ĝ andĜ are the final gene set selected in the final classifiers and the number of final signatures.

Remark 3. The case of K = 1 corresponds to the classifier introduced in Zhang (2021) [15] . The case of K = 1 and λ 2 = 0 corresponds to the classifier introduced in Zhang (2021) [2] .

The following Proposition 1 mathematically proves the existence of desired solutions. The proof follows the lines in the work by extending K = 1 to K = 2 [15] .

) can reach is m. Then, for suitable choices of λ 1 ≥ 0 with λ 1 + |S u | > 0 and λ 2 ≥ 0, the new classifier S4 will lead to the smallest |S u | and the smallest number of G such that ∆ = m.

A perfect classifier (100% sensitivity and 100% specificity) will have m = 0 in Equation (4) , which is the case in our study. We note that only with m = 0 and the smallest subset of genes, the mathematical and biological equivalence between the disease and the selected genes can be established.

The COVID-19 data to be analyzed is publicly available as GSE157103: large-scale multi-omic analysis of COVID-19 severity, public on 29 August 2020 [1] . The experiment type is "Expression profiling by high throughput sequencing". One hundred and twenty-six samples were analyzed in total, with 100 COVID-19 hospitalized patients and 26 hospitalized non-COVID-19 patients. There are two types of datasets available. One type is TPM (transcripts per million), while another type is EC (expected counts). The prior analysis outcome of TPM data was reported in Zhang (2021) [2] . This new study targets EC data and makes comparisons to TPM data. We note that among 100 COVID-19 patients, 50 are ICU patients and 50 are non-ICU hospitalized patients. Among 26 COVID-19 free patients, 16 of them are ICU patients with other types (non-COVID-19) of disease, and 10 of them are non-ICU patients with other types of disease.

The third set of COVID-19 data to be analyzed is publicly available as GSE152418: RNAseq analysis of PBMCs in a group of 17 COVID-19 subjects and 17 healthy controls, public on 13 June 2020 and last updated on 20 May 2021 [16] . Illumina bcl2fastq v2.17.1.14 was used for demultiplexing. Reads were mapped to the human (GRCh38 Ensemble release 100) genomic reference with STAR (v2.7.3a) with default alignment parameters. Abundance estimation of raw read counts per transcript was done internally with STAR using the algorithm of htseq-count.

In the following subsections, we first focus on the first two datasets: TPM and EC.

Solving the optimization problem (4) among 19,472 genes with K = 2 (k = 1 for TPM data, and k = 0 for EC data) using the Monte Carlo search method, we obtain the following classifiers with five critical genes (ABCB6 (ATP binding cassette subfamily B member 6-Langereis blood group), KIAA1614 (uncharacterized protein), MND1 (meiotic nuclear divisions 1), SMG1 (nonsense mediated mRNA decay associated PI3K related kinase), RIPK3 (receptor interacting serine/threonine kinase 3)):

where Equation (5) is for TPM data which were first reported in Zhang (2021) [2] , while Equation (6) , i = I, II, III

and the risk probabilities based on all three component classifiers together are

Similarly, the risk probabilities calculated from EC are

, i = I, II, III (9) and the risk probabilities based on all three component classifiers together are P − (EC) = exp(CF(EC)) 1 + exp(CF(EC)) . Table 1 presents the performance (accuracy, sensitivity, specificity) of each individual classifier in Equations (5) and (6). From Table 1 , we can see that the coefficients of MND1 are different between CF-I and CF-II, and the signs of SMG1 are different. This phenomenon tells us that the function of each gene and its contribution to the risk probability depends on other factors (genes) in the combination, i.e., there are gene-gene interactions and gene-subtype interactions. We note that such interactions are hardly expressed in existing models, and as a result, the interpretation of other types of current models can be difficult. We also notice that the sensitivities of individual classifiers are low, and at first glance, we may think such models are not powerful enough to be used in practice. Given the combined classifiers' superior performance, we can immediately infer that the populations are heterogeneous, and as a result, a single model in the literature must have low performance in all samples. A combination of individual classifiers should lead to better performance. However, the number of genes can be large, and their practical value can be in doubt. Fortunately, our S4 classifiers will not suffer such problems. Tables 2 and 3 present some patients' gene expression values, competing classifier factors, and predicted probabilities defined in Equations (5)- (10) . For full tables, we refer to tables in a Supplementary Excel File. Tables 2 and 3 . Table 2 : visualization of gene-gene relationship and gene-risk probabilities. Note that 0.5 is the probability threshold. The left panel is for CF-I with the X-axis being MND1, Y-axis being KIAA1614, and Z-axis being SMG1. The risk probability bar is on the right. The middle panel is for CF-II with the X-axis being MND1, Y-axis being ABCB6, and Z-axis being SMG1. The risk probability bar is on the right. The right panel is for CF-III with the X-axis being RIPK3 and the Y-axis being the risk probability. Table 3 : visualization of gene-gene relationship and gene-risk probabilities. Note that 0.5 is the probability threshold. The left panel is for CF-I with the X-axis being MND1, Y-axis being KIAA1614, and Z-axis being SMG1. The risk probability bar is on the right. The middle panel is for CF-II with the X-axis being MND1, Y-axis being ABCB6, and Z-axis being SMG1. The risk probability bar is on the right. The right panel is for CF-III with the X-axis being RIPK3 and the Y-axis being the risk probability.

From Figures 1 and 2 , we can see clear patterns of how the patients are classified and how they are correlated with individual genes. For example, some patients can be classified by one gene, RIPK3 (the right panels in the figures), while some patients are classified by the combined effects of linear combinations of three genes (the left and middle panels).

As a result, these observations justify the existence of three genomic signature patterns, i.e., the three competing classifiers of COVID-19.

We can also see similar patterns between Figures 1 and 2 . This phenomenon is mainly due to the component genes and signs of coefficients in the classifiers CF-i-(TPM) and CF-i-(EC) being the same. In addition, the linear correlation coefficients between TMP and EC data for genes ABCB6, KIAA1614, MND1, RIPK3, and SMG1, are 0.87, 0.94, 0.95, 0.68, and 0.93, respectively, which supports the pattern similarity.

The same signs of coefficients in the classifiers CF-i-(TPM) and CF-i-(EC) reveal that these classifiers are robust to nonlinear transformations and the five genes are critical and COVID-19 specific (recall that the patients in the control group are also hospitalized patients with 16 of them being ICU patients).

The pairwise correlation coefficients among these five genes are presented in Table 4 . The table shows that TPM data and EC data show different gene-gene correlations. Even so, they still lead to the same accuracy. As a result, they can be used as cross-validation of the proposed S4 model (4) and the selected genes. Finally, we note that the classical cross-validation works under a homogeneous population, i.e., it does not apply to heterogeneous populations. (5) and (6), we can see that increasing ABCB6 and RIPK3 expression levels (TPM) and decreasing KIAA1614 and MND1 expression levels will help the patients. However, from Table 4 , we see that ABCB6 is positively correlated with KIAA1614 and MND1, and then, an increase in ABCB6 expression level may result in an increase in MND1 expression level and KIAA1614 expression level, which increases the COVID-19 competing risk CF-I. As a result, any efficient treatments of COVID-19 must consider the functional effects of the discovered genes.

Note that RIPK3 does not appear in the classifiers CF-I and CF-II, and the signs of SMG1 in CF-I and CF-II are different. As a result, a vaccine/antiviral drug/therapy which is based on the function of SMG1 (an mRNA gene, with positive and negative coefficient signs) in the CF-I (CF-II) may benefit the patients classified in the CF-I (CF-II). However, it may not have any effects on the patients in the CF-III. Furthermore, it may make the status of patients in the CF-II (CF-I) worse. Analogously, a vaccine/antiviral drug/therapy which is based on the function of RIPK3 (CF-III) may not be effective for the patients affected by the effects of CF-I and CF-II. The vaccines based on the current knowledge and technology have not used any messages from gene-gene interactions and gene-subtype interactions. Given the uncertainty and the unknown side effects of the first-generation vaccines, if the next generation vaccines can utilize the information discovered in the genomic signature patterns of COVID-19, it can be hoped the efficacy can be largely improved, especially for those mRNA type vaccines. As such, these observations reveal that we may need at least three different types of vaccines against COVID-19 subtypes (variants).

Note that these five critical genes have not been reported in any papers except Zhang (2021) [2] . They are not from any single gene pathway. Analogously, their functions may be described as a basketball team's combination effects. First, in a basketball team, there are five positions: center (C), power forward (PF), small forward (SF), point guard (PG), and shooting guard (SG). A combination of ABCB6-MND1-SMG1 (KIAA1614-MND1-SMG1) may be described as a driving force of a powerful PF-C-PG (SF-C-PG) combination in scoring, and RIPK3 is like a powerful SG. Second, the expression levels are comparable to the playing time of the players and their roles in the rotations competing against different opponents and their playing combinations. Third, the driving forces of winning games can be either one or two or all of the three combinations.

Note that the correlation coefficients among the five genes calculated from TPM (upper triangle) and EC (lower triangle) in Table 4 are different. This phenomenon can be explained by the nonlinear relationship between TPM and EC, within TPM and EC, and heterogeneous populations among patients, which is a perfect scenario for the proposed model in Section 2. In addition, note that due to 100% accuracy, metrics such as ROC, recall, and precision are also with 100% accuracy.

In Section 3.3, we saw that each signature could be used as a classifier given its 100% specificity. However, from Figures 1 and 2 , we see that some patients can only be classified by one particular signature classifier, some patients can only be classified by two competing classifiers, and some patients can be classified by any of the three competing classifiers. This observation shows that COVID-19 patients are heterogeneous, and their COVID-19 status can be further classified into subtypes using the classifier combinations. Figure 3 uses Venn diagrams to plot seven classified subtypes of 100 COVID-19 patients. In the figure, I-II means both CF-I and CF-II lead to the correct classification. All other combinations are interpreted similarly.

We first state that the more intersections of subareas, the more complicated the disease in a Venn diagram. For example, in the lower panel of the figure, we see that seven patients satisfy the classification conditions of all three competing classifiers. It turns out all of these seven patients are ICU COVID-19 patients. Using the lower panel as an example, we identify the ICU patients have the distribution CF-I (6), CF-II (8), CF-III (2), CFs-I-II (6), CFs-I-III (12), CFs-II-III (9), and CFs-I-II-III (7) and find that the disease severity (ICU) is positively correlated with the number of classifiers used. Based on this observation, we can conclude vaccines can benefit patients even if a particular type of vaccine only works for one signature pattern related to COVID-19 subtypes. On the other hand, if one particular type of vaccine can protect a particular COVID-19 subtype (or SARS-CoV-2 virus), this vaccine may not be effective for other COVID-19 subtypes (or SARS-CoV-2 viruses.) As a result, a fully vaccinated individual still has the risk of being infected. Furthermore, a recovered individual from an infected COVID-19 illness can get breakthrough infections again with other COVID-19 subtypes. Two recent papers report concerns about infection-enhancing anti-SARS-CoV-2 antibodies based on lab experiments [17, 18] . This phenomenon may be explained by our new findings due to the existence of three genomic signature patterns and seven subtypes of COVID-19. Taking the SMG1 gene as an example, an increase (or a decrease) of SMG1 expression levels that are good for one signature pattern of COVID-19 will be bad for another signature pattern.

Note that the top and lower panels classify some patients into different subtypes. This phenomenon can be explained. In Section 3.4, we calculated the linear correlation coefficients between TPM and EC for each gene and saw there are nonlinear relationships, which lead to different classifications but are still accurate. Given that both TPM and EC lead to accurate results, it is safe to say that the identified five genes can be truly critical in studying COVID-19.

The identified subtype information opens a new research direction, new drug developments, and new refined personalized therapies. For example, in the diagnosis stage, medical doctors can use the final model to predict their patient's COVID-19 status by calculating the risk and determining which of those seven groups the patient belongs to. In the treatment stage, those signature patterns can be used to study the effectiveness of drugs and treatments, i.e., conduct clinical trials based on classified groups. In the drug development stage, pharmaceutical companies can use the findings of critical genes to study new drugs. Finally, it can be hoped that mRNA-based therapies can be introduced using the critical genes' information in the therapy stage. 

From Section 3.3, we see that COVID-19 patients have higher combination expression values and COVID-19 free patients have lower combination expression values. With 100% sensitivity and 100% specificity, the competing classifiers derived in Section 3.3 build a biological equivalence to COVID-19 status. Equations (5) and (6) together with 100% sensitivity and 100% specificity reveal the hyperplanes formed by five genes, and their derived classifiers partition a five-dimension space into two subspaces (COVID-19 and COVID-19 free) in which there is a mathematical equivalence between COVID-19 and COVID-19 free. Here, the logic is from the fact of 100% sensitivity and 100% specificity, i.e., if A implies B, then not B implies not A, and if not A implies not B, then B implies A.

In Section 4.1, we will use hydraulic engineering of a reservoir dam with cracks to describe COVID-19 variants metaphorically. Figure 4 uses the five genes identified in Section 3.3 and one additional gene, CDC6 (to be discussed in Section 4), to describe a conceptual framework for COVID-19 disease and variants formation dynamics. In Figure 4 , there are four layers. The first (top) layer stands for SARS-CoV-2 viruses entering a human's interior body. The second layer shows the lung will be affected. The third layer describes gene-gene interaction signature patterns of COVID-19. The bottom layer is a conceptual illustration of a human genome sequence with the five critical genes placed on the sequence. In the figure, we use triangles to represent signature patterns (competing factor classifiers), with the genes on the nodes and the classifier number inside the center. RIPK3 is on its own as an absorbed triangle. There are two arrows to indicate the cause dynamics. With the triangles or RIPK3, the larger the triangle or the shape of RIPK3, the higher the severity of the COVID-19. Our conceptual idea is that after being infected with SARS-CoV-2 (top-down direction), the virus triggers the signature patterns to function, i.e., making the triangles (the shape) larger. However, simultaneously, the human's immune system starts to function (bottom-up direction), and the vaccine also starts to function; therefore, the areas of triangles (shape) can be reduced, or the edges of the triangles can be broken, i.e., two ways of fighting against the virus. Depending on which direction (infection or 'killing' the virus) is more effective, an infected individual may be fully recovered from COVID-19 disease or suffer much more severe COVID-19 symptoms.

On the other hand, the genomic signature patterns of a COVID-19 patient represent the advanced (deep level) gene-gene interactions. A change in the size of the triangle may trigger new variants to form and transmit to other individuals, i.e., an analog to the hydraulic engineering example in Section 4.1.

In this section, we present the distributions of clinic variables, sex, age, and ICU status, and seven subtypes, in Table 5 . Table 5 shows that ranges of age among all subtypes are similar, i.e., from 20 s to 80 s. More males were hospitalized than females. Patients with the subtype sub-I were more likely (6:9 vs. 8:17) to need ICU than those with the subtype sub-II. The reason why this could happen is not known. Comparing CF-I and CF-II, we see that KIAA1614 is an unknown gene, and the sign of SMG1 is negative in CF-I, which may be the causal factor. In addition, sub-V and sub-VI share similar features, and sub-V contains KIAA1614 and a negative coefficient sign of SMG1. We note that if all things are wrong, i.e., the sub-VII subtype, all patients were ICU patients. Therefore, it tells us the sub-I subtype is more puzzling, likely due to the unknown gene KIAA1614. Table 6 reports the fitted coefficient values for four critical genes and related sensitivities and specificities of competing risk classifiers using raw counts. Comparing Tables 1 and 6 , we can immediately see that Table 6 does not contain a CF-III classifier (RIPK3), and the coefficient signs of KIAA1614 in CF-I and SMG1 in CF-II are negative, which are different from their counterparts in Table 1 . This observation justifies that there are more than three genomic signature patterns and seven subtypes. Recall that the controls in the first and second datasets were hospitalized patients with non-COVID-19 diseases, while the control in this third dataset is healthy subjects. As a result, it is safe to conclude that the five genes, ABCB6, KIAA1614, MND1, RIPK3, and SMG1, are truly COVID-19 specific, and the newly proposed classification method is a powerful tool.

Note that the coefficients of MND1 in both Tables 1 and 6 are uniformly positive. This observation suggests that MND1 (meiotic nuclear divisions 1) may be the most important gene related to COVID-19. Table 7 lists gene expression values, competing classifiers' values among all patients in the third dataset. 

The proposed method is different from the current diagnostic methods in several ways. First, our new method (S4) theoretically leads to finding the smallest number of genes with clear signature patterns which are interpretable. Second, our method directly deals with heterogeneous populations and performs natural clustering and classifications of samples into their respective groups. Third, our proposed method can describe gene-gene interactions and gene-subtypes interactions.

Simultaneously observing the same set of five genes for two different datasets has not been reported in published literature. In our opinion, those published genes by many other studies are more like at the surface level (biologically directly related to the disease) based on the analysis methods used, and the new set of five genes in this work is at the deep level or the root level, where genes are not directly related to the diseases by the present biological knowledge. Furthermore, many reported key genes are based on their individualized expression value changes and significance, i.e., not based on genegene interactions. As a result, treatments are palliative, and the disease is difficult to cure. The findings in our new research are based on nonlinear and competing risk factors interactions, which is an advanced gene-gene interaction mechanism. Our proposition is that the biomedical discovery of new variants of COVID-19 is only the surface level of the virus (diseases). More profound, underlying "competing factors" of the virus need to be studied. Metaphorically, an expert in hydraulic engineering finds a dam with cracks and treats them on the surface. However, the reservoir has an interconnected water dynamic below the surface that will further impact other points of the dam. As a result, it will cause new cracks unless there is a fundamental treatment solution with the entire structure in mind. Similarly, scientists may observe the variants (rather passively) and develop vaccines in response to the variants. However, knowledge of the virus's deeper advanced genomic architecture that will systematically cause other mutations is limited. Traditional methods in statistics, machine learning, and AI are limited to understanding COVID-19 from surface-level observations. However, our innovative method has achieved significant results in identifying and understanding COVID-19 genomic signatures.

The newly identified genes and their combinations may be used as new biomarkers. In our opinion, traditional methods (e.g., PCR, serology) are directly associated with the disease symptoms, i.e., they do not provide pathological characteristics; they are a onefold indicator. On the other hand, the new classifier is a multifold indicator that can further divide the disease into subgroups (variants may be another word). In addition, the new classifier shows gene-gene interactions and advanced (or root) structures.

This work has verified that when all component classifiers simultaneously classify a group of patients as COVID-19-positive, these patients are ICU patients (Section 3.4), which definitely points out the advanced genomic structure of COVID-19 disease.

In the literature and the current practice, tremendous efforts have been made to study COVID-19 genomic sequences, variants and their impacts, and vaccine effectiveness. However, the progress on the pathological causes of COVID-19 and the functional effects of genes is still limited. In terms of computational medicine, our new work is the first to accurately define the functional effects of five critical genes and lead to the mathematical and biological equivalence between five genes and COVID-19 status. Furthermore, this paper introduces an advanced machine learning algorithm that identifies five essential genes, which further determine three genomic signature patterns and seven subtypes of COVID-19 with high accuracy. The final classifiers are expressed by explicit mathematical equations which are interpretable and can guide medical practice. In addition, new graphical diagnostic tools are introduced. Besides the striking advance in studying genomic signature patterns of COVID-19, our work also sheds new light on computational medicine, genetic studies, informatics, algorithm and machine learning, and statistics.

We realized readers would ask about the model overfitting and robustness. Please note that our model is fitted to three different datasets and has reached the highest accuracy. Each dataset has its heterogeneous patterns (subgroups). Datasets are measured at different scales. It is hard for the existing models to simultaneously fit such datasets and get satisfactory accuracy, not to mention the interpretability of the fitted models. Using three such datasets naturally serves as cross-validation and robust checking. It turns out our new approach is robust.

In many scenarios, a 100% accuracy may be thought of as "too good to be true". However, "too good to be true" may also be dangerous to use to guide science discovery and innovation. In many applied sciences, the truths can be simple but not straightforward. Complicating or aggravating the problem can mask the nature of the problem. Blindly insisting on "too good to be true" may miss ample opportunities of finding the truth. In contrast, we know it is hard to see the forest through the trees.

One may argue that the dataset we used in this analysis is not large enough as it has only 126 samples with 19,472 predictors in the first two datasets and 34 samples with 64,083 genes in the third dataset. It is, of course, preferable to have a large dataset. Nevertheless, we argue that the conclusions and inferences are trustworthy with a convincing accuracy on three different datasets that show nonlinear and heterogeneous relationships.

On the other hand, if an approach cannot gain a satisfactory performance with a small dataset, applying it to a large dataset can be a wrong strategy as it may lead to wrong or suboptimal conclusions.

A natural question is whether or not the high accuracy is by chance. Note that each of our component competing classifiers has reached 100% specificity, which may be true with a probability smaller than 1/2 26 = 1.0 × 10 −8 by chance. In addition, when all three signature patterns are satisfied, all classified patients are lab-confirmed ICU patients, which indicates it cannot be by chance.

We have used analogies to interpret our modeling strategy. We note that our proposed method is still in its primary stage of development, just started in 2021. Researchers are still not very familiar with the method. Once the proposed method becomes a standard method, the introduction of the method and analogies will no longer be needed.

As discussed in Sections 1 and 3, the three signature patterns and seven subtypes maintain the most important biological informatics of COVID-19. This set of genes is the only set that leads to accurately classifying hospitalized COVID-19 patients, including ICU patients, into their respective groups. Unless a different new discovery of other advanced structures of genes other than these five genes can be obtained, and such a new discovery (if exists) can fully explain the three signature patterns and seven subtypes discussed in this paper, these five critical genes and their derived three signature patterns and seven subtypes remain the most informative findings.

When a model is fitted to the whole dataset and leads to (nearly) perfect accuracy, it will uniformly work for partitioned data as long as the partition is balanced to all heterogeneous subgroups. This is the case in our analysis. On the other hand, we have not seen any published papers that used the "standard" procedure to lead to accurate prediction.

We note that multiplying a constant to Equations (5) and (6) will not change the classification results and the shapes in Figures 1 and 2 with a convincing accuracy being achieved. However, the color strengths will be changed. Such a phenomenon justifies the use of signatures to describe the advanced gene-gene interaction structures. Using this idea, we can express (6) 

Equation (11), after applying scale changes, shares similar signatures as in Equation (5). Considering the nonlinear relationships between TPM data and EC data, this observation again proves our proposed competing factor classifiers are robust.

Our findings can be used to develop precision test kits for testing COVID-19 and to evaluate the function and performance of already implemented vaccines, i.e., used as new antibody indexes. Interpretable and implementable formulas are given in the paper. After a COVID-19 case is confirmed, personalized treatments can be implemented. For example, increasing or decreasing levels of critical genes based on the identified COVID-19 subtype can be crucial to the patients' recovery. Using the relationship determined in the findings, antiviral drugs can be developed. Mathematically, the new objective function of Equation (4) is a mixture of combinatorial optimization and continuous optimization. It is a new type of classification benchmark. It is expected that this new classification formula will motivate research in statistics, computational mathematics, computer science, and many applied sciences. The findings can motivate many new types of research in COVID-19 studies and other studies, e.g., cancer studies. Many finished studies can re-start new investigations using the new methodology.

With the mathematical equivalence between Equations (6) and (11) , different numbers of patients in different subtypes and in the control will not change the classification results. With the specificities of each individual classifier (CF-i) reaching 100%, the signature patterns (not presented in Section 3.3) of patients in the control will be the same, i.e., just one type. This phenomenon shows that the identified five critical genes are COVID-19 specific. One can apply our method to study critical genes of disease types (also other respiratory diseases) in the control, i.e., assign the related samples in the treatment group and specify some other types of diseases or normal cases as the control.

We want to hypothesize that the discovered COVID-19 variants (alpha, beta, delta, lambda, mu, omicron, etc.) may be connected to different signature strengths in our discovered signature patterns. COVID-22 in Figure 4 means that SARS-CoV-2 variants in 2022 can be combinations of several subtypes (variants), i.e., they are no longer the same types as in 2019. Mathematically, hyperplanes in geometry formed by Equations (5) and (11) contain a subspace that can be further partitioned into subspaces. Therefore, we hypothesize that these variants may fall into separable subspaces. After obtaining new data with variants information, this hypothesis can be tested, or additional genes may be involved. For example, in the breast cancer study mentioned earlier, triple-negative breast cancer was accurately separated from other types of breast cancer using three genes identified by the S4 classifier. Furthermore, the discovered genomic signature patterns and COVID-19 subtypes are intrinsic no matter what variants have been identified or will be identified. Given our proven mathematical and biological equivalences, if these innate signature patterns and subtypes cannot be treated and fully studied now, they will cause trouble in the future again. In addition, with available data related to various variants, our study approach may be able to reveal the causes of higher transmission or mortality of specific variants.

Using classifiers CF-I, -II, and -III as new biomarkers, we can study other potential genes that are highly correlated with these biomarkers. For example, based on the first and second datasets, the most highly correlated five genes to each new biomarker lead to a total of fifteen genes: DBN1, LY6G6C, TMEM54, MTMR1, SNORC, ANP32E, ATAD2, SMC2, ZWILCH, SMC4, C6orf47, STRADA, LRSAM1, UNC93B1, and SASH3, which were listed in the Introduction.

Finally, in our analyses, we also found the gene CDC6 (cell division cycle 6) can be informative. Its combination with ZNF282 (zinc finger protein 282) can lead to 97.62% accuracy (98% sensitivity, 96.15% specificity), and its combination with both ZNF282 and CEP72 (centrosomal protein 72) can lead to 98.41% accuracy (99% sensitivity, 96.15% specificity). We found the high expression level of CDC6 increases risk while the high expression levels of ZNF282 and CEP72 decrease risk. Although they did not lead to the best accuracy as those five genes identified in Section 3 did, such a performance is already better than other published gene combinations. From the literature, the gene CDC6 is a protein essential for the initiation of DNA replication, while ZNF282 is known to bind U5RE (U5 repressive element) of HLTV-1 (human T cell leukemia virus type 1) with a repressive effect, but little is known of its expression and function otherwise. The gene CEP72 coded protein is localized to the centrosome, a non-membraneous organelle that functions as animal cells' major microtubule-organizing center. Zhang (2022) hypothesizes that CDC6 is a protein essential for the initiation of RNA replication of COVID-19 [19] .

Data used in this study are from hospitalized patients' blood samples. We are not sure whether or not the identified genes work for data sampled from those non-hospitalized COVID-19 patients or asymptomatic patients. Solving optimization problems (4) involves combinatorial optimization, integer programming, and continuous programming. The computational complexity is extremely high, and we have not figured out how to define the complexity. We used an extensive Monte Carlo search method to find the best solution. However, we cannot be sure whether additional sets of genes can also be the optimal solutions even if our finding of five genes is already the best (smallest) subset of genes with the desired accuracy in the study of COVID-19. Although we have established the mathematical and biological equivalences, we cannot tell our findings are the causes or the results. Although we have identified functional effects via gene-gene interactions and gene-subtype interactions of the five genes, we have not identified how genes interact with each other and their causal directions. We are working in this direction. As the proposed method is still at the early stage of development, just started in 2021, assumptions for future use of the results of this study have been applied. Finally, due to the lack of available new blood sampled data for new variants, it is difficult to infer the risks of variants. Furthermore, what we have obtained are computational results and it is not necessary the case that the results will follow the same pattern in vitro/in vivo or in clinics.

The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/vaccines10050761/s1, Tables S1 and S2 with complete patients' IDs and expression values corresponding to Tables 2 and 3 in the main text. Real data and computer outputs are in a supplementary file available online and submitted together with this paper. In addition, a MATLAB ® demo code for solving a final dataset example in Equation (4) (λ 2 = 0) is also available.

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project was partially supported by NSF grant DMS-2012298.

The study did not require IRB approval.

Data Availability Statement: The datasets are publicly available. The data links are stated in the Data Description Section.

Large-scale multi-omic analysis of COVID-19 severity

Five critical genes related to seven COVID-19 subtypes: A data science discovery

The quest to find genes that drive severe covid

COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19

Genetic mechanisms of critical illness in COVID-19

Dexamethasone in hospitalized patients with COVID-19

Development and validation of a clinical and genetic model for predicting risk of severe COVID-19

Inborn errors of type I ifn immunity in patients with life-threatening COVID-19

Autoantibodies against type I ifns in patients with life-threatening COVID-19

Failure to replicate the association of rare loss-of-function variants in type IIFN immunity genes with severe COVID-19

Genetic association analysis of SARS-CoV-2 infection in 455,838 UK biobank participants

Association of toll-like receptor 7 variants with life-threatening COVID-19 disease in males: Findings from a nested case-control study

Lift the veil of breast cancers using 4 or fewer critical genes. Cancer Inform

PSMC2 and CXCL8-Modulated Four Critical Gene-Based High Performance Biomarkers for Colorectal Cancer

Functional effects of four or fewer critical genes linked to lung cancers and new sub-types detected by a new machine learning classifier

Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans

In vitro and in vivo functions of SARS-CoV-2 infection-enhancing and neutralizing antibodies

Infection-enhancing anti-SARS-CoV-2 antibodies recognize both the original Wuhan/D614G strain and Delta variants. A potential risk for mass vaccination?

ATP6V1B2 and IFI27 and their intrinsic functional genomic characteristics associated with SARS-CoV-2. bioRxiv 2022

The author thanks the academic editor and four anonymous referees for their critique and constructive comments, which greatly improved the presentation of the paper. The author is also thankful for the helpful information, discussions, and assistance from Jing Qin, Yuqing Xu, Hao Yang Teng, Ariel Qihui Tao, Jing Zhang, Steven Moen, Alex Hayes, and many other people who discussed the project with the author and encouraged the author to put the effort into this urgently needed study. Finally, the partial support from NSF-DMS2012298 is acknowledged.

The author declares no conflict of interest.