key: cord-0301353-y0rb5lyl
authors: Zhang, Zhengjun
title: ATP6V1B2 and IFI27 and their intrinsic functional genomic characteristics associated with SARS-CoV-2
date: 2022-01-14
journal: bioRxiv
DOI: 10.1101/2022.01.13.476223
sha: 2b82bb152886dd8238f9baf0e05973dacf83e2fe
doc_id: 301353
cord_uid: y0rb5lyl

Genes functionally associated with SARS-CoV-2 and genes functionally related to COVID-19 disease can be different, and distinguishing the two is an essential first step in fighting the COVID-19 pandemic. Unfortunately, this first step has not yet been completed in biological and medical research. Using a newly developed max-competing logistic classifier, we find that two genes, ATP6V1B2 and IFI27, stand out as critical in the transcriptional response to SARS-CoV-2 in samples with differential expressions derived from NP/OP swab PCR. The evidence is that combining these two genes with one other gene predicts disease status with better predictive power than existing classifiers with the same number of genes. In addition, combining these two genes with three other genes to form a five-gene classifier outperforms existing classifiers with ten or more genes. With their exceptional predictive power, these two genes can be a new focus and direction in fighting the COVID-19 pandemic. We compare the functional effects of these genes with those of a five-gene classifier from the literature that was identified and tested on blood samples with 100% accuracy. Genes with their transcriptional response and functional effects to SARS-CoV-2 differ significantly from genes with their functional signature patterns to COVID-19 antibody; the former can be interpreted as the appearance of a phenomenon and the latter as the essence of the disease. These findings can help explore the causal and pathological links between SARS-CoV-2 and COVID-19 disease and support the fight against the disease with more targeted vaccines, antiviral drugs, and therapies.
The infection rates of the COVID-19 pandemic have fluctuated like sea waves, with many small ones and several big ones in the past two years. In the meantime, variants of SARS-CoV-2 have emerged and kept scientists and medical practitioners on high alert, and many problems remain unanswered 1;2;3;4;5;6;7;8;9;10;11. In addition, there have been new concerns with COVID-19 disease, e.g., SARS-CoV-2 enters the brain 12, COVID-19 vaccines complicate mammograms 13, and memory loss and 'brain fog' 14, amongst others. However, these new concerns are observational and experimental outcomes; they lack genetic bases because there have been no effective analytical methods to link COVID-19 to the conditions concerned. Regarding gene expression samples, the literature has not pointed out the significant difference between samples with differential expressions derived from NP/OP swab PCR and samples derived from [...] these two genes and one other gene together can easily reach overall accuracies between 87.2% and 89.74%, which suggests that these two genes can be fundamental. Combining all five genes gives an overall accuracy of 91.88%, a sensitivity of 94.62%, and a specificity of 90.08%, which are higher than those of classifiers with 10 or more genes in the literature. In the analysis of the second dataset, a combination of the above five genes led to an overall accuracy of 93.39%, a sensitivity of 98.37%, and a specificity of 53.70%. Many other combinations are illustrated in the Data section. These performance results from different combinations suggest that COVID-19 can have many different variants. Unlike the blood-sample studies in the literature, no combination applied to PCR gene expressions has reached 100% accuracy. There are three possible reasons: 1) the samples themselves can be false positives or false negatives from PCR tests; 2) sample signals were weak, and counts were inaccurate; 3) experimental conditions vary.
We note that there are many zero expression values in the second dataset, which may be the reason for its low specificity. The two critical genes, ATP6V1B2 (ATPase H+ Transporting V1 Subunit B2) and IFI27 (Interferon Alpha Inducible Protein 27), had previously been reported to be associated with several diseases. For example, a de novo mutation in ATP6V1B2 was found to impair lysosome acidification and cause dominant deafness-onychodystrophy syndrome 23, while IFI27 was found to discriminate between influenza and bacteria in patients with suspected respiratory infection 24, among others. The significant differences in gene functional effects, gene-gene interactions, and gene-variant interactions between blood-sampled and PCR-sampled gene expressions indicate that ATP6V1B2 and IFI27 are associated with SARS-CoV-2, which points to a new direction for developing more effective vaccines and antiviral drugs. On the other hand, the functional effects of ABCB6, KIAA1614, MND1, SMG1, and RIPK3 can be critical to understanding the disease.

The contributions of this paper include: 1) signifying the genomic difference between PCR samples and blood samples (hospitalized patients); 2) identifying a single-digit number of critical genes (ATP6V1B2, IFI27, BTN3A1, SERTAD4, EPSTI1) that are part of the transcriptional response to SARS-CoV-2; 3) presenting interpretable functional effects of gene-gene interactions and gene-variant interactions using explicit mathematical expressions; 4) presenting graphical tools for medical practitioners to understand the genomic signature patterns of the virus; 5) making suggestions on developing more efficient vaccines and antiviral drugs; 6) identifying potential genetic clues to other diseases due to COVID-19 infection.

The remainder of the paper is organized as follows. Section 2 briefly reviews the methodology. Section 3 reports the data source, analysis results, and interpretations. Finally, Section 4 concludes the study.
Much medical research, especially research on gene expression data, applies classical logistic regression as a starting point and then adds more advanced machine learning methods. However, Teng and Zhang (2021) 25 point out that classical logistic regression can only model absolute treatments, not relative treatments, and as a result it has led (and will lead) many supposedly efficient trials to be wrongly concluded as inefficient. Four clinical trials, including one COVID-19 trial, are illustrated in their paper. Their AbRelaTEs regression model for medical data is much more advanced than classical logistic regression, as it greatly enhances interpretability and computability for truly personalized medicine. Our study differs from AbRelaTEs in that we do not deal with treatment and control; instead, we use a new method to study the existence of functional effects of genes associated with SARS-CoV-2. The competing risk factor classifier has been successfully applied in the literature 15;18;19;20. Because the data structures used in this work are different, this section briefly introduces the necessary notation and formulas to keep the paper self-contained. For continuous responses, 26;27;28 deal with max-linear competing factor models and max-linear regressions with penalization. The max-logistic classifier has some connections to logistic polytomous models but with a different structure 29;30;31.

Suppose $Y_i$ is the $i$th individual patient's COVID-19 status ($Y_i = 0$ for COVID-19 free, $Y_i = 1$ for infected) and $X_i^{(k)} = (X_{i1}^{(k)}, X_{i2}^{(k)}, \ldots, X_{ip}^{(k)})$, $k = 1, \ldots, K$, are the gene expression values, with $p = 15979$ or $p = 35784$ genes in this study. Here $k$ stands for the $k$th type of gene expression levels, drawn with $K$ different biological sampling methodologies. Note that most published work sets $K = 1$, and hence the superscript $(k)$ can be dropped from the predictors.
In this paper, $K = 4$: we have two datasets, and the first dataset also contains patients with other acute respiratory illnesses (ARIs), either viral or non-viral. Using a logit link (or a probit link, or a Gumbel link), we can model the risk probability $p_i^{(k)}$ of the $i$th individual as

$$\log\left(\frac{p_i^{(k)}}{1 - p_i^{(k)}}\right) = \beta_0^{(k)} + X_i^{(k)} \beta^{(k)} \qquad (1)$$

or alternatively, we write

$$p_i^{(k)} = \frac{\exp\left(\beta_0^{(k)} + X_i^{(k)} \beta^{(k)}\right)}{1 + \exp\left(\beta_0^{(k)} + X_i^{(k)} \beta^{(k)}\right)} \qquad (2)$$

where $\beta_0^{(k)}$ is an intercept, $X_i^{(k)}$ is a $1 \times p$ observed vector, and $\beta^{(k)}$ is a $p \times 1$ coefficient vector which characterizes the contribution of each predictor (gene in this study) to the risk.

Considering that there have been several variants of SARS-CoV-2 and multiple symptoms (subtypes) of COVID-19 disease, it is natural to assume that the genomic structures of the subtypes can be different. Suppose that all subtypes of COVID-19 disease may be related to $G$ groups of genes

$$\Phi_{ij}^{(k)} = \left(X_{i,j_1}^{(k)}, X_{i,j_2}^{(k)}, \ldots, X_{i,j_{g_j}}^{(k)}\right), \quad j = 1, \ldots, G,$$

where $i$ is the $i$th individual in the sample and $g_j$ is the number of genes in the $j$th group. The competing (risk) factor classifier is defined as

$$\log\left(\frac{p_i^{(k)}}{1 - p_i^{(k)}}\right) = \max\left(\beta_{01}^{(k)} + \Phi_{i1}^{(k)} \beta_1^{(k)}, \; \ldots, \; \beta_{0G}^{(k)} + \Phi_{iG}^{(k)} \beta_G^{(k)}\right) \qquad (3)$$

where the $\beta_{0j}^{(k)}$ are intercepts, $\Phi_{ij}^{(k)}$ is a $1 \times g_j$ observed vector, and $\beta_j^{(k)}$ is a $g_j \times 1$ coefficient vector which characterizes the contribution of each predictor in the $j$th group to the risk.

In (3), all components compete to take the most significant effect. When $G = 1$, (3) reduces to the classical logistic regression, i.e., the classical logistic regression is a special case of the new classifier. Compared with black-box machine learning methods (e.g., random forests and deep or convolutional neural networks (DNN, CNN)) and regression tree methods, (3) shows clear patterns. Each competing risk factor forms a signature with the selected genes. The number of factors corresponds to the number of signatures, i.e., $G$. This model can be regarded as a bridge between linear models and more advanced (black-box) machine learning methods. Nevertheless, (3) retains the desired properties of interpretability, computability, predictability, and stability. Note that this remark is the same as Remark 1 in Zhang 20.

In practice, we have to choose a threshold probability value to decide a patient's class label. Following the general trend in the literature, we set the threshold to be 0.5.
As such, if $p_i^{(k)} \le 0.5$, the $i$th individual is classified as disease free; otherwise, the individual is classified as having the disease. With the notation established above and the idea of the quotient correlation coefficient 32, Zhang (2021) 20 introduces a new machine learning classifier, smallest subset and smallest number of signatures (S4), defined through an optimization problem referred to below as (4); see Zhang 20 for its explicit form.

Two COVID-19 datasets to be analyzed are publicly available at https://github.com/czbiohub/covid19transcriptomics-pathogenesis-diagnostics-results 21 and as GSE152075 22. The first dataset contains 15979 genes, 93 patients who tested COVID-19 positive by PCR, 41 COVID-19 negative patients with viral acute respiratory illnesses (ARIs), and 100 COVID-19 negative patients with non-viral ARIs. The second dataset contains 35784 genes, individuals with PCR-confirmed SARS-CoV-2, and 54 negative controls. We note that many gene expression values in the second dataset are zero.

By solving the optimization problem (4) over all genes (15979 and 35784) with different combinations, various competing classifiers can be identified. As discussed in the Introduction, the gene expression data used in this study were drawn from PCR samples (not blood samples), and 100% accurate classifiers with a single-digit number of genes do not appear to exist for them. Also, at the same accuracy (smaller than 100%), different combinations of genes can be candidate classifiers. Therefore, we report the best-performing classifiers in this subsection. After an extensive Monte Carlo search for the best combinations of genes, five genes, ATP6V1B2, IFI27, BTN3A1 (Butyrophilin Subfamily 3 Member A1), SERTAD4 (SERTA Domain Containing 4), and EPSTI1 (Epithelial Stromal Interaction 1), are found to form the S4 classifiers.
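The Monte Carlo search is not specified in detail in this section. As a rough illustration only, the sketch below draws random gene subsets together with random coefficients, scores each candidate by its classification accuracy at the 0.5 threshold, and keeps the best one. The function name `random_search`, the Gaussian coefficient proposals, and the synthetic data are all assumptions made for illustration; this is not the author's procedure, and the objective is plain accuracy rather than the full criterion (4).

```python
import numpy as np

def accuracy(cf, y):
    """Accuracy of the rule 'predict 1 when the logit cf > 0', i.e. p > 0.5."""
    return float(np.mean((cf > 0).astype(int) == y))

def random_search(X, y, subset_size=3, n_draws=2000, seed=0):
    """Hypothetical random search over gene subsets and coefficients."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    best = {"accuracy": -1.0, "genes": None, "beta0": None, "beta": None}
    for _ in range(n_draws):
        # Propose a subset of genes and a coefficient vector for it.
        genes = rng.choice(n_genes, size=subset_size, replace=False)
        beta0 = rng.normal()
        beta = rng.normal(size=subset_size)
        cf = beta0 + X[:, genes] @ beta  # single-factor logit (G = 1)
        acc = accuracy(cf, y)
        if acc > best["accuracy"]:
            best = {"accuracy": acc, "genes": genes, "beta0": beta0, "beta": beta}
    return best

# Synthetic data: 200 samples, 30 "genes"; labels depend on genes 0 and 1 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = (X[:, 1] - X[:, 0] > 0).astype(int)

result = random_search(X, y)
print(result["genes"], round(result["accuracy"], 3))
```

A search of this kind gives no guarantee of optimality, and several different subsets can reach the same accuracy, which is consistent with the non-uniqueness noted in the text.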
Given that the first dataset has three categories (COVID-19 positive, ARIs with non-SARS-CoV-2 viral infection, and ARIs without viral infection), we also study the classification between COVID-19 positive and ARIs with non-SARS-CoV-2 viral infection, and between COVID-19 positive and ARIs without viral infection, which leads to $K = 4$ as stated in the prior subsection. Note that in (3) each individual component is itself a classifier of the form

$$\mathrm{CF} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_5 x_5 \qquad (5)$$

where $x_1, \ldots, x_5$ denote the expression values of the selected genes and $(\beta_0, \beta_1, \ldots, \beta_5)$ are coefficients. In the subsequent subsections, we use tables to present individual ($\mathrm{CF}_{i,j}$) and combined ($\mathrm{CFmax}_j$) classifiers representing (5), where $i$ is the index of the classifier and $j$ is the index of the dataset. The risk probabilities of each component classifier are

$$P_{i,j} = \frac{\exp(\mathrm{CF}_{i,j})}{1 + \exp(\mathrm{CF}_{i,j})},$$

and the risk probabilities based on all three component classifiers together are

$$\mathrm{Pmax}_j = \frac{\exp(\mathrm{CFmax}_j)}{1 + \exp(\mathrm{CFmax}_j)}, \quad j = 1, 2.$$

3.3 First dataset: Three-gene classifiers (G = 1)

Note that the results in this subsection are not from our final best-performing classifiers. We found that combinations of ATP6V1B2 and IFI27 with many other genes can lead to high-accuracy classifiers. We present their performance when combined with each of the remaining genes of the best subset of five genes in this paper and with one of the five critical genes found by Zhang 15. Tables 1 and 2 summarize the results. In both Tables 1 and 2, the coefficient signs of ATP6V1B2 and IFI27 are the same across all individual classifiers, which is a strong indication that they are truly associated with the virus. Although gene RIPK3 plays a key role in the perfect classifier identified in Zhang 15, its performance is inferior to that of the other three genes identified from PCR samples in this paper. This phenomenon reflects the discussion in the Introduction that RIPK3 is related to the natural essence of COVID-19, while ATP6V1B2, IFI27, BTN3A1, SERTAD4, and EPSTI1 contain more information about SARS-CoV-2.
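The way component classifiers compete in (3), and how the combined risk probability and class label follow from the largest component, can be sketched as below. This is a minimal illustration: the function names, the array layout, and all numbers are made up here and are not values from the tables.

```python
import numpy as np

def risk_probability(cf):
    """Logistic transform exp(cf) / (1 + exp(cf)), written in an overflow-safe form."""
    cf = np.asarray(cf, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * cf))

def competing_classifier(X, intercepts, coefs):
    """Return component logits, their row-wise maximum, and the combined probability.

    X          : (n, g) expression values of the selected genes
    intercepts : (m,)   one intercept per component classifier
    coefs      : (m, g) one coefficient row per component (0 = gene not used)
    """
    X = np.asarray(X, dtype=float)
    cf = np.asarray(intercepts, dtype=float)[None, :] + X @ np.asarray(coefs, dtype=float).T
    cf_max = cf.max(axis=1)  # the competing step: the largest component wins
    return cf, cf_max, risk_probability(cf_max)

# Made-up numbers, purely illustrative.
X = np.array([[1.0, 2.0, 0.5],
              [3.0, 0.2, 1.0]])
intercepts = np.array([-1.0, 0.5])
coefs = np.array([[-1.0, 1.0, 0.0],
                  [-0.5, 0.0, 0.4]])

cf, cf_max, p_max = competing_classifier(X, intercepts, coefs)
labels = (p_max > 0.5).astype(int)  # p <= 0.5 is classified as disease free
print(labels)  # [1 0]
```

With a single component (one row in `coefs`), the same function reduces to ordinary logistic scoring, matching the remark that classical logistic regression is a special case.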
We note that BTN3A1 can be combined with ATP6V1B2 and IFI27 in numerous ways that lead to the same accuracy; the same is true for SERTAD4 and for EPSTI1. The coefficients listed in Table 1 are just one particular choice. Also, for EPSTI1, we can get different sensitivities and specificities while maintaining the same accuracy. Among the four genes (BTN3A1, SERTAD4, EPSTI1, and RIPK3), EPSTI1 has the best performance in Tables 1 and 2. This empirical evidence supports the view that ATP6V1B2 and IFI27 are at the center of the genes associated with SARS-CoV-2.

Our extensive Monte Carlo search of the optimization problem (4) leads to a best solution, with an accuracy of 91.82%, formed by five genes, i.e., ATP6V1B2, IFI27, BTN3A1, SERTAD4, and EPSTI1, though the solution is not unique. These five genes stand out after comparing solutions for all three categories in the first dataset. Tables 3-5 summarize the results. Table 6 shows, for a subset of patients, the expression values of the five critical genes, the competing classifier factors, and the predicted probabilities. Note that because the values in Columns CF-1, CF-2, and CFmax are on relatively large scales, they are divided by 100 when computing the risk probabilities, as very large values can result in an overflow in computation. The validity of rescaling was justified in Zhang 17. Figure 1 presents critical gene expression levels and risk probabilities corresponding to different combinations in the first dataset and Tables 3-5. Each plot shows a genomic signature pattern and the functional effects of the genes involved.
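One property of the rescaling step can be checked directly: dividing a factor value by a positive constant changes the numerical value of the risk probability but not its position relative to the 0.5 threshold, because the sign of the factor is preserved. The sketch below illustrates this with made-up factor values; the function name and the overflow-safe `tanh` form are choices made here for illustration.

```python
import numpy as np

def risk_probability(cf, scale=1.0):
    """exp(cf/scale) / (1 + exp(cf/scale)), evaluated without overflow."""
    z = np.asarray(cf, dtype=float) / scale
    return 0.5 * (1.0 + np.tanh(0.5 * z))

# Made-up factor values on a large scale; a naive exp(900) would overflow.
cf = np.array([-850.0, -12.0, 0.0, 35.0, 900.0])

p_raw = risk_probability(cf)
p_scaled = risk_probability(cf, scale=100.0)  # division by 100, as in the text

labels_raw = (p_raw > 0.5).astype(int)
labels_scaled = (p_scaled > 0.5).astype(int)
print(labels_raw, labels_scaled)  # [0 0 0 1 1] [0 0 0 1 1]
```

The predicted labels agree, while the rescaled probabilities are less extreme, which makes them easier to display and compare.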
From Tables 1-5, we can immediately see that the coefficient signs associated with ATP6V1B2 are uniformly negative, which suggests that increasing the expression level of ATP6V1B2 will decrease the infection strength of the virus (SARS-CoV-2); the coefficient signs associated with IFI27 are uniformly positive, which suggests that decreasing the expression level of IFI27 will decrease the infection strength of the virus. These functional effects of ATP6V1B2 and IFI27 can also be seen clearly in Figure 1 around the origin: the higher the IFI27 level, the higher the risk probability (yellow color); the higher the ATP6V1B2 level, the lower the risk probability (blue color). These observations indicate that ATP6V1B2 and IFI27 are in the circle of genes associated with SARS-CoV-2. BTN3A1 appears three times in Tables 3-5 with positive coefficients, which suggests that decreasing the expression level of BTN3A1 will decrease the infection strength of the virus. The coefficients of SERTAD4 and of EPSTI1 are positive in some classifiers and negative in others in Tables 3-5, depending on how the genes are combined. These phenomena may explain why SARS-CoV-2 variants have emerged, as variants can be related to different coefficient signs of the genes.

Figure 2 is a Venn diagram illustrating the performance of each classifier and the combined classifier. In the Venn diagram, patients who fall in the intersections are relatively easy to test and confirm as positive, while those who fall in only one category are relatively hard to test and confirm. Two individual classifiers can be interpreted as two COVID-19 tests that use two different testing procedures; when both tests are positive, the probability of infection is higher, depending on the sensitivity and the specificity of each test.
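The remark on two positive tests can be made concrete with Bayes' rule. The sketch below assumes that the two classifiers err independently given the true infection status; that independence is an assumption made here for illustration and is not claimed in the text. The prior, sensitivities, and specificities are made-up numbers.

```python
def posterior_after_positives(prior, tests):
    """Posterior probability of infection when every test in `tests` is positive.

    prior : prior probability of infection
    tests : iterable of (sensitivity, specificity) pairs, assumed conditionally
            independent given the true status (an illustrative assumption)
    """
    p_pos_given_infected = 1.0
    p_pos_given_healthy = 1.0
    for sensitivity, specificity in tests:
        p_pos_given_infected *= sensitivity
        p_pos_given_healthy *= 1.0 - specificity
    numerator = prior * p_pos_given_infected
    return numerator / (numerator + (1.0 - prior) * p_pos_given_healthy)

prior = 0.4
one = posterior_after_positives(prior, [(0.90, 0.85)])
two = posterior_after_positives(prior, [(0.90, 0.85), (0.80, 0.90)])
print(round(one, 3), round(two, 3))  # 0.8 0.97
```

Under these assumed numbers, a second positive result raises the posterior from 0.8 to about 0.97; how large the gain is depends on the sensitivity and specificity of each test, as the text notes.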
Summarizing Tables 3-5 and Figure 2, mathematically speaking, SARS-CoV-2 can have 3 × 3 × 3 × 4 = 108 variants, some of which are not significantly different from dominant ones and some of which are dominant and have emerged (or will emerge). Here the multiplier 3 corresponds to the 3 classes in one Venn diagram, and the other factors are interpreted similarly. We note that the joint functional effects of genes are not directly observable, and the meaning of variants here is defined by the joint functional effects. As a result, these variants do not refer directly to the variants known in the literature and practice. Comparing the individual and combined classifiers among COVID-19 vs. all others, COVID-19 vs. ARIs with other viral infection, and COVID-19 vs. ARIs without viral infection, we see that the combined classifier for COVID-19 vs. ARIs without viral infection works the best. We found that some patients with other viral ARIs may be COVID-19 patients who have not yet been confirmed. If we apply the classifier in the bottom-right panel of Figure 2, we can get a sensitivity of up to 98.94% with a slight loss of specificity.

The five genes, ATP6V1B2, IFI27, BTN3A1, SERTAD4, and EPSTI1, achieved superior performance in classifying patients into their respective groups. In this subsection, we test their performance on a second dataset. One significant difference between the two datasets is that the patients in the first study (dataset) are either COVID-19 positive or have ARIs with or without other viral infection, while the patients in the second study (dataset) are either PCR-confirmed SARS-CoV-2 cases or negative controls. As a result, genes found to be critical in the first dataset can be thought of as SARS-CoV-2 specific. It turned out that those five genes are also the best subset for the second dataset. Table 7 presents the individual classifiers and the combined classifier. Data are ln(raw+1) normalized.
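The ln(raw+1) normalization can be sketched with NumPy's `log1p`. The count matrix below is made up for illustration; it only shows that zero counts, which are frequent in the second dataset, map to zero after the transform.

```python
import numpy as np

# Hypothetical raw expression counts (rows: samples, columns: genes).
raw_counts = np.array([[0, 9, 99],
                       [3, 0, 1500]], dtype=float)

normalized = np.log1p(raw_counts)  # ln(raw + 1); zero counts stay at 0

print(np.round(normalized, 3))
```

The transform is invertible with `np.expm1`, so the raw counts can be recovered if needed.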
We can see that the signs of ATP6V1B2 and IFI27 in CF1 remain the same as their counterparts in Tables 1-5, while the sign of ATP6V1B2 changes in CF2. This is not surprising, as CF1 has 91.32% overall accuracy while CF2 has only 27.07% accuracy. The table again supports our earlier claim that ATP6V1B2 and IFI27 are in the circle of critical genes associated with SARS-CoV-2. Note that the individual classifiers in the second dataset involve all five genes, while their counterparts in the first dataset involve only three genes. This can be explained by the differences in patients' attributes between the two datasets. Next, we compute the correlations among the five genes for each dataset. Table 8 presents pairwise correlations in matrix form, in which the upper triangle is for the first dataset and the lower triangle is for the second dataset. Table 8 shows different correlation structures among the five genes, which makes the difference in classifiers between the two datasets reasonable.

The results presented in this paper are the first to directly associate a few critical genes with SARS-CoV-2 with the best performance (relative to other subsets with the same number of genes). Furthermore, the results signify the genomic difference between PCR samples and blood samples (hospitalized patients), identify a single-digit number of critical genes (ATP6V1B2, IFI27, BTN3A1, SERTAD4, EPSTI1) that are part of the transcriptional response to SARS-CoV-2, present interpretable functional effects of gene-gene interactions and gene-variant interactions using explicit mathematical expressions, introduce graphical tools for medical practitioners to understand the genomic signature patterns of the virus, make suggestions on developing more efficient vaccines and antiviral drugs, and finally identify potential genetic clues to other diseases due to COVID-19 infection. In Zhang 17, a conceptual visualization of the gene-gene relationship was created. At the top of the figure, virus variants were placed.
With the new findings of this paper, the six signature patterns from Tables 3-5 can be used to replace those virus variants, and a complete dynamic flow can then be formed. As discussed in the Introduction, the genes identified in Zhang 17 are hypothesized to link to the root cause of COVID-19, while the genes identified in this study are key to treating the symptoms. Based on the findings in this paper, we make the following hypotheses. Hypothesis 1 is based on the mathematical and biological equivalence between COVID-19 disease and the functional effects of these five genes proved in Zhang 17. At the moment, testing Hypothesis 2 is more urgent than testing Hypothesis 1, given that variants of SARS-CoV-2 have been emerging and waves of COVID-19 have been arriving one after another. Once Hypothesis 2 is tested and confirmed, scientists can test the counterparts of these genes in animals, trace the virus origin, and find the intermediate host species of SARS-CoV-2. As to Hypothesis 3, in Zhang (2021), a combination of CDC6 and ZNF282 (Zinc Finger Protein 282) can lead to 97.62% accuracy (98% sensitivity, 96.15% specificity), which suggests that the protein encoded by CDC6 is essential for the initiation of RNA replication.

As mentioned in the Introduction, a mutation in ATP6V1B2 was found to impair lysosome acidification and cause dominant deafness-onychodystrophy syndrome 23, while IFI27 was found to discriminate between influenza and bacteria in patients with suspected respiratory infection 24. There have been new concerns with COVID-19 disease, e.g., SARS-CoV-2 enters the brain 12, COVID-19 vaccines complicate mammograms 13, and memory loss and 'brain fog' 14. Using the findings from this paper, we may hypothesize that ATP6V1B2 can be a leading factor linking COVID-19 to brain function and ENT problems. As to IFI27, given that COVID-19 is a respiratory tract infection, it makes sense to hypothesize that IFI27 is key to the infection.
EPSTI1 has been found to be related to breast cancer, oral squamous cell carcinoma (OSCC), and lung squamous cell carcinoma (LSCC) 33, which may link COVID-19 to the reported mammogram complications 13. Liang et al. 34 suggest that BTN3A1 may function as a tumor suppressor and may serve as a potential prognostic biomarker in NSCLCs and BRCAs. However, none of these findings has been confirmed. A confirmed Hypothesis 2 may help further explore whether these genes are truly effective, as suggested in the literature. Finally, with the demonstrated existence of signature patterns associated with SARS-CoV-2 and COVID-19, variants of the disease will continue to emerge if the problems revealed by the existing signatures are not solved. We have witnessed that after each peak of the COVID-19 pandemic, the world saw hope that the pandemic would end and the public lowered its guard; as a result, another wave (small or big) appeared. As such, we should not forget the pain once the gain follows, since existence determines recurrence, as noted by Murphy's law: "Anything that can go wrong will go wrong."

Doctors and nurses want more data before championing vaccines to end the pandemic: Health systems are launching bids to assure their medical workers that vaccines will be safe and effective. CNN
The quest to find genes that drive severe COVID
Mapping the human genetic architecture of COVID-19
Genetic mechanisms of critical illness in COVID-19
Dexamethasone in hospitalized patients with COVID-19
Development and validation of a clinical and genetic model for predicting risk of severe COVID-19
Inborn errors of type I IFN immunity in patients with life-threatening COVID-19
Autoantibodies against type I IFNs in patients with life-threatening COVID-19
Failure to replicate the association of rare loss-of-function variants in type I IFN immunity genes with severe COVID-19. medRxiv
Genetic association analysis of SARS-CoV-2 infection in 455,838 UK Biobank participants. medRxiv
Association of toll-like receptor 7 variants with life-threatening COVID-19 disease in males: findings from a nested case-control study. eLife
The S1 protein of SARS-CoV-2 crosses the blood-brain barrier in mice
COVID-19 vaccines complicate mammograms
Assessment of cognitive function in patients after COVID-19 infection
Five critical genes related to seven COVID-19 subtypes: A data science discovery
Large-scale multi-omic analysis of COVID-19 severity
The existence of at least three genomic signature patterns and at least seven subtypes of COVID-19 and the end of the disease. Editor invited minor revision submitted, waiting for editor's decision
Lift the veil of breast cancers using four or fewer critical genes. Cancer Informatics, in press
A perfect classifier for machine learning heterogeneous cohort studies: The puzzle and the future of colorectal cancers told by four critical genes
Functional effects of four or fewer critical genes linked to lung cancers and new subtypes detected by a new machine learning classifier
Upper airway gene expression reveals suppressed immune responses to SARS-CoV-2 compared with other respiratory viruses
In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age
A novel immune biomarker IFI27 discriminates between influenza and bacteria in patients with suspected respiratory infection
De novo mutation in ATP6V1B2 impairs lysosome acidification and causes dominant deafness-onychodystrophy syndrome
Directly and simultaneously expressing absolute and relative treatment effects in medical data models and applications
Polychotomous quantal response by maximum indicant
Max-linear competing factor models
Max-linear regression models with regularization
Econometric Models for Probabilistic Choice among Products
Advanced Econometrics
Discrete Data Models
Quotient correlation: a sample based alternative to Pearson's correlation. The Annals of Statistics
Contrasting functions of the epithelial-stromal interaction 1 gene, in human oral and lung squamous cell cancers
Comprehensive analysis of BTN3A1 in cancers: mining of omics data and validation in patient samples and cellular models

Real data and computer outputs are in a supplementary file available online and submitted together with this paper. A Matlab demo code for solving the optimization problem in Equation (4) (with $\lambda_2 = 0$) is also available. The datasets are publicly available. The data links are stated in the Data Description section.

The author declares no competing interests.

Solving the optimization problem (4) involves combinatorial optimization, integer programming, and continuous programming. The computational complexity is exceptionally high, and we have not yet worked out how to characterize it. We used an extensive Monte Carlo search to find the best solution. However, we cannot rule out that additional sets of genes are also optimal solutions. Although we have identified functional effects through gene-gene interactions and gene-subtype (variant) interactions of the five genes, we have not identified how the genes interact with each other or the causal directions of those interactions. We are working in this direction. Because newly sampled data for new variants are not available, it is difficult to infer the risks of variants. Finally, our results are in the field of computational biology/medicine, and they are not lab-confirmed.