key: cord-0781193-yq7rn03h authors: Luellen, E. title: A machine learning explanation of the pathogen-immune relationship of SARS-CoV-2 and machine learning models of prognostic biomarkers to predict asymptomatic or symptomatic infections date: 2020-07-29 journal: nan DOI: 10.1101/2020.07.27.20162867 sha: f88043e7245db74577a0dffe5ae7bef6e7ffb0d1 doc_id: 781193 cord_uid: yq7rn03h

Asymptomatic people infected during the SARS-CoV-2 pandemic have outnumbered symptomatic people by an approximate ratio of 4:1, with little understanding to date as to why; therefore, they have been difficult to identify. Moreover, studies indicate that most asymptomatic virus-positive patients are infectious, thereby creating a new public health danger via a plethora of "silent spreaders." This data science study made four novel discoveries that may significantly impact our understanding of the pathogen-immune relationship: (1) Spearman rho correlation coefficients and associated P-values identified that 33 of 55 common immune factors have statistically significant associations with SARS-CoV-2 morbidity, along with their direction (+/-) and strength, to inform research and therapies; (2) five machine learning algorithms were applied to 74 observations of these 33 immunological variables and identified three models of prognostic biomarkers that can classify and predict who will be asymptomatic or symptomatic if infected with 94.8% to 100% accuracy; (3) a random forest of 200 decision trees ordinally ranked the 33 statistically significant independent predictor variables by their relative importance in predicting SARS-CoV-2 symptoms; and (4) three different decision-tree algorithms separately identified and validated three immunological biomarkers and levels that nearly always differentiate asymptomatic patients: SCGF-β (> 127637), IL-16 (> 45), and M-CSF (> 57).
These findings imply a tool that can identify in advance the 20% of people who are at higher risk of morbidity from infection, and they suggest a specific stem-cell factor for therapeutics.

This section discusses what was known and unknown on this topic, and the resulting hypothesis. This introduction also puts the importance of these findings and the use of machine learning modeling into context. While it has been speculated that stem cells may play a role in resistance to SARS-CoV-2 and other zoonoses, prior research has focused on different stem cell involvement than stem-cell growth factor-beta (Yu, 2020) (Golchin, 2020) (Chrzanowski, 2020). Previous research has also established that stem cells can inhibit viral growth by expressing interferon-stimulated genes (ISGs) and have been particularly effective against influenza A H5N1 virus and resulting lung injuries (Wu, 2018) (Chan, 2016). Stem cell therapy (SCT) has been hypothesized as a treatment for SARS-CoV-2; however, until now there has been no record in the literature as to which specific factors may influence SARS-CoV-2 infections, favorably or unfavorably, or to what degree (Florindo, 2020). Elevated SCGF-β has also been associated with the specific disease states of hepatocellular cancer, Chagas' disease, cardiomyopathy, inflammation and insulin resistance, and unstable carotid plaques (Sukowati, 2018) (Wang, 2013) (Tarantino, 2020). Interleukin 16, the second most important variable in predicting SARS-CoV-2 immunity or resistance here, has been strongly associated with asthma (Mathy, 2000). Prior studies on the biomarkers associated with SARS-CoV-2 immune response and morbidity include interferon-gamma (IFN-γ), interferon-beta (IFN-β), and interleukin-8 (IL-8) (Geifman, 2020). Other previous research on immune parameters associated with SARS-CoV-2 severity and prognosis has involved interleukin one beta (IL-1β) and interleukin six (IL-6).
However, others found reduced immunoglobulin G levels in asymptomatic patients (Jesenak, 2020) (Long, 2020). The general finding in prior research regarding the pathogen-immune relationship with SARS-CoV-2 is that symptomatic patients have considerably more inflammation and cytokine storm activity than asymptomatic patients (Long, 2020). What has been unknown for SARS-CoV-2 are three questions to which this study suggests answers. One, which immunological variables are statistically significant, and how important is each in predicting asymptomatic status? Two, which of those variables, if any, have a strong negative correlation, or relationship, with disease severity (i.e., asymptomatic patients' levels are significantly higher than symptomatic patients')? Three, is there an algorithmic or formulaic model of prognostic biomarkers that can accurately predict morbidity (who will be asymptomatic if infected, and who is at risk of more severe symptoms and disease progression), and why?

This study was based on secondary data published as a supplement in Nature Medicine in June 2020. Therein, immunological factors were measured in 74 patients in the Wanzhou District of China. They were diagnosed as SARS-CoV-2 positive by reverse transcriptase-polymerase chain reaction (RT-PCR) in the 14 days before the observations were recorded. The median age of the 37 asymptomatic patients was 41 years (range 8-75 years); 22 were female and 15 were male. For comparison, 37 symptomatic RT-PCR test-positive patients were selected and matched to the asymptomatic group by age, comorbidities, and sex (Long, 2020).
In this study, five algorithms, or types, of machine learning (a kind of artificial intelligence employing robust brute-force statistical calculations) were applied to a data set of 74 observations of 33 immunological factors to attempt three things: (1) develop a model to accurately predict the classification of which patients will be asymptomatic or symptomatic to SARS-CoV-2; (2) determine the relative importance of each immunological factor; and (3) determine whether there is any level of a subset of immunological factors that can accurately predict which patients are likely to be immune or resistant to SARS-CoV-2. Minitab 19 (version 19.2020.1, Minitab LLC) was used to calculate means, 95% confidence intervals, P-values, and two-sample T-tests of statistical significance. Correlation coefficients were also computed in Minitab via Spearman rho because the data were not normally distributed. A second classification and regression tree (CART) algorithm was also applied in Minitab to cross-validate the decision tree results from Rattle in R. Minitab's CART methodology was initially described by Stanford University and University of California, Berkeley researchers in 1984 (Breiman, 1984). The Rattle library (version 5.3.0, Togaware) in the statistical programming language R (version 3.6.3, CRAN) was used to apply five machine learning algorithms (a decision tree, extreme gradient boosting (XGBoost), a linear logistic model (LLM), a random forest, and a support vector machine (SVM)) to learn which model, if any, could predict asymptomatic status, how accurately, and how. Rattle randomly partitioned the data to select and train on 80% (n=59), validate on 10% (7), and test on 10% (7) of observations. Two evaluation methods were used: (1) plots of linear fits of the predicted versus observed categorization; and (2) a pseudo-R² measure calculated as the square of the correlation between the predicted and observed values.
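The two correlation-based measures described above can be illustrated with a minimal sketch. It is not the author's Minitab or Rattle workflow: it is plain Python, the function names are hypothetical, the toy numbers are invented, and it assumes the pseudo-R² is the squared Pearson correlation between predicted and observed values.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


def pseudo_r2(predicted, observed):
    """Assumed definition: squared correlation of predicted vs. observed."""
    return pearson_r(predicted, observed) ** 2


# Invented toy data, not study data: a marker that is higher in
# asymptomatic patients (coded 0) than in symptomatic patients (coded 1).
marker = [5.0, 3.0, 4.0, 1.0, 2.0, 0.5]
symptomatic = [0, 0, 0, 1, 1, 1]
print(spearman_rho(marker, symptomatic))  # negative for this toy marker
print(pseudo_r2([0.1, 0.2, 0.9, 0.8], [0, 0, 1, 1]))
```

A negative rho in this coding corresponds to a factor whose levels are higher in asymptomatic patients, which is the sign convention the results use for SCGF-β.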
Pseudo-R² results were evaluated twice, each time on held-back data randomly selected during partitioning, and the two accuracy findings were averaged for the final results. Rattle's rpart decision tree was also used to identify whether any levels of one or more immunological factors could accurately indicate that someone was asymptomatic (i.e., via rules). The decision tree results reported here used 20 and 12 as the minimum number of observations necessary in a node before a split (i.e., minimum split). The trees used 7 and 4 as the minimum number of observations in a leaf node (i.e., minimum bucket). The random forest analysis in Rattle began by running a series of differently sized random forest algorithms, ranging from 50 to 500 decision trees, to learn the optimum number of trees to minimize error. Each random forest consisted of a minimum of six variables, which was closest to the square root of the number of statistically significant variables, 33. The lowest error rate occurred at approximately 200 decision trees, which was the size applied, using four variables at a time, which was the closest whole number to the square root of the number of predictors. The five machine learning models and CART classification trees were run both including and excluding SCGF-β to identify whether there were alternative prognostic biomarkers and levels in the immune profile that could accurately classify and predict SARS-CoV-2 immunity.

medRxiv preprint doi: https://doi.org/10.1101/2020.07.27.20162867; this version posted July 29, 2020. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Thirty-three (33) of the 55 immunological factors (60%) were indicated as statistically significant by P-values less than .05 from a Spearman correlation. Conversely, 40% of the 55 immune factors had no statistically significant role in mediating whether a patient was asymptomatic or symptomatic to SARS-CoV-2. The 22 factors positively correlated with being symptomatic ranged from a minimum of .205 (MCP-3) to a maximum of .781 (TRAIL). The 11 factors negatively associated with being symptomatic ranged from a minimum of -.866 (SCGF-β) to a maximum of -.276 (IFNα2) (see Table 1). Researchers have recently found that symptomatic patients generally have a more robust immune response to SARS-CoV-2 infection, culminating in cytokine storms in the worst cases. Conversely, asymptomatic patients have been found to have a weaker immune response (Long, 2020). Because infections cause the immune response, of particular interest in this study were the most impactful immune-related variables that negatively correlated with symptomatic status (i.e., variables that were greater for asymptomatic patients than symptomatic patients), which are highlighted in gray in Table 1. When SCGF-β was included in the machine-learning analysis, two algorithms predicted and classified SARS-CoV-2 immunity or resistance (as indicated by being asymptomatic) with 100% accuracy: a decision tree and XGBoost. When SCGF-β was excluded, a random-forest algorithm predicted and classified SARS-CoV-2 asymptomatic and symptomatic cases with an area under the ROC curve of 94.8% (95% CI 90.17% to 100%) (see Table 2). Notably, both the rpart decision trees and the CART classification trees independently identified three prognostic biomarkers at specific levels that could classify asymptomatic and symptomatic cases with 95-100% accuracy. When SCGF-β was included, all asymptomatic cases had levels > 127656.8, while all symptomatic cases had levels < 127656.8 (see Figure 1).
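The single-split SCGF-β rule reported above can be written as a short sketch. This is an illustration rather than the fitted rpart or CART tree: the function names and example measurements are hypothetical, the split value is the one quoted in the text, and the source states neither units nor how a value exactly equal to the split would be handled.

```python
# Split value for SCGF-beta quoted in the text (units are not stated in the source).
SCGF_BETA_SPLIT = 127656.8


def classify_by_scgf_beta(level, split=SCGF_BETA_SPLIT):
    """Apply the one-split rule: above the split is asymptomatic, below is symptomatic."""
    if level > split:
        return "asymptomatic"
    if level < split:
        return "symptomatic"
    # Assumption: the text does not cover a value exactly equal to the split.
    return "undetermined"


def rule_accuracy(levels, labels, split=SCGF_BETA_SPLIT):
    """Fraction of cases whose rule-based class matches the recorded label."""
    correct = sum(
        classify_by_scgf_beta(level, split) == label
        for level, label in zip(levels, labels)
    )
    return correct / len(labels)


# Hypothetical measurements invented for illustration; these are not study data.
example_levels = [150000.0, 131000.5, 90000.0, 127000.0]
example_labels = ["asymptomatic", "asymptomatic", "symptomatic", "symptomatic"]
print(rule_accuracy(example_levels, example_labels))
```

Writing the rule this way makes the limitation visible: a threshold learned from 74 patients separates this sample, and whether it holds on new patients is a separate question that only a larger study can answer.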
When SCGF-β was excluded, as a contingency analysis to better understand prognostic biomarker levels in other factors, IL-16 accurately classified asymptomatic cases (> 44.59) and symptomatic cases (< 44.59) in 90.4% of the cases. In the remaining 9.6% of cases where IL-16 > 44.59, all of them had M-CSF > 57.13 (see Figure 2). For the four factors with the highest positive and negative correlation coefficients, two-sample T-tests, interquartile ranges, outliers, and the statistically significant differences in levels between asymptomatic and symptomatic patients were computed, with factors ranked ordinally by their correlation coefficients (see Figure 3). A random forest analysis of the most important variables to accurately classify and predict SARS-CoV-2 patients by binary morbidity ordinally ranked the 33 statistically significant factors. Unsurprisingly, SCGF-β and IL-16, followed by GRO-α and TRAIL, respectively, were the most critical factors in predicting morbidity (see Figure 4). Finally, the results suggest that 3, 4, 9, 12, 13, 17, and RANTES are of low importance, or comparative irrelevance, in the pathogen-immune relationship and that SCGF-β, IL-16, HGF, IFNα2, LIF, CTACK, IL-1α, Eotaxin, GM-CSF, IL-1Rα, and IL-5 are critical in models to predict and classify asymptomatic or symptomatic SARS-CoV-2 cases accurately.

This work's overarching importance is the identification of immunological factors for diagnoses, treatments, and pre-clinical prophylactic immune-based approaches to SARS-CoV-2 in the first seven months of a pandemic that experts now opine will last decades (Farrar, 2020). Immunostimulant approaches are especially valuable because, unlike antivirals and vaccines, they may be given later in the course of the disease to optimize outcomes (Florindo, 2020). The primary importance of this work is the set of machine learning models that can predict with high accuracy whether someone, once infected, will be asymptomatic or symptomatic from SARS-CoV-2.
This knowledge gives clinicians new tools to identify in advance populations who appear to be at higher risk of danger from the virus. Such tools, especially once reproduced in a more extensive study, may also inform policy decisions as to who needs to shelter in place. Finally, because of the scale of this pandemic and practical constraints on how many vaccine doses can be manufactured and how quickly, such tools may become valuable in prioritizing vaccine administration to those in greatest need because they are at higher biological and immunological risk. This work's secondary importance is a description of the cytokine and chemokine profile that is associated with asymptomatic or symptomatic SARS-CoV-2 infections. It enables a better understanding of the pathogen-immune relationship. These profiles provide insights into the biological pathways critical for SARS-CoV-2 progression. As one example, stem cells secrete multiple factors that regulate immune cells and modulate them to restore tissue homeostasis. These results suggest that higher levels of SCGF-β may better control immune responses to prevent the more robust reactions universally associated so far with highly symptomatic patients and, further, prevent cytokine storms with high morbidity and mortality. A better understanding of the pathogen-immune relationship may enable researchers to prevent and treat SARS-CoV-2 more effectively with therapeutics that are currently untested and unused. This knowledge may also extend to similar zoonotic coronaviruses in the future.
The tertiary importance of this work is identifying three immune factors and precise levels that appear to be prognostic biomarkers as to whether someone, once infected with the SARS-CoV-2 virus, will be immune or resistant, as demonstrated by being asymptomatic, or not. These insights also suggest new candidates for therapeutic research focused on the relatively newly identified and ill-understood SCGF-β and its role in the immunological process. The quaternary importance of this work is further proof that machine-learning methods can accurately and quickly identify critical elements of disease dynamics that accelerate understanding and improve outcomes during pandemics.

This study has several limitations. One, it is unknown from the dataset how many days passed between exposure to the virus and immunological testing, or whether it was universally the same number of days. Two, because immune profiles are temporally sensitive, ideally, several tests would have been taken over several days, which did not occur (Janford, 2020). Three, immunological signaling and processing are multifactorial and complex. Therefore, it is unclear why SCGF-β levels are categorically high in asymptomatic patients and low in symptomatic patients, or whether they are causal to SARS-CoV-2 response. Four, combinatorial and sequential analysis of these immunological elements may be an important future research area to optimize therapeutic research outcomes. Five, at least one study in a leading journal, Lancet, found that Chinese SARS-CoV-2 case data may have been misreported by as much as 400% (Tsang, 2020). That study, and much higher case and fatality numbers in over 200 countries, have created distrust and skepticism of SARS-CoV-2-related data originating in China. Future research could ameliorate these limitations and focus on a more extensive study group to attempt to reproduce the results.
Moreover, a prospective case-control study of patients with decreased SCGF-β levels could test whether supplementation is protective against SARS-CoV-2 severity and symptoms.

One implication of these findings is that if we can predict the 80% of society who may be immune or resistant to SARS-CoV-2, or asymptomatic, it may profoundly impact public health intervention decisions as to who needs to be protected and how much. If, for example, 80% of the shelter-in-place orders and the resultant dramatic reduction in economic and social activity could have been prevented by accurately predicting who is at low risk of infection, the economic benefits alone may have been valued in the trillions of US dollars. The second implication of these findings is that elevated levels of SCGF-β, IL-16, and M-CSF, which may have a causal relationship with SARS-CoV-2 immunity or resistance, may have utility as diagnostic determinants to (a) inform public health policy decisions to prioritize and reduce shelter-in-place orders to minimize economic and social impacts; (b) advance therapeutic research; and (c) prioritize vaccine distribution to benefit those with the greatest need and risks first.

The author wishes to thank Dr. Luka Fajs for his helpful comments on drafts of this article.
This preprint was not certified by peer review.

Figure 1: CART classification tree of role of SCGF-β in predicting SARS-CoV-2 morbidity
Figure 3: Two-sample T-tests of statistical significance of difference in means of four leading prognostic biomarkers for asymptomatic or symptomatic SARS-CoV-2
Figure 4: Relative importance of immunological variables from random forest analysis in predicting SARS-CoV-2 morbidity

Classification and Regression Trees
Human mesenchymal stromal cells reduce influenza A H5N1-associated acute lung injury in vitro and in vivo
Can stem cells beat COVID-19: Advancing stem cells and extracellular vesicles toward mainstream medicine for lung injuries associated with SARS-CoV-2 infections
Coronavirus: 'Infection here for many years to come'
A consideration of publication-derived immune-related associations in Coronavirus and related lung-damaging diseases
Mesenchymal stem cell therapy for COVID-19: Present or future
Immune parameters and COVID-19 infection: associations with clinical severity and disease prognosis
Classification and regression by random forest. R News
Clinical and immunological assessment of asymptomatic SARS-CoV-2 infections
Interleukin-16 stimulates the expression and production of pro-inflammatory cytokines by human monocytes
Elevated levels of endothelial-derived microparticles and serum CXCL9 and SCGF-β are associated with unstable asymptomatic carotid plaques
Serum stem cell growth factor-beta for the prediction of therapy response in hepatocellular carcinoma
Could SCGF-beta levels be associated with inflammation markers and insulin resistance in male patients suffering from obesity-related NAFLD?
Effect of changing case definitions for COVID-19 on the epidemic curve and transmission parameters in mainland China: a modeling study
Prognostic value of circulating levels of stem cell growth factor-beta in patients with Chagas' disease and idiopathic dilated cardiomyopathy
Intrinsic immunity shapes the viral resistance of stem cells
SARS-CoV-2 infection and stem cells: Interaction and intervention