key: cord-0758670-chzrwqv7
authors: Hittner, James B.; Fasina, Folorunso O.; Hoogesteijn, Almira L.; Piccinini, Renata; Maciorowski, Dawid; Kempaiah, Prakasha; Smith, Stephen D.; Rivas, Ariel L.
title: Testing-Related and Geo-Demographic Indicators Strongly Predict COVID-19 Deaths in the United States during March of 2020
date: 2021-10-01
journal: Biomed Environ Sci
DOI: 10.3967/bes2021.102
sha: c9b2082af0db5e79fbb4a5ba680b0d88f40c670f
doc_id: 758670
cord_uid: chzrwqv7

The COVID-19 pandemic has wreaked havoc around the globe and caused significant disruptions across multiple domains [1]. Moreover, different countries have been differentially impacted by COVID-19, a phenomenon that is due to a multitude of complex and often interacting determinants [2]. Understanding such complexity and interacting factors requires both compelling theory and appropriate data analytic techniques. Regarding data analysis, one question that arises is how to analyze extremely non-normal data, such as variables with L-shaped distributions. A second question concerns the appropriate selection of a predictive modelling technique when the predictors derive from multiple domains (e.g., testing-related variables, population density) and both main effects and interactions are examined. To address these questions, we propose a novel statistical approach for analyzing and understanding complex data interactions.

Using data collected in the USA during the first month in which COVID-19 testing was performed (March of 2020; Supplementary Table S1, available at www.besjournal.com), we examined the following six predictors of COVID-19 related deaths: (i) the proportion of all tests conducted during the first week of testing; (ii) the cumulative number of (test-positive) cases through March 31, 2020; (iii) the number of tests performed/million inhabitants; (iv) the cumulative number of inhabitants tested; (v) the number of cases/million inhabitants (cases/mill inh); and (vi) the number of diagnostic tests performed in week one of testing/million inhabitants/state-specific population density (w1DT/MI/PD), where "population density" is defined as the number of inhabitants per square kilometer.

The purpose of this study was to examine the ability of the six variables to predict COVID-19 related deaths in the United States during March of 2020. We ran the predictive model twice, once for each dependent variable: mortality count (overall number of deaths) and deaths per million inhabitants. Because our model (a) uses predictors that leverage information from multiple domains, (b) captures both nationwide and state-specific dimensions, and (c) examines two different mortality-related outcomes, the results are expected to have relevance for policy-makers.

All data used in this study were obtained from three sources in the public domain: Worldometer (https://www.worldometers.info/coronavirus/), World Population Review (https://worldpopulationreview.com/states), and Covidtracking (https://covidtracking.com/). The data were processed and analyzed using IBM SPSS, Minitab, and R. Univariate skewness and kurtosis values indicated that all predictors and outcomes were non-normally distributed, with a few variables showing L-shaped distributions. The L-shaped variables were normalized using the rank-based inverse normal (RIN) transformation [3]. For extremely non-normal data, the RIN method is a highly effective normalizing transformation [3].
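The RIN transformation mentioned above can be illustrated with a minimal Python sketch. It is not the authors' code (the study used IBM SPSS, Minitab, and R). It assumes the "rankit" offset c = 0.5, which the text does not specify; other offsets (e.g., c = 3/8) are also in common use. The function names `average_ranks` and `rin_transform` are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rin_transform(values, c=0.5):
    """Rank-based inverse normal transform.

    Each rank r is mapped to the standard-normal quantile of
    (r - c) / (n - 2c + 1). The offset c = 0.5 is an assumption
    made for this sketch, not a detail taken from the study.
    """
    n = len(values)
    std_normal = NormalDist()
    return [
        std_normal.inv_cdf((r - c) / (n - 2 * c + 1))
        for r in average_ranks(values)
    ]


# Made-up, strongly right-skewed toy data (not taken from the study).
toy_counts = [0, 1, 1, 2, 3, 5, 8, 40, 250, 1200]
print(rin_transform(toy_counts))
```

Because only the ranks enter the transform, the extreme values in the toy data end up no further from the center than the corresponding normal quantiles, which is why the method suits L-shaped variables.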
The prediction models were first examined using linear multiple regression, with the RIN-transformed versions of all variables used in the regressions. Because the homoscedasticity assumption (i.e., constant variance of the residuals across the predicted Y-values) was not met, we re-ran the prediction models using a non-parametric approach known as Kernel Regularized Least Squares (KRLS) regression [4]. KRLS is an appropriate method to use when the assumptions of linear regression are not met and the precise functional forms relating the predictors to the outcomes are unknown. All KRLS regressions used the RIN-transformed variables, and all analyses were performed using the KRLS package for R. The use of non-parametric, machine learning-based methods such as KRLS is consistent with recent calls to place greater reliance on artificial intelligence systems for understanding the causes and consequences of the COVID-19 pandemic [5].

The KRLS regression results are presented in Table 1. For number of deaths, the six predictors accounted for 98.8% of the variance. Five of the predictors were statistically significant (P-values ≤ 0.002). Two of the significant predictors (i.e., number of test-positive cases, Cohen's d = 2.3; and cases per million inhabitants, Cohen's d = 1.3) represent different ways of quantifying the illness burden due to SARS-CoV-2 infection. The ratio of the two d values (2.3/1.3 ≈ 1.77) indicated that the predictive strength of number of test-positive cases was 77% greater than that of cases per million inhabitants. Regarding the second dependent variable, the six predictors accounted for 92.6% of the variance in deaths per million inhabitants. Five of the predictors were significant (P-values ≤ 0.03). For this regression analysis, the number of test-positive cases (d = 1.1) and cases per million inhabitants (d = 1.4) were similar in predictive strength.
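To make the KRLS idea more concrete, the following pure-Python sketch fits a regularized least-squares model with a Gaussian kernel. It is an illustration only and does not reproduce the KRLS package for R used in the study. The function names are hypothetical, the penalty `lam` is a fixed, hand-picked value rather than a tuned one, the kernel bandwidth defaults to the number of predictors (an assumption of this sketch), and the standardization of variables is omitted.

```python
import math


def gaussian_kernel_matrix(rows_a, rows_b, sigma2):
    """Kernel matrix with entries exp(-||a - b||^2 / sigma2)."""
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(ra, rb)) / sigma2)
            for rb in rows_b
        ]
        for ra in rows_a
    ]


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def krls_fit(X, y, lam=0.1, sigma2=None):
    """Estimate coefficients c = (K + lam * I)^(-1) y.

    lam and the default sigma2 (number of predictors) are assumptions
    of this sketch; they are not values reported in the study.
    """
    if sigma2 is None:
        sigma2 = float(len(X[0]))
    K = gaussian_kernel_matrix(X, X, sigma2)
    for i in range(len(K)):
        K[i][i] += lam  # ridge-type penalty on the diagonal
    return {"X": X, "coefs": solve_linear_system(K, y), "sigma2": sigma2}


def krls_predict(model, X_new):
    """Predict as a kernel-weighted sum over the training points."""
    K_new = gaussian_kernel_matrix(X_new, model["X"], model["sigma2"])
    return [sum(k * c for k, c in zip(row, model["coefs"])) for row in K_new]


# Made-up toy data with a non-linear (quadratic) relationship.
X_toy = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
y_toy = [4.0, 1.0, 0.0, 1.0, 4.0]
model = krls_fit(X_toy, y_toy, lam=0.1)
print(krls_predict(model, [[0.5], [1.5]]))
```

No functional form is specified in advance: the fitted surface is a weighted sum of kernels centered on the observations, and the penalty controls how closely it follows the data.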
In addition to number of test-positive cases and cases per million inhabitants, another interesting predictor was our geo-demographic variable (i.e., the number of diagnostic tests performed in week one of testing/million inhabitants/population density, or w1DT/MI/PD). This predictor was significantly associated with both dependent variables. Because w1DT/MI/PD is a complex, ratio-based predictor, discerning the precise nature of its predictive association from a single regression coefficient is difficult. As the lowess curve in the top panel of Figure 1 indicates, at higher and medium levels of w1DT/MI/PD, the association between the geo-demographic predictor and death count was strongly negative and moderately negative, respectively. In contrast, at lower levels of w1DT/MI/PD, there was little if any association between the geo-demographic variable and number of fatalities. The bottom panel of Figure 1 indicates that at lower levels of w1DT/MI/PD, the association between the geo-demographic variable and deaths per million inhabitants was moderately positive. At medium levels of w1DT/MI/PD, there was little if any association between the two variables. Finally, at higher levels of w1DT/MI/PD, there was a moderately strong negative association between the geo-demographic variable and deaths per million inhabitants.

In constructing our geo-demographic predictor variable, we controlled for population density because it is an important factor associated with disease transmission [6]. Moreover, because there typically is a lag time of several weeks or more between being infected with SARS-CoV-2 and showing disease-related symptoms, the association between population density and disease-related deaths should strengthen over time. To highlight this point, Figure 2 presents scatterplots showing the Pearson correlations between population density and cumulative COVID-19 related deaths per million inhabitants through March 31st and June 17th, 2020, respectively.
The correlations were as follows: March 31st (r = 0.228, P > 0.05); June 17th (r = 0.800, P < 0.01). The difference between the two statistically dependent correlations was evaluated using Hittner, May and Silver's modification of Dunn and Clark's z test [7]. The two correlations were significantly different (z = 5.85, P < 0.0001), thereby supporting the prediction that the association between population density and COVID-19 related deaths will strengthen over time.

To the best of our knowledge, this is the first study that examines testing-related, case count, and geo-demographic variables as predictors of COVID-19 related deaths. Using a flexible, machine learning-based approach (KRLS regression), we found that our predictors accounted for very high percentages of outcome variance (98.8% and 92.6% for number of deaths and deaths per million inhabitants, respectively). Furthermore, with very few exceptions, our predictors were both statistically significant and practically important.

One novel contribution of this study was our examination of a complex, ratio-based geo-demographic predictor variable. This variable, the number of diagnostic tests performed in week one of testing/million inhabitants/state-specific population density (w1DT/MI/PD), significantly predicted COVID-19 related deaths, but did so differently depending on where, along the continuum of geo-demographic values, the predictive association was examined. At the lower end of the geo-demographic predictor, more tests during week one per million inhabitants, normalized by population density, were associated with more deaths per million inhabitants. In contrast, at the higher end of the geo-demographic predictor, more tests during week one per million inhabitants, normalized by population density, were associated with fewer deaths per million inhabitants. These different quantitative patterns could reflect different qualitative situations.
In the first case (lower values on the geo-demographic variable, where more tests are associated with more deaths), testing seems to pursue a confirmatory purpose. In contrast, for the second case (higher values on the geo-demographic variable, where more tests are associated with fewer deaths), diagnostic testing appears to be emphasized [8]. One implication of these findings is that when examining our geo-demographic variable as a predictor of deaths, the inflection points along the lowess curves (the positions where the slope rises and falls) can serve as approximate cut-points demarcating three types of testing: confirmatory, diagnostic, and other. When testing prioritizes symptomatic cases, most tested individuals are expected to test positive (infection will be confirmed). Because deaths will occur within a subset of infected individuals, when testing is confirmatory (when only symptomatic patients are tested), more tests will be associated with more deaths. In contrast, when asymptomatic individuals are also tested, more tests, conducted earlier, will allow clinicians to detect, treat, and isolate infections earlier and prevent further viral dissemination which, in turn, will result in fewer deaths/million inhabitants. Our findings thus support an important recommendation from the World Health Organization, which is that early and frequent testing helps to prevent deaths [9].

In addition to the contributions described above, we performed supplemental analyses examining the association between population density and COVID-19 related deaths. The role of population density in predicting epidemic dispersal and epidemic-related deaths is receiving increased research attention [10]. To the best of our knowledge, the present study is the first to demonstrate that the magnitude of association between population density and COVID-19 related deaths strengthens as the time since first infection increases.
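The comparison of the two dependent correlations (population density with deaths per million inhabitants at two time points) can be sketched as follows. The formula is the form of Hittner, May and Silver's modification as it is commonly stated for two overlapping correlations, and should be checked against [7] before use. The function name is hypothetical. The correlation between the two death measures (`r_kh`) and the sample size are not reported in the text, so the values used in the example are placeholders, and the sketch makes no attempt to reproduce the reported z = 5.85.

```python
import math
from statistics import NormalDist


def dependent_correlations_z(r_jk, r_jh, r_kh, n):
    """z test for two overlapping dependent correlations.

    r_jk, r_jh: the two correlations being compared (shared variable j).
    r_kh: the correlation between the two non-shared variables.
    n: the sample size.
    Returns (z, two-sided p-value).
    """
    z_jk = math.atanh(r_jk)  # Fisher z transformation
    z_jh = math.atanh(r_jh)
    # Back-transformed mean of the two Fisher z values.
    r_bar = math.tanh((z_jk + z_jh) / 2)
    r_bar2 = r_bar ** 2
    # Estimated covariance between the two Fisher z values.
    cov = (
        r_kh * (1 - 2 * r_bar2)
        - 0.5 * r_bar2 * (1 - 2 * r_bar2 - r_kh ** 2)
    ) / (1 - r_bar2) ** 2
    z = math.sqrt(n - 3) * (z_jk - z_jh) / math.sqrt(2 - 2 * cov)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# r_jk and r_jh are the correlations reported in the text (June 17th and
# March 31st); r_kh = 0.5 and n = 50 are placeholders, not reported values.
z_stat, p_value = dependent_correlations_z(
    r_jk=0.800, r_jh=0.228, r_kh=0.5, n=50
)
print(z_stat, p_value)
```

A test for dependent correlations is needed here because both correlations involve the same states and the same population density variable, so they cannot be treated as independent estimates.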
Understanding how factors such as testing frequency, the relative proportion of confirmatory versus diagnostic testing, and socio-demographic composition influence the temporal association between population density and COVID-19 related deaths is an important priority for future research.

Overall, our findings highlight the importance of considering predictor variables from multiple domains. When ratio-based predictors such as our geo-demographic variable are analyzed, we recommend examining lowess curves as a visual interpretational aid for explicating the (often) complex non-linear associations between such ratio-based predictors and various outcomes of interest. An important direction for future research on epidemic dissemination and potential control is to examine both ratio-based composite variables (such as our geo-demographic measure) and traditional multiplicative interaction terms (created as linear products of two or more variables). The joint examination of both types of complex variables might result in greater predictive power and/or might foster additional insights into the dynamics of infectious diseases, such as COVID-19.

[1] Effects of COVID-19 pandemic in daily life.
[2] Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science.
[3] Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches.
[4] Kernel Regularized Least Squares: reducing misspecification bias with a flexible and interpretable machine learning approach.
[5] Artificial intelligence (AI) applications for COVID-19 pandemic.
[6] Connecting network properties of rapidly disseminating epizoonotics.
[7] A Monte Carlo evaluation of tests for comparing dependent correlations.
[8] Why only test symptomatic patients? Consider random screening for COVID-19. Appl Health Econ Health Policy.
[9] Global research and innovation forum: towards a research roadmap.
[10] High population densities catalyse the spread of COVID-19.

Biographical note of the first author: James B. Hittner, male, born in 1965, PhD Degree, majoring in clinical and applied psychology, risky behavior, statistical methodology, and infectious disease dynamics.

Received: August 10, 2020; Accepted: January 18, 2021