key: cord-0771548-fv282z8v authors: Ribeiro, Tatiane Fontana; Cordeiro, Gauss M.; Peña-Ramírez, Fernando A.; Guerra, Renata Rojas title: A new quantile regression for the COVID-19 mortality rates in the United States date: 2021-09-29 journal: Comp DOI: 10.1007/s40314-021-01553-z sha: 10e6e52636e7b45f5fd11273963fc2fab8ef3bca doc_id: 771548 cord_uid: fv282z8v An outbreak of coronavirus disease 2019 (COVID-19) has quickly spread worldwide from December 2019, thus characterizing a pandemic. Until August 2020, the United States of America (U.S.) accounted for almost one-fourth of the total deaths by coronavirus. In this paper, a new regression is constructed to identify the variables that affected the first-wave COVID-19 mortality rates in the U.S. states. The mortality rates in these states are computed by considering the total of deaths recorded on 30, 90, and 180 days from the 10th recorded case. The proposed regression is compared to the Kumaraswamy and unit-Weibull regressions, which are useful in modeling proportional data. It provides the best goodness-of-fit measures for the mortality rates and explains [Formula: see text] of its variability. The population density, Gini coefficient, hospital beds, and smoking rate explain the median of the COVID-19 mortality rates in these states. We believe that this article’s results reveal important points to face pandemic threats by the State Health Departments in the U.S. The rest of the paper is structured as follows. A new regression to model the mortality rates in the American states is defined in Sect. 2. Further, the estimation of the parameters, a simulation study, and some goodness-of-fit measures to check the proposed regression's adequacy are discussed. Section 3 contains some basic statistics of the data set, performs an analysis by identifying the best regression to fit the mortality rates, and provides some useful findings. Finally, in Sect. 4, some concluding remarks are addressed. This section aims to introduce a new regression that has much broader applicability in coronavirus mortality rates. This approach's particular feature is that it accommodates doublebounded variables in the unit interval with several types of asymmetry. The proposal is based on the transformation Z = 1 − e −X , where X is a BXII random variable having cumulative distribution function (cdf) and probability density function (pdf) respectively, where c > 0 and d > 0 are shape parameters. It is worth noting that Z can also be seen as a reflected transformation on W , Z = 1 − W , where W is a random variable following a unit Burr XII (UBXII) distribution pioneered by . Hence, the cdf and pdf of the reflected unit Burr XII (RUBXII) distribution can be expressed as (for z ∈ (0, 1)) and respectively. By inverting (1), the quantile function (qf) of Z is Both the UBXII and RUBXII distributions are special cases of the unit extended Weibull family; see Guerra et al. (2020) . To introducing a systematic component on a location parameter, the RUBXII distribution is re-parameterized in terms of its quantiles. Let q(τ ) = Q Z (τ ; c, d) be the τ th quantile of Z . By evaluating Equation (3) in τ and inverting for d, Although the quantiles are functions of τ , q(τ ) is denoted just as q to simplify the notation. Then, by replacing (4) in Equations (1) and (2), the cdf and pdf of the RUBXII distribution expressed in terms of a quantile-based parameterization are (for z ∈ (0, 1)) and respectively, where c > 0 is a shape parameter and the quantile order τ ∈ (0, 1) is chosen by the researcher. Henceforth, let Z ∼ RUBXII(q, c) be a random variable having density (6). In some cases, median-based regressions are preferable to the mean-based. Median is a more robust measure against the presence of atypical observations and asymmetries at data than the mean. Thus, when data present these features, it is more suitable to consider the median as a measure of location than the mean (Pumi et al. 2020 ). In the coronavirus mortality rates application of Sect. 3, we consider τ = 0.5, and therefore, q = q(0.5) is the median of Z . Figure 1 displays the RUBXII density plots with τ = 0.5, which have the following forms: U, symmetric, right-skewed, increasing, and increasing-decreasing-increasing (tilde). Thus, it is useful for modeling variables with different types of skewness and heavy tails. Moreover, it can assume shapes (as tilde-shaped) whose densities of classical regressions for modeling unit data do not accommodate. On the proposed re-parametrization, the qf of Z is It is useful to generate observation from the RUBXII distribution by the inversion method since it has a closed-form. So, if U is a random variable having a standard uniform distribution, then Z = Q Z (U ) follows the RUBXII law. Let z = (z 1 , . . . , z n ) be a vector of n independent observations of the variables Z i ∼ RUBXII(q i , c) (for i = 1, . . . , n). The new regression is proposed assuming that the parameters q i can be expressed as a function of covariates under the systematic component where g : (0, 1) → R is a strictly monotonic and twice differentiable link function, η i is the linear predictor, and ξ = (ξ 1 , . . . , ξ k ) is the parameter vector associated with the covariates The quantities q i can be obtained by inverting (8) as q i = g −1 (η i ). Several link functions can be chosen for g(·) such as the logit, probit, and complementary log-log. In applications, the logit link function is generally considered due to the useful interpretation of the regression coefficients as an odds ratio. It is defined as g( p) = log[ p/(1− p)], and it is used in all fitted regressions here. The estimation of the parameters of the RUBXII regression is done by the maximum likelihood (ML) method. Let θ = (ξ , c) be the (k + 1)-dimensional parameter vector. The log-likelihood function based on a sample of n independent observations is where q i satisfies the systematic component (8) and i (q i , c) is the logarithm of the density f Z (z i ; q i , c) given in Eq. (6). Thus, The components of the score vector U (θ), given in Appendix 1, are defined as the partial derivatives of (9) with respect to each element of the parameter vector θ . Equalizing its components to zero, U (θ) = 0, and solving the system simultaneously, the maximum likelihood estimators (MLEs)θ = (ξ ,ĉ) of θ can be found. However, the system of equations is non-linear and cannot be solved analytically. In such a way, the estimators must be obtained through numerical optimization algorithms using well-known programming languages such as the R (optim function), SAS (PROC NLMIXED), and Ox program (MaxBFGS sub-routine). Some Monte Carlo experiments are carried out to assess the performance of the MLEs on the finite sample. Consider the systematic component for q i : Four scenarios with different simulation schemes, combining various values for the parameter vector θ = (ξ 1 , ξ 2 , c) , are considered. To evaluate the performance of the MLEs, for each scenario, the samples {(z 1 , x 12 ) , . . . , (z n , x n2 )} are simulated 10,000 times with n ∈ {30, 90, 160, 300}. The occurrences of the response Z i ∼ RUBXII(q i , c) are obtained by the inversion method through the qf in Equation (7). The covariate x i2 is generated from a uniform distribution on the interval (−3, 3) (scenarios 1 and 2), and a standard normal distribution (scenarios 3 and 4). The R programming language (R Core Team 2021) is used to perform the simulation study. The percentage relative bias (RB) and root mean squared error (RMSE) of the estimates in θ are determined. Table 1 lists the results for these measures. Low RB values are noted even for small sample sizes. Considering all the scenarios and sample sizes, the RBs of the estimates of ξ 1 and ξ 2 are less than 4%, and those of c are less than 15%. On the other hand, the RMSE quickly goes to zero when n increases, thus in agreement with the asymptotic properties of the MLEs. In this section, some methods are presented to analyze whether a fitted regression is suitable for a data set. As goodness-of-fit measures of the RUBXII regression, the maximized loglikelihood value (LL), a normality test for the quantile residuals (Dunn and Smyth 1996) , generalized pseudo-R 2 (R 2 G ), and a RESET-type test are considered. The same measures are adopted to compare the proposed regression with other suitable regressions for proportional data. The quantile residuals for the RUBXII regression are where F Z (·) is the cdf of the RUBXII distribution given in Eq. (5) and −1 (·) is the qf of the standard normal distribution. If the fit is adequate, it is expected that the distribution of the quantile residuals is close to the standard normal. To check whether this assumption is satisfied, the well-known Shapiro-Wilk (SW) normality test can be performed. The R 2 G is useful to assess the proportion of the response variable's variation explained by the regression. It is defined by Nagelkerke (1991) as where (θ 0 ) is the log-likelihood for the null model, i.e., modeling the response without covariates, and (θ ) is the log-likelihood of the fitted regression. A regression with a higher value of R 2 G provides a larger explanation power of the response variable's variation. A RESET-type test introduced by Pereira and Cribari-Neto (2014) can be adopted to detect possible specification errors in the regression. The null hypothesis of this test is that the regression is correctly specified. It may be conducted in the following way: (i) fit the regression and obtain the fitted valuesq = (q 1 , . . . ,q n ) of q = (q 1 , . . . , q n ) using (8); (ii) compute powers of second and third degrees ofq, i.e., getq 2 = (q 2 1 , . . . ,q 2 n ) and q 3 = (q 3 1 , . . . ,q 3 n ) ; and (iii) using these powers as additional covariates, fit the augmented regression, and test their significance through the likelihood ratio (LR) test. The (θ ) and (θ ) are the unrestricted and restricted maximized log-likelihood functions, respectively. Under the null hypothesis, ω converges in distribution to a chi-squared with ν degree of freedom, that is, ω where ν is the number of added test variables (ν = 2 in this case). In the first eight months of the coronavirus advance since its inception, on August 19, 2020 in the U.S., the Disease Control and Prevention (CDC) reported a total of 5,650,176 confirmed cases and 175,789 deaths, putting the disease with 3.1% lethality. Also, the adoption of systematic non-pharmaceutical interventions seems to have decreased mortality. Thus, understanding the relationship between demographic, socioeconomic, health care resources, and behavioral variables with the mortality rate became a crucial task. In this sense, this section presents the RUBXII regression's application, concurrently with two other well-known regression models, by associating the mortality rate with these possible predictor variables. The amount of information available on the disease is as abundant as it is scattered and unreliable. Therefore, before the analysis, data mining is built to construct the database described at the beginning of the section. The regression models chosen in this study consider an essential characteristic of the mortality rate: it belongs to the interval (0, 1). The response variable is the COVID-19 deaths rate in the U.S. states. This rate is calculated in the 50 states from data available by the CDC (Centers for Disease Control and Prevention 2020). For all states, it is considered the total of deaths per hundred people on 30, 90, and 180 days after the 10th detected case, to ensure that the comparisons are made to the same period. In this way, a panel with three observations for each state is structured. For all states, the population density, Gini coefficient, hospital beds, smoking rate, poverty rate, and life expectancy, are obtained from the following sources: after 90 days of the 10th confirmed case, and zero otherwise. 9. T 180 : dummy that is equal to one if the response observation corresponds to mortality rate after 180 days of the 10th confirmed case, and zero otherwise. Table 2 gives some descriptive measures of these variables. The MR has a high coefficient of variation (CV) for all current time periods, being the most at 30 days with a CV of about 126%. Also, in the three time periods (30, 90, and 180 days), the response presents positive skewness, the mean is not close to the median, and at 30 and 90 days its kurtosis is greater than three indicating that it has a leptokurtic distribution. The GINI, and LE covariates have the lowest variabilities with CV ranging between about 2% and 4%. On the other hand, the PD covariate has the most CV about at 130% and takes values on a sizeable range since the minimum and maximum are around 1p/mi 2 (referring to the Alaska state) and 1, 215p/mi 2 , respectively. The BEDS, SR, and PR covariates have close CVs varying from around 21% to 28%. Moreover, they have a mean close to the median, and kurtosis lower than three. Only the LE covariate has negative skewness. Figure 2 displays the histogram of the MR and box plots from three panel's observations, i.e., MR for 30, 90, and 180 days. The histogram and the three box plots agree to those figures in Table 2 . The MR on 30, 90, and 180 days have skewed-right distribution, and it presents some outliers. Clearly, after 90, and 180 days of the 10th recorded case, the mortality rate has increased substantially according to the box plots. Initially, we present some dispersion plots of the response variable against each covariate; see test and a non-parametric analysis. This test's null hypothesis (H 0 ) is that the populational correlation coefficient between two variables is equal to zero, i.e., there is no statistically significant correlation. Under H 0 , the computed test statistic converges in distribution to a Student's t distribution with (n − 2) degrees of freedom, where n is the sample size. The p-values of the test are given in Table 3 . In a first analysis, note that the response variable is positively correlated to PD, presenting the most correlation value with the MR regards to the other covariates (see Figure 3) . Moreover, this correlation is significant; see Table 3 . Hence, the MR increases as PD. Indeed, Rocklöv and Sjödin (2020) , the contact rate by COVID-19 is proportional to population density. Observe also that there is a statistically significant correlation between the MR and the Gini coefficient (Table 3) . A similar finding was found in Oronce et al. (2020) . In what follows it is explored more deeply the relationship between covariates and the MR through regression analysis. The goodness-of-fit measures are investigated for the RUBXII regression defined in Sect. 2 with two competitive systematic components to study the effects of the covariates given in Sect. 3.1 on the median of the mortality rate by coronavirus in the U.S. states. The well-known Kw regression (Mitnik and Baek 2013) and the UW quantile regression (Mazucheli et al. 2020 ) are considered for comparison purposes. The densities of each competitive regression's random component are given below. Let Z be a random variable that follows a Kw distribution on median-dispersion parameterization (Mitnik and Baek 2013) , say Z ∼ Kw(q, c). Then its pdf is (for z ∈ (0, 1)) where 0 < q < 1 is the median of Z and c > 0 is a dispersion parameter. Recently, Mazucheli et al. (2020) proposed the UW quantile regression. Let Z ∼ UW(q, c) be a random variable having the UW law. Then its pdf is (for z ∈ (0, 1)) where 0 < q < 1 is the τ th quantile, c is a shape parameter, and τ ∈ (0, 1) is assumed known. Here, it will be considered that τ = 0.5 to model the median of Z . Table 4 gives the estimates of the parameters and associated p values of the final fitted RUBXII, Kw, and UW regressions to the coronavirus death rates across the U.S. states. The significance of the estimates is adopted as a criterion to choose the variables in the final fits. The PR and LE covariates were not significant to the usual significance level (1%, 5%, and 10%) at all considered regressions. According to Table 4 , when RUBXII regression is fitted, most of the covariates are significant at a significance level of 1%, except for the BEDS, which is significant at 10%. Other fitted regressions do not capture the effect of the covariate BEDS. Besides, the covariate SR is also not statistically significant in the fitted UW regression. The goodness-of-fit measures of the fitted regressions given in Table 4 are reported in Table 5 . The RUBXII regression has the best adequacy measures. It presents the most LL value and p-value of SW test upper to the usual nominal level of significance. Further, its R 2 G is the greatest, indicating that the fitted RUBXII regression explains 76.57% of the median response variability. The p-value of the SW test for the Kw and UW regressions' residuals are lower than 0.05. Hence, we reject the null hypothesis that the residual distribution is normal at a significance level of 5%. Therefore, these regressions are inadequate to the current data. The p-value of the RESET-type (RES) tests indicate that all fitted regressions are specified correctly at usual significance levels. Thus, the results from Table 5 favor the RUBXII more clearly than those Kw and UW regressions by showing its superiority in terms of model fit and significance of the BEDS covariate to the mortality rates by COVID-19 in the U.S. states. Table 5 by indicating that the RUBXII regression's residuals are more close to a normal distribution since the data points are closely following the straight red line. For the other regressions, mainly the Q-Q plot from the UW regression's residuals, it is possible to note a lack-of-fit of them to the standard normal distribution. After the above analysis, there is evidence that the RUBXII regression provides a better fit quality. Therefore, from the estimates of the RUBXII regression parameters reported in Table 4 , its regression equation can be expressed as Based on the fitted RUBXII regression, some findings of the modeling mortality rate's median by COVID-19 in the U.S. states are now presented. -The PD presents a p-value lower than 0.0001, and its associated estimate is positive, which indicates that the MR is higher in states most densely populated. Similarly, Wong and Li (2020) showed that population density is an effective predictor of cumulative infection cases in the U.S. at the county level. According to this study, low population density offers a strong protective effect against COVID-19 infection. -The Gini coefficient is significant at the 1% level, and its positive estimate means that the MR increases in states with a larger Gini coefficient. This finding corroborates with the study of Oronce et al. (2020) , who noted that states with higher income inequality had experienced a higher number of deaths by COVID-19. -The number of hospital beds is significant at the 10% level. The mortality rate's median decreases when the total hospital beds per 100 thousand inhabitants increase as expected. According to Janke et al. (2021) U.S. geographic areas with fewer intensive care unit beds, nurses, and general medicine/surgical beds per COVID-19 case were statistically significantly associated with greater deaths in April. -The SR is mightily significant ( p-value = 0.0003). The mortality rate's median increases as the SR grows according to the positive signal of its related estimate. This result is expected since the immune response of smoking patients decreases potentially (Taghizadeh-Hesary and Akbari 2020). -The dummy variables related to the time 90 and 180 days after the 10th confirmed case are significant as expected. As indicated by the box plots in Fig. 2 , the MR grows steadily during the considered periods. The COVID-19 characterizes a pandemic that has been spread across the United States of America (U.S.) since January 2020. This paper investigates how demographic, socioeconomic, health care resources, and behavioral variables are related to the mortality rate by COVID-19 in the U.S. states. To properly reach that aim, it is chosen regressions that consider the double-bounded characteristic of the mortality rate. It is introduced an alternative model called the reflected unit Burr XII (RUBXII) regression, which is a helpful tool for modeling bounded random variables in the interval (0, 1), such as rates, proportions, and indexes. This proposal is based on a new unit continuous distribution that arises from a transformation on a random variable Burr XII distributed. Further, a more general and useful quantile-parameterization is introduced to define the quantile regression for unit data. The estimation of the parameters, a simulation study to evaluate the maximum likelihood estimators' performance and some adequacy measures to check whether the regression's assumptions hold are discussed. After consolidating the data set about the mortality rates and other covariates for the U.S. states, a descriptive statistical analysis and regression modeling are done. In this way, the new regression is compared with the Kumaraswamy and unit-Weibull regressions. The proposed regression is quite competitive compared with those regressions and provides the best fit according to some selection criteria. Thus, from the fitted RUBXII regression, it is possible to identify that the population density, Gini coefficient, hospital beds, and smoking rate are statistically significant in modeling the mortality rate's median by COVID-19 in the U.S. states. This paper's findings may improve understanding of coronavirus in the U.S. and help healthcare system better prepare for the advance of the pandemic or even respond to similar epidemics. Interested readers can access all computational codes at https:// github.com/tatianefribeiro/RUBXII_Regression_COVID-19/tree/master. Since the RUBXII regression's potentiality to analyze coronavirus data, it is aimed in future research to fit this regression to the mortality rates by coronavirus in other countries of the world In this appendix, it is determined the score vector of the log-likelihood function given by Eq. (9). It is obtained from the first derivative of the log-likelihood function with respect to the k + 1 unknown parameters which compose the vector θ . That is, it is defined as with j = 1, . . . , k. To simplify the notation, the following quantities are considered: where s(x) = log [ log (1 − x) −1 ]. Observe that Hence, the score vector's components can be written compactly in matrix notation as U ξ (θ ) = X T a and U c (θ ) = b 1, where X is an n × k covariates matrix, whose ith row is x i = x i = (x i1 , . . . , x ik ), T = diag{1/g (q 1 ), . . . , 1/g (q n )}, a = (a 1 , . . . , a n ) , b = (b 1 , . . . , b n ) , and 1 is an n-dimensional vector of 1s. Early evidence on social distancing in response to COVID-19 in the United States Correlation between climate indicators and COVID-19 pandemic United States COVID-19 Cases and Deaths by State over Time Report of clustering pneumonia of unknown etiology in Wuhan City County Health Rankings & Roadmaps The impact of non-pharmaceutical interventions, demographic, social, and climatic factors on the initial growth rate of COVID-19: A cross-country study Randomized quantile residuals Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China The unit extended Weibull families of distributions and its applications A new unit distribution based on the unbounded Johnson distribution rule: the unit Johnson SU distribution Clinical features of patients infected with 2019 novel coronavirus in Wuhan Analysis of Hospital Resource Availability and COVID-19 Mortality Across the United States Hospital Beds per 1,000 Population by Ownership Type A new heavy-tailed distribution defined on the bounded interval: the logit slash distribution and its application The unit generalized half normal distribution: a new bounded distribution with inference and application On the unit Burr-XII distribution with the quantile regression modeling and applications On the arcsecant hyperbolic normal distribution. Properties, quantile regression modeling and applications The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates The Kumaraswamy distribution: median-dispersion re-parameterizations for regression modeling and simulation-based estimation GIS-based spatial modeling of COVID-19 incidence rate in the continental United States A novel coronavirus emerging in China-key questions for impact assessment A note on a general definition of the coefficient of determination Association between state-level income inequality and COVID-19 cases and mortality in the USA Detecting model misspecification in inflated beta regressions Kumaraswamy regression model with Aranda-Ordaz link function R: a language and environment for statistical computing. R Foundation for Statistical Computing Brain imaging use and findings in COVID-19: a single academic center experience in the epicenter of disease in the United States High population densities catalyse the spread of covid-19 The powerful immune system against powerful COVID-19: a hypothesis Spreading of COVID-19: density matters US States by gini coefficient Poverty Rate by State 2021 Smoking Rates by State 2021 World Population Review (2020c) US States -Ranked by Population Spatial disparities in coronavirus incidence and mortality in the United States: an ecological analysis as of Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations The authors gratefully acknowledge the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) by financial support. The authors declare that they have no conflict of interest.