key: cord-0942104-4cz3nc4b
authors: Xiong, D.; Zhang, L.; Watson, G. L.; Sundin, P.; Bufford, T.; Zoller, J. A.; Shamshoian, J.; Suchard, M. A.; Ramirez, C. M.
title: Pseudo-Likelihood Based Logistic Regression for Estimating COVID-19 Infection and Case Fatality Rates by Gender, Race, and Age in California
date: 2020-07-01
journal: nan
DOI: 10.1101/2020.06.29.20141978
sha: 8e1f5fe0389d622326ad7422cf7ecbfdfd6c8f79
doc_id: 942104
cord_uid: 4cz3nc4b
In emerging epidemics, early estimates of key epidemiological characteristics of the disease are critical for guiding public policy. In particular, identifying high risk population subgroups aids policymakers and health officials in combatting the epidemic. This has been challenging during the coronavirus disease 2019 (COVID-19) pandemic, because governmental agencies typically release aggregate COVID-19 data as marginal summary statistics of patient demographics. These data may identify disparities in COVID-19 outcomes between broad population subgroups, but do not provide comparisons between more granular population subgroups defined by combinations of multiple demographics. We introduce a method that overcomes the limitations of aggregated summary statistics and yields estimates of COVID-19 infection and case fatality rates (key quantities for guiding public policy related to the control and prevention of COVID-19) for population subgroups across combinations of demographic characteristics. Our approach uses pseudo-likelihood based logistic regression to combine aggregate COVID-19 case and fatality data with population-level demographic survey data to estimate infection and case fatality rates for population subgroups across combinations of demographic characteristics. We illustrate our method on California COVID-19 data to estimate test-based infection and case fatality rates for population subgroups defined by gender, age, and race and ethnicity.
Our analysis indicates that in California, males have higher test-based infection rates and test-based case fatality rates across age and race/ethnicity groups, with the gender gap widening with increasing age. Although elderly individuals infected with COVID-19 are at an elevated risk of mortality, the test-based infection rates do not increase monotonically with age. LatinX and African Americans have higher test-based infection rates than other race/ethnicity groups. The five subgroups with the highest test-based case fatality rates are African American male, Multi-race male, Asian male, African American female, and American Indian or Alaska Native male, indicating that African Americans are an especially vulnerable California subpopulation.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread from its zoonotic origins in Hubei Province, China, causing a global pandemic of coronavirus disease 2019 (COVID-19) [1, 2]. As of June 12, 2020, COVID-19 has infected over 7 million people across 188 countries and regions [3]. In the early stages of an emerging epidemic such as COVID-19, estimating the infection rate (IR) and case fatality rate (CFR) of the infectious disease is of utmost importance to health officials, policy makers, and the population at large. Accurate population and subgroup estimates of CFRs provide an evidence-based rationale for policies designed to mitigate the spread of the infectious disease, help identify disparities in disease vulnerability, and inform resource allocation to communities in greatest need. Official COVID-19 data released by governmental health agencies and other public sources are prohibited by U.S. law from containing personally identifiable information. Consequently, these data are generally summarized in an aggregate format that comprises only marginal or limited bivariate summary statistics of patient demographics, providing valuable but limited information on the heterogeneity of patient attributes.
Indeed, in New York City, the epicenter of the COVID-19 outbreak in the U.S., the reported infection rates and case fatality rates for African Americans were disproportionately higher than those for other races, according to data released by the New York City Department of Health and Mental Hygiene [4]. Data from several other U.S. states, including New Jersey [5], California [6], and Illinois [7], exhibited similar trends. Gender and age-disaggregated national case data from a vast array of countries across the globe reveal that males and older individuals generally have substantially higher case fatality rates. Furthermore, evidence from numerous clinical studies of COVID-19 risk factors has established that gender and age are risk factors for COVID-19 infection mortality [8, 9, 10, 11]. However, because they are aggregated, data from governmental health agencies or other public sources do not provide granular information on the combined effect of the risk factors under consideration. In particular, how IRs and CFRs vary across population subgroups characterized by gender, age, and race jointly has not yet received substantial attention. Understanding the gender-age-race dynamics of COVID-19 infection and mortality would provide deeper insights into the disparities that exist in the effects of COVID-19 on the population.
Various methods for using information contained in aggregate data have been proposed in a wide array of applications [see, e.g., 12, 13, 14], and there is growing interest in leveraging marginal summary statistics in publicly released COVID-19 datasets to quantify the impact of various risk factors on COVID-19 mortality [15].
In this paper, we propose a method that helps overcome [...] and fatalities to obtain early estimates of COVID-19 IRs and CFRs for population subgroups defined by combinations of risk factors. A major difference between the prevalent approaches to analyzing marginalized data and our proposed method is that we incorporate multivariate population demographic data, which provides estimates of the joint probability distribution of risk factors for the disease. Specifically, we propose a pseudo-likelihood based multivariable logistic regression approach that combines publicly released aggregate COVID-19 case and fatality data with multivariate population-level demographic survey data. The proposed method is composed of two main steps. First, we model [...]
medRxiv preprint, this version posted July 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
We combined the last two race and ethnicity groups due to their small size in the California population. Race and ethnicity are missing in almost 29% of confirmed cases and 1% of the deaths. The California Department of Public Health (CDPH) data also provide the number of COVID-19 cases and fatalities by race and age jointly [6]; California was the only U.S. state doing so at the time of writing. To supplement the CDPH COVID-19 data, we used demographic data on [...] demographic and health variables. The California Health Interview Survey (CHIS) oversamples certain population subgroups to achieve more reliable and precise estimates for these subgroups, and estimates a sampling weight for each respondent that represents the reciprocal of that respondent's probability of selection.
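To illustrate how sampling weights of this kind can be turned into an estimate of the joint gender-age-race distribution, the following is a minimal sketch. The records, the weights, and the function name `weighted_joint_pmf` are hypothetical and are not taken from CHIS; the sketch only shows the weighted-proportion idea, assuming each weight is the reciprocal of a selection probability.

```python
from collections import defaultdict

# Hypothetical survey records: (gender, age group, race/ethnicity, sampling weight).
# These values are made up for illustration and are not CHIS data.
records = [
    ("F", "18-34", "LatinX", 1200.0),
    ("M", "18-34", "LatinX", 800.0),
    ("F", "65-69", "White", 500.0),
    ("M", "65-69", "Asian", 1500.0),
    ("F", "18-34", "LatinX", 1000.0),
]


def weighted_joint_pmf(rows):
    """Estimate the joint pmf of the demographic cells from weighted records.

    Each respondent contributes their sampling weight (the number of people
    they are taken to represent) to their cell; cell totals are then
    normalized so that the estimated pmf sums to one.
    """
    totals = defaultdict(float)
    for *cell, weight in rows:
        totals[tuple(cell)] += weight
    grand_total = sum(totals.values())
    return {cell: w / grand_total for cell, w in totals.items()}


pmf = weighted_joint_pmf(records)
for cell, prob in sorted(pmf.items()):
    print(cell, round(prob, 4))
```

An unweighted tabulation would instead treat every respondent equally, which would over-represent any oversampled subgroup in the estimated distribution.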
We use the 2017-2018 wave of CHIS in our analysis, which comprises 45,369 interviewed subjects, and focus on three recorded demographic variables: gender, age, and race and ethnicity.
We propose estimating COVID-19 T-IRs given gender, age, and race using a multivariable logistic regression model. The variables we use in our analysis are listed in Table 2. Letting age 0-17, female, and LatinX be the reference categories, we let $z \in \{0, 1\}^p \subset \mathbb{R}^p$ denote the gender-age-race covariate setting of the covariates $Z$ in Table 2, where $p = 15$. The postulated IR model follows
$$\operatorname{logit} P(I = 1 \mid Z = z) = \gamma_0 + \gamma^\top z, \qquad (1)$$
or, equivalently,
$$P(I = 1 \mid Z = z) = \frac{\exp(\gamma_0 + \gamma^\top z)}{1 + \exp(\gamma_0 + \gamma^\top z)}, \qquad (2)$$
where $I \in \{0, 1\}$ represents infection status, $\gamma_0$ is the log odds of infection for the LatinX female age 0-17 group, and $\gamma \in \mathbb{R}^p$ are the log odds ratios of infection for other covariate settings.
We introduce $P_X(x)$ as the probability mass function of a $p^*$-dimensional discrete random variable $X$ with support $\mathcal{X} \subset \{0, 1\}^{p^*}$, $p^* > p$, that represents the proportion of the California population with gender-age-race attributes $x$, which is simply an augmentation of the covariate setting $z$ in (1) to include the reference levels listed in Table 2. In other words, there exists a bijection $z = z(x)$ from the $X$-space to the $Z$-space. We then define the conditional probability mass function of $X_{-i}$ given $X_i = 1$ as
$$P_{X_{-i} \mid X_i}(x_{-i} \mid x_i = 1) = \frac{P_X(x)}{P(X_i = 1)}, \qquad (3)$$
where $X_i$ is the $i$th element of $X$, and $X_{-i}$ is the subset of $X$ that omits $X_i$. Defining $\mathcal{X}^{(X_i = 1)}$ to be the subset of $\mathcal{X}$ with the constraint that $X_i = 1$ and taking the expectation of both sides of Equation (2) conditional on $X_i = 1$, by the Law of Iterated Expectations we have
$$P(I = 1 \mid X_i = 1) = \sum_{x \in \mathcal{X}^{(X_i = 1)}} \frac{\exp(\gamma_0 + \gamma^\top z(x))}{1 + \exp(\gamma_0 + \gamma^\top z(x))} \, P_{X_{-i} \mid X_i}(x_{-i} \mid x_i = 1). \qquad (4)$$
Next, we construct the individual pseudo-log-likelihoods corresponding to each univariate logistic regression of $I$ on $X_i = 1$ for each $X_i \in X$. Let $N$ denote the total population size, $N_{i1}$ denote the number of individuals in the population with $X_i = 1$, and $N^{(I)}_{i1}$ denote the total number of individuals with $X_i = 1$ who have been or will be infected with COVID-19. Therefore, $N^{(I)}_{i1}$ follows a binomial distribution,
$$N^{(I)}_{i1} \sim \text{Binomial}\big(N_{i1}, \, P(I = 1 \mid X_i = 1)\big), \qquad (5)$$
for $i = 1, \ldots, p^*$. We define the individual pseudo-log-likelihood of $(\gamma_0, \gamma)$ for $X_i$ as the log-likelihood of binomial distribution (5) given $(N^{(I)}_{i1}, N_{i1})$, and we define the full pseudo-log-likelihood of $(\gamma_0, \gamma)$ as the sum of the individual pseudo-log-likelihoods,
$$\ell(\gamma_0, \gamma) = \sum_{i=1}^{p^*} \Big[ N^{(I)}_{i1} \log P(I = 1 \mid X_i = 1) + \big(N_{i1} - N^{(I)}_{i1}\big) \log\big\{1 - P(I = 1 \mid X_i = 1)\big\} \Big] + \text{const}. \qquad (6)$$
We use the CHIS data to approximate $P_X(x)$, which we denote $\hat{P}_X(x)$. Let $N^{(I)}$ denote the total number of individuals in the population who have been or will be infected with COVID-19, and let $\pi_I = P(I = 1)$ denote the overall infection rate in the population. Thus, the total population size is $N = N^{(I)} / \pi_I$. From the CDPH data presented in Table 1, we have the cumulative number of reported COVID-19 infections as of June 12, 2020, which we denote $\hat{N}^{(I)}$. Because $\hat{N}^{(I)}$ measures the cumulative number of COVID-19 infections up to June 12, 2020, and increases daily, $\hat{N}^{(I)}$ is smaller than $N^{(I)}$, perhaps substantially. Furthermore, $\pi_I$ is unknown, and for a given estimate $\hat{\pi}_I$ of $\pi_I$, we define $\hat{N} = \hat{N}^{(I)} / \hat{\pi}_I$. Therefore, even for accurate estimates of $\pi_I$, $\hat{N}$ will be smaller, perhaps substantially, than the total number of individuals in the population. However, we assume here that the relative size of $\hat{N}$ to $N$ is approximately equal to the relative size of $\hat{N}^{(I)}$ to $N^{(I)}$. Hence, $\hat{N}$ may be interpreted as an appropriately scaled version of $N$ with respect to $\hat{N}^{(I)}$ and $\hat{\pi}_I$ as of June 12, 2020. Likewise, having the same interpretation as $\hat{N}$ but for the subset of the population with [...] to be the cumulative number of infected individuals with $X_i = 1$ as of June 12, 2020, and present $\{\hat{N}^{(I)}_{i1}\}$ in Table 1. We maximize the approximate pseudo-likelihood (6), in which the unknown population quantities are replaced by these approximations, with respect to $(\gamma_0, \gamma)$ to obtain our estimates
$$(\hat{\gamma}_0, \hat{\gamma}) = \arg\max_{(\gamma_0, \gamma)} \hat{\ell}(\gamma_0, \gamma). \qquad (7)$$
Lastly, by plugging $(\hat{\gamma}_0, \hat{\gamma})$ into Equation (2), we obtain the predicted test-based infection probabilities for individuals with gender-age-race covariate setting $z$,
$$\hat{P}(I = 1 \mid z) = \frac{\exp(\hat{\gamma}_0 + \hat{\gamma}^\top z)}{1 + \exp(\hat{\gamma}_0 + \hat{\gamma}^\top z)}. \qquad (8)$$
Similar to the T-IR estimation method, we model the T-CFRs given gender, age, and race using a multivariable logistic regression model. The gender-age-race covariate we use for CFR estimation (see Table 3) is the same as the covariate we use for IR estimation, except that we combined the 0-17 and 18-34 age groups due to low numbers of fatalities among the 0-17 age group. With a slight abuse of notation, we denote $z \in \{0, 1\}^q \subset \mathbb{R}^q$ to be the covariate setting of the vector of non-reference group covariates $Z$, where $q = 14$. The corresponding random variable $X$ and its covariate setting $x$ are as defined in the preceding subsection and have dimension $q^*$, where $q^* = 17$. We give the T-CFR model as
$$\operatorname{logit} P(M = 1 \mid I = 1, Z = z) = \delta_0 + \delta^\top z, \qquad (9)$$
where $M \in \{0, 1\}$ represents mortality status, $\delta_0$ is the log odds of mortality for the LatinX female age 0-34 group, and $\delta \in \mathbb{R}^q$ are the log odds ratios of mortality for other covariate settings. We again employ a pseudo-likelihood approach to estimate $(\delta_0, \delta)$ that maximizes a likelihood function constructed from univariate logistic regression models.
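The pseudo-likelihood construction described above, which fits a multivariable logistic model using only marginal counts and a joint demographic distribution, can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the two-attribute population `P_X`, the "true" coefficients `TRUE`, the population size `N`, and the crude coordinate search in `fit`, which merely stands in for a proper numerical optimizer (for example one from a scientific computing library).

```python
import math

# Hypothetical joint pmf over (gender, age) cells; made-up numbers.
P_X = {("F", "young"): 0.30, ("F", "old"): 0.20,
       ("M", "young"): 0.35, ("M", "old"): 0.15}
# Marginal levels X_i = 1, written as (attribute position, value).
LEVELS = [(0, "F"), (0, "M"), (1, "young"), (1, "old")]


def z_of(x):
    """Non-reference indicators (male, old); references are female and young."""
    return (1.0 if x[0] == "M" else 0.0, 1.0 if x[1] == "old" else 0.0)


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def cell_prob(theta, x):
    """P(I = 1 | z(x)) under the logistic model; theta = [gamma0, gamma_1, gamma_2]."""
    return expit(theta[0] + sum(g * zi for g, zi in zip(theta[1:], z_of(x))))


def marginal_prob(theta, level):
    """P(I = 1 | X_i = 1): cell probabilities averaged over the other attribute."""
    pos, value = level
    cells = [x for x in P_X if x[pos] == value]
    mass = sum(P_X[x] for x in cells)
    return sum(cell_prob(theta, x) * P_X[x] for x in cells) / mass


def pseudo_loglik(theta, infected, totals):
    """Sum of binomial log-likelihood kernels, one per marginal level."""
    ll = 0.0
    for level in LEVELS:
        p = min(max(marginal_prob(theta, level), 1e-12), 1.0 - 1e-12)
        k, n = infected[level], totals[level]
        ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return ll


def fit(infected, totals, step=1.0, tol=1e-6, max_sweeps=100_000):
    """Crude coordinate search that only accepts strict improvements."""
    theta = [0.0, 0.0, 0.0]
    best = pseudo_loglik(theta, infected, totals)
    for _ in range(max_sweeps):
        if step <= tol:
            break
        improved = False
        for j in range(len(theta)):
            for delta in (step, -step):
                cand = list(theta)
                cand[j] += delta
                val = pseudo_loglik(cand, infected, totals)
                if val > best:
                    theta, best, improved = cand, val, True
        if not improved:
            step /= 2.0  # refine the search once no coordinate move helps
    return theta


# Marginal counts generated (without noise) from assumed coefficients.
TRUE = [math.log(0.02 / 0.98), 0.3, 0.8]
N = 1_000_000
totals = {lv: N * sum(P_X[x] for x in P_X if x[lv[0]] == lv[1]) for lv in LEVELS}
infected = {lv: totals[lv] * marginal_prob(TRUE, lv) for lv in LEVELS}
theta_hat = fit(infected, totals)
print("fitted coefficients:", theta_hat)
```

Note that only the four marginal counts and the joint pmf enter the objective; no individual-level or cell-level outcome data are used, which is the point of the approach. The same structure would apply to the T-CFR model if the population pmf were replaced by the distribution of attributes among infected individuals.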
Following similar steps as shown in the preceding subsection, we have
$$P(M = 1 \mid X_i = 1, I = 1) = \sum_{x \in \mathcal{X}^{(X_i = 1)}} \frac{\exp(\delta_0 + \delta^\top z(x))}{1 + \exp(\delta_0 + \delta^\top z(x))} \, P_{X_{-i} \mid X_i}(x_{-i} \mid x_i = 1, I = 1). \qquad (10)$$
We use the CHIS data and the IR model (1) with coefficient estimates (7) to estimate $P_{X_{-i} \mid X_i}(x_{-i} \mid x_i = 1, I = 1)$. First, we estimate $P(z \mid I = 1)$ using
$$\hat{P}(z \mid I = 1) = \frac{\hat{P}(I = 1 \mid z) \, \hat{P}_X(x(z))}{\sum_{z'} \hat{P}(I = 1 \mid z') \, \hat{P}_X(x(z'))}, \qquad (11)$$
where $\hat{P}(I = 1 \mid z)$ comes from Equation (8), and $\hat{P}_X(x(z))$ is obtained from the CHIS dataset. Then, by the definition of conditional probability, we estimate
$$\hat{P}_{X_{-i} \mid X_i}(x_{-i} \mid x_i = 1, I = 1) = \frac{\hat{P}(z(x) \mid I = 1)}{\sum_{x' \in \mathcal{X}^{(X_i = 1)}} \hat{P}(z(x') \mid I = 1)}, \qquad (12)$$
where $\hat{P}(z(x) \mid I = 1)$ comes from Equation (11).
Analogous to the IR model, we denote $N^{(M)}_{i1}$ to be the number of individuals with $X_i = 1$ who have died or will die from COVID-19. Therefore, the $\{N^{(M)}_{i1}\}$ follow binomial distributions,
$$N^{(M)}_{i1} \sim \text{Binomial}\big(N^{(I)}_{i1}, \, P(M = 1 \mid X_i = 1, I = 1)\big), \quad i = 1, \ldots, q^*. \qquad (13)$$
We then construct the full pseudo-log-likelihood of $(\delta_0, \delta)$ as the sum of the individual pseudo-log-likelihoods of $(\delta_0, \delta)$ for $X_i$ corresponding to binomial distribution (13),
$$\ell(\delta_0, \delta) = \sum_{i=1}^{q^*} \Big[ N^{(M)}_{i1} \log P(M = 1 \mid X_i = 1, I = 1) + \big(N^{(I)}_{i1} - N^{(M)}_{i1}\big) \log\big\{1 - P(M = 1 \mid X_i = 1, I = 1)\big\} \Big] + \text{const}. \qquad (14)$$
From the CDPH data presented in Table 1, we have the cumulative number of COVID-19 deaths by gender, age, and race. We denote $\hat{N}$ [...] for $P_{X_{-i} \mid X_i}(x_{-i} \mid x_i = 1)$, $i = 1, \ldots, q^*$, in the pseudo-likelihood (14), we obtain an approximate pseudo-likelihood and maximize it with respect to $(\delta_0, \delta)$ to obtain our estimates $(\hat{\delta}_0, \hat{\delta})$. Lastly, from $(\hat{\delta}_0, \hat{\delta})$, we can obtain the predicted COVID-19 test-based case fatality rates for individuals with gender-age-race covariate setting $z$,
$$\hat{P}(M = 1 \mid I = 1, z) = \frac{\exp(\hat{\delta}_0 + \hat{\delta}^\top z)}{1 + \exp(\hat{\delta}_0 + \hat{\delta}^\top z)}. \qquad (17)$$
To quantify the uncertainty of the T-IR and T-CFR estimates in (8) and (17) [...] Similarly, the third stage introduces variation in the $\{\hat{N}^{(M)}_{i1}\}$ [...]
The entire Monte Carlo simulation procedure can be summarized in 5 steps:
Monte Carlo Simulation Procedure
Step 1: Bootstrap the CHIS dataset with selection probabilities proportional to the sampling weights.
Step 2: Perform the T-IR estimation procedure, simulating a value of $\hat{N}^{(I)}_{i1}$ from (18) for each $X_i$, subsequently obtaining estimates $(\hat{\gamma}_0, \hat{\gamma})$.
Step 3: Using $(\hat{\gamma}_0, \hat{\gamma})$ obtained in Step 2, perform the T-CFR estimation procedure, simulating a value of $\hat{N}^{(M)}_{i1}$ from (19) for all $X_i$, subsequently obtaining estimates $(\hat{\delta}_0, \hat{\delta})$.
Step 4: Repeat Steps 1-3 to obtain a total of $B$ estimates of $(\gamma_0, \gamma)$ and $(\delta_0, \delta)$, which we denote $\gamma$ [...]
Step 5: For each set of bootstrap coefficient estimates $\gamma$ [...]
In Figure 1, we illustrate the Monte Carlo simulation procedure in a flow chart.
In addition to estimating the T-IR and T-CFR for specific covariate settings $z$ through Equations (8) and (17) [...] $\mathcal{X}^{(J_r)}$ denotes the subset of $\mathcal{X}$ with the constraint that $X_{j_1} = c_{j_1}, \ldots, X_{j_r} = c_{j_r}$. Estimates of collapsed T-IRs given $X_{j_1} = c_{j_1}, \ldots, X_{j_r} = c_{j_r}$ can be obtained using the marginalization formula
$$\hat{P}(I = 1 \mid X_{j_1} = c_{j_1}, \ldots, X_{j_r} = c_{j_r}) = \frac{\sum_{x \in \mathcal{X}^{(J_r)}} \hat{P}(I = 1 \mid z(x)) \, \hat{P}_X(x)}{\sum_{x \in \mathcal{X}^{(J_r)}} \hat{P}_X(x)}. \qquad (20)$$
Likewise, collapsed estimates of T-CFRs given $X_{j_1} = c_{j_1}, \ldots, X_{j_r} = c_{j_r}$ can be obtained using the marginalization formula
$$\hat{P}(M = 1 \mid I = 1, X_{j_1} = c_{j_1}, \ldots, X_{j_r} = c_{j_r}) = \frac{\sum_{x \in \mathcal{X}^{(J_r)}} \hat{P}(M = 1 \mid I = 1, z(x)) \, \hat{P}(z(x) \mid I = 1)}{\sum_{x \in \mathcal{X}^{(J_r)}} \hat{P}(z(x) \mid I = 1)},$$
where $\hat{P}(z(x) \mid I = 1)$ comes from Equation (11).
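The marginalization step can be illustrated with a small weighted-average sketch. The cell-level rates, the population pmf, and the function name `collapsed_rate` are hypothetical values and names chosen for illustration. The same function serves for collapsed T-IRs, weighting by the population pmf, and for collapsed T-CFRs, weighting by the attribute distribution among infected individuals.

```python
# Hypothetical cell-level estimates for (gender, age) cells; made-up numbers.
ir = {("F", "young"): 0.01, ("F", "old"): 0.03,
      ("M", "young"): 0.02, ("M", "old"): 0.05}
cfr = {("F", "young"): 0.002, ("F", "old"): 0.08,
       ("M", "young"): 0.004, ("M", "old"): 0.12}
# Hypothetical population pmf over the same cells.
p_x = {("F", "young"): 0.30, ("F", "old"): 0.20,
       ("M", "young"): 0.35, ("M", "old"): 0.15}


def collapsed_rate(cell_rate, cell_weight, constraint):
    """Weighted average of cell rates over the cells matching `constraint`.

    `constraint` maps attribute positions to required values, for example
    {0: "M"} for males; an empty dict collapses over all cells.
    """
    cells = [x for x in cell_rate
             if all(x[pos] == value for pos, value in constraint.items())]
    mass = sum(cell_weight[x] for x in cells)
    return sum(cell_rate[x] * cell_weight[x] for x in cells) / mass


# Collapsed infection rates use the population pmf as weights.
ir_male = collapsed_rate(ir, p_x, {0: "M"})
ir_overall = collapsed_rate(ir, p_x, {})

# Collapsed case fatality rates use the attribute distribution among the
# infected, obtained by Bayes' rule from the infection rates and the pmf.
p_x_infected = {x: ir[x] * p_x[x] / ir_overall for x in p_x}
cfr_male = collapsed_rate(cfr, p_x_infected, {0: "M"})

print(ir_male, ir_overall, cfr_male)
```

Because the weights differ between subgroups, a group can have higher rates in every stratum and still have a lower collapsed rate if its members are concentrated in low-risk strata.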
We present select estimates of T-IRs and T-CFRs obtained from our IR (1) and CFR (8) models fit to the California data described in Section 2 and summarized in [...] American Indian or Alaska Native (AIAN). LatinX individuals have the highest T-IRs, followed by African Americans. Both females and males age 80 and older have extremely high T-IRs compared with other age groups across race/ethnicity groups. T-IRs were non-monotonic at younger ages, with age groups 60-64 and 70-74 having slightly lower T-IRs than the preceding age groups, 50-59 and 65-69, respectively. Males had higher T-IRs than females across all age groups, with the gender gap slightly increasing with age.
We also considered alternate values for the overall California IR. Table 4 presents the point estimates and associated two SE intervals of the marginal T-IRs for gender and age group obtained from marginalization formula (20), assuming overall IRs of 1%, 2%, and 5%. The estimated marginal T-IRs for gender, age groups, and race and ethnicity groups are consistent with the results presented in Figure 2, including males and older individuals having higher estimated T-IRs.
Among the six race and ethnicity groups, African American, Asian, and White are the high-risk groups, with mean T-CFRs of 7.42%, 6.80%, and 6.78%, respectively. [...] age groups based on the stratified results. African American females even have higher T-CFRs than AIAN, LatinX, and White males in each corresponding age group. Although the Multi-race group has higher T-CFRs than the White group in each age group, its overall marginal T-CFR is lower than that of the White group, as shown in Table 5. The reversal of the inequality between the size ratios is an example of Simpson's paradox [...]
In this paper, we combined aggregate COVID-19 case and fatality data with population demographic data in a pseudo-likelihood based multivariable logistic regression approach for obtaining early estimates of COVID-19 T-IRs and T-CFRs for subgroups of the California population. Overall, our results revealed that males, the elderly, and LatinX individuals are marginally at relatively higher risk of COVID-19 infection, and that males, the elderly, African Americans, Asians, and Whites are marginally at elevated risk of mortality after COVID-19 infection. However, due to the imbalance in the age distribution of different races in California, the subgroups with the top 5 T-CFRs are African American male, Multi-race male, Asian male, African American female, and AIAN male for each age group. Overall, therefore, African Americans are the race/ethnicity group most vulnerable to COVID-19 in California. We also found that the elevated infection and mortality risk for males and the greater mortality risk for all races increase with age.
Figure 3: Estimated test-based case fatality rates by age and gender, stratified by race/ethnicity. The bootstrap mean case fatality rates are presented separately for LatinX, White, Asian, African American, Multi-Race, and AIAN groups. The overall infection rate was assumed to be 2%, and the error bars denote two bootstrap standard errors.
The proposed methods are subject to three general limitations. First, the analysis is based on publicly available test-based infection rates and case fatality rates. It has been well documented that the lack of testing for COVID-19 in the U.S. has hindered efforts to estimate the true COVID-19 infection rate. Further compounding this issue is the high prevalence of asymptomatic COVID-19 cases. These two issues may lead to substantial underestimates of the infection rates and/or substantial overestimates of the case fatality rates from our analyses. Second, race/ethnicity is missing in 29% of the reported cases from CDPH, which may bias our estimates. Even though CDPH releases summary statistics for age-race covariates and our model can fit the finer data, we fit the marginal statistics for each risk factor in the analysis to minimize the impact [...]
Another promising avenue for future work is combining this method with a COVID-19 prediction model [24] to provide detailed demographic projections of COVID-19 cases and mortalities. This would be a substantial improvement over most COVID-19 prediction models, as they tend to be quite limited in their ability to forecast the demographic characteristics of the infected.
In summary, this paper provides a pragmatic tool for producing early estimates of COVID-19 T-IRs and T-CFRs for the California population, which offer valuable information to guide health policies concerning the control and prevention of COVID-19. In addition, our methods can be extended into a general framework for early estimation of subpopulation IRs and CFRs from aggregate case and fatality data in other locations and for future epidemics.
References
The novel coronavirus originating in Wuhan, China: challenges for global health governance
WHO declares COVID-19 a pandemic
global cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU), ArcGIS
Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease, MMWR. Morbidity and Mortality Weekly Report 69
New Jersey Department of Health, COVID-19 Information Hub
COVID-19 race and ethnicity data
Risk factors of critical & mortal COVID-19 cases: a systematic literature review and meta-analysis
Gender differences in patients with COVID-19: focus on severity and mortality
Features of 20 133 UK patients in hospital with COVID-19 using the ISARIC WHO clinical characterisation proto
Predictors of mortality for patients with COVID-19 pneumonia caused by SARS-CoV-2: a prospective cohort study
A practical introduction to multivariate meta-analysis
A general framework for the use of logistic regression models in meta-analysis
Logistic regression in meta-analysis using aggregate data
Estimation of risk factors for COVID-19 mortality-preliminary results, medRxiv
Cases and deaths associated with COVID-19 by age group in California, COVID-19-Cases-by-Age-Group.aspx
Public Use File
Assessing the global tendency of COVID-19 outbreak
Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: a data-driven analysis in the early phase of the outbreak, International Journal of Infectious Diseases
COVID-19 antibody seroprevalence in
Universal screening for SARS-CoV-2 in women admitted for delivery
Fusing a Bayesian case velocity model with random forest for predicting COVID-19 in the US
We thank Dr. Sudipto Banerjee and Jay J. Xu (University of California, Los Angeles) for their many helpful comments and assistance. The authors received no specific funding for this work. The authors have declared that no competing interests exist.