key: cord-0900153-pulat2w4 authors: Kaciroti, Niko A.; Lumeng, Carey; Parekh, Vikas; Boulton, Matthew L. title: A Bayesian Mixture Model for Predicting the COVID-19 Related Mortality in the United States date: 2021-02-19 journal: Am J Trop Med Hyg DOI: 10.4269/ajtmh.20-1147 sha: a2f645770f675cedde414a93fe847bd16853a52d doc_id: 900153 cord_uid: pulat2w4 An outbreak of SARS-CoV-2 has led to a global pandemic affecting virtually every country. As of August 31, 2020, globally, there have been approximately 25,500,000 confirmed cases and 850,000 deaths; in the United States (50 states plus District of Columbia), there have been more than 6,000,000 confirmed cases and 183,000 deaths. We propose a Bayesian mixture model to predict and monitor COVID-19 mortality across the United States. The model captures skewed unimodal (prolonged recovery) or multimodal (multiple surges) curves. The results show that across all states, the first peak dates of mortality varied between April 4, 2020 for Alaska and June 18, 2020 for Arkansas. As of August 31, 2020, 31 states had a clear bimodal curve showing a strong second surge. The peak date for a second surge ranged from July 1, 2020 for Virginia to September 12, 2020 for Hawaii. The first peak for the United States occurred about April 16, 2020—dominated by New York and New Jersey—and a second peak on August 6, 2020—dominated by California, Texas, and Florida. Reliable models for predicting the COVID-19 pandemic are essential to informing resource allocation and intervention strategies. A Bayesian mixture model was able to more accurately predict the shape of the mortality curves across the United States than other models, including the timing of multiple peaks. However, given the dynamic nature of the pandemic, it is important that the results be updated regularly to identify and better monitor future waves, and characterize the epidemiology of the pandemic. An outbreak of SARS-CoV-2 has led to worldwide spread, affecting virtually every country globally with approximately 25,500,000 confirmed cases and more than 850,000 deaths as of August 31, 2020 based on WHO reporting. 1 In the United States alone (50 states plus District of Columbia), there have been more than 6,000,000 confirmed cases and 183,00 deaths as of the end of August. The pandemic has resulted in substantial disruptions to people's lives. At various points, more than 3 billion people throughout the world have been under various lockdown orders. 2 The application of these orders has varied widely across countries and within countries. 3 Public health countermeasures to interrupt and control transmission rely on predictive models and an understanding of disease dynamics. 4, 5 Because disease is acquired through asymptomatic transmission in most of the cases, the epidemiology of the pathogen is not fully understood. It has been firmly established that it is readily transmitted via droplet nuclei, but airborne transmission also likely plays a role. 6, 7 Understanding the transmission dynamics of the pandemic is crucial for informing decisions regarding resource allocation, in instituting control measures, and in assessing their effectiveness as a mitigation strategy. The COVID-19 pandemic has spawned the development of a large number of predictive mathematical models. 8 Two commonly used approaches are based on transmission models [9] [10] [11] and curve-fitting models. 12 Transmission models simulate how quickly an infection can spread in a community that is immunologically naive, based on a number of initial assumptions. Although such models are useful, they are based on parameters that are hard to determine, and therefore are sensitive to initial values and assumptions. Consequently, the results can vary greatly (Imperial College model 10 versus Oxford model 11 ) and substantially overestimate or underestimate the full extent of an outbreak. 13 Curve fitting models use available COVID-19 data to determine if trends exist, and project future disease trajectories by extrapolation (Institute for Health Metrics and Evaluation [IHME] COVID-19 health service utilization forecasting team 12 ). Although such models can be useful for short-term prediction, their long-term projections will vary depending on the type of the curve used for extrapolation in addition to the impact of new and unforeseen factors, including, for example, the development of effective vaccines, virus mutations, or changes in government interventions including stay-at-home orders or prematurely phasing out of such orders. We use data from the United States to inform and monitor COVID-19 mortality in 50 states and the District of Colombia using a Bayesian curve fitting model. Data on the number of confirmed cases are confounded by changes in test availability. We used mortality data as a more reliable measure for modeling pandemic progress over time. Our approach uses a Bayesian modeling framework 14 for modeling and predicting daily mortality, and subsequently deriving the cumulative mortality projections over time. The Bayesian framework allows for updating prior knowledge about the quantity of interest using the observed data and calculates posterior distribution for the quantity of interest. Bayesian inferences are derived from the posterior distributions of quantities of interest, which are used for projections and their corresponding credible intervals. In addition, the Bayesian framework provides computational power via the Markov chain Monte Carlo methodology to provide exact estimate of the quantity of interest, rather than using approximate optimization algorithms. The Bayesian model is applied to COVID-19 mortality data in the United States, but can be used in a similar manner for predicting other COVID-19 measures, including the number of confirmed cases, the number of COVID-19-related hospitalizations, and healthcare utilizations. Curve fitting models are useful mathematical models for predicting the trajectory of pandemics over time. 8 A commonly used curve for such models is a bell-shaped curve defined by the Gaussian function: fðtÞ = PpexpðÀ½t À t peak 2 = ½2σ 2 Þ, where t peak is the time to peak, P is the magnitude of the peak, and σ captures the width of the curve. Alternatively, f(t) can be expressed as fðtÞ = ðP ffiffiffiffiffiffi 2π p σÞ 1 ffiffiffiffi ffi 2π p σ expðÀ½t À t peak 2 =½2σ 2 Þ = Mpfðt; t peak σÞ, where M = P ffiffiffiffiffiffi 2π p σ is the overall expected mortality and f(t; t peak, σ) is the normal density function, which governs how the overall mortality M is distributed over time; that is, expected mortality at time t is M*f(t; t peak, σ). The initial IHME model 12 uses a Gaussian sigmoidal function for forecasting the cumulative mortality data (which is equivalent to a Gaussian function for the daily mortality data). In the IHME model, the parameters of the curvature function are estimated based on a regression model with a normal distribution on the log-transform cumulative deaths as outcome. 12 For a curve fitting model to perform well, the shape of the curve chosen is important; it must have a sound fit with the observed data and an appropriate underlying theoretical justification. The curve defined by a Gaussian function has a symmetric shape around the peak, and future trends are forecasted by extrapolating the observed trend forcing the postpeak component of the trajectory to be symmetric with the pre-peak portion. These forecasts assume that factors related to the pandemic do not change over time, thus are suitable when the spread of the pandemic is relatively homogeneous with no mitigating measures of behavioral changes. Although the assumption of homogeneity over time might be reasonable over short time periods, it is unrealistic for long time periods and large areas, as changes in the pandemic progression will result in behavioral changes at the individual level, and policy and practice changes at the local, state, and federal levels (i.e., social distancing and stay-at-home orders). Such changes have the potential to alter, sometimes substantially, the natural arc of the disease course. To account for such changes, we propose a model where the curve for the daily mortality trajectory μ(t) at day t is expressed as a mixture of multiple homogeneous sub-curves. Each sub-curve can be seen as implicitly capturing a sub-epidemic or a specific trend in the trajectory during a given time period, which relates to underlying factors (known or unknown) for that span of time. This is particularly relevant for the COVID-19 pandemic in the United States, where, for example, changes in compliance with social distancing in late May early June 2020 resulted in a surge of cases in a large number of states. Specifically, let the mortality at day t be comprised as yðtÞ = + K i = 1 y i ðtÞ, where y i (t) represents the number of deaths at day t attributed to a surge or sub-epidemic indexed by i = 1,. . .,K, K ³ 1 is the number of sub-epidemics empirically identified, and E(y i [t]) = μ i (t). The trajectory μ i (t) for surge i can be modeled by a homogeneous Gaussian function μ i (t) = M i *f(t; t peak,i, σ i ), where M i is the overall mortality, t peak,i is the time to peak, and σ i represents the width for surge i. The mortality curve for the whole pandemic E(y[t]) = μ(t) is then decomposed as: Let M = + K i = 1 M i be the overall mortality of the pandemic, then π i = M i =M is the proportion of all deaths attributed to surge i, with M i = M π i . Therefore, we proposed the following mixture curve model for modeling the trajectory of daily COVID-19 mortality: where M is the overall mortality and f(t; t peak , i σ i ) represents the part of the curve related to some underlying factor(s), π i is the proportion of total deaths that are attributed to such factor(s) indexed by i, and K is the number of sub-curves comprising the mixtures. The number of parameters identifying the curve is 3K: M, π 1 . . ., π K−1, ( t peak , 1 σ 1 ), . . ., (t peak , K σ K ). The f(t; t peak,i σ i ) here is the Gaussian density function, defined by the location parameter t peak,i , which represents the time from the first death to apex, and the scale parameter σ i , which represents the spread of the curve. We propose a Bayesian approach for estimating the curve parameters and subsequently any other statistics of interest, with values for the parameters being drawn from their posterior distribution condition on the observed data. We assume the distribution for the observed mortality data y(t) at time t to be negative binomial: where N is the state's population size. A negative binomial distribution function is used for Y(t), as it is appropriate for count data, with EðY½tÞ = μðtÞNp t =ð1 À p t Þ. Bayesian Markov Chain Monte Carlo methods are used to make draws from the posterior distribution of the unknown parameters and derive future projections. Weakly informative prior distributions were used for all model parameters 15 (see Supplemental Material). The number of mixtures K is selected based on choosing a parsimonious model having lower deviance information criteria. The resulting shape of the curve μ(t) can be multimodal or unimodal. For a multimodal curve, the proposed modeling identifies multiple surges (i.e., Sun Belt States, California, Florida, Texas). For unimodal curves, the proposed modeling can accommodate skewed long-tailed distributions, 16 where the curve comprises multiple sub-curves, but is dominated by one major surge or multiple surges occurring closely spaced in time (e.g., New York, New Jersey, Michigan). In this section, we apply the proposed model to the COVID-19 mortality data in the United States (www.worldometers.info/ coronavirus/). Because of variation in the number of death records by day of the week (particularly on weekends), we used a weekly 7day moving average (± 3 days window) as a more reliable measure for daily mortality. All analyses were performed in SAS 9.4 (SAS Institute Inc., Cary, NC) 17 using PROC MCMC. We ran 500,000 iterations following 10,000 burn in and 1,000 thinnings to reduce high autocorrelation. The expected number of deaths for each day was derived based on the corresponding posterior distribution. Based on the data available as of August 31, 2020, for most of the states, a mixture of K = 2 sub-curves showed a good and parsimonious fit of the data, with no substantial improvement for K > 2. For Arizona, Louisiana, Massachusetts, Michigan, New York, South Carolina, Tennessee, and Virginia, a mixture of K = 3 sub-curves improved the model fit, with no substantial improvement for K > 3. For the entire country, we derived the curve for daily mortality as the sum of estimates from each state plus the District of Columbia. Figure 1 displays the daily mortality curves for each state and District of Colombia and for the entire country. There is variability among curves in the context of their shape, magnitude, and timing of the pandemic. Daily mortality and cumulative mortality curves separately for each state and the entire nation are provided in Supplemental Figure 1 . Most of the states demonstrate a bimodal curve with two major peaks. The first peak represents the initial dynamic of the pandemic, following the introduction of control measures in March, and the second or third peak captures the surge that occurred in many states after measures were phased out to varying degrees. The estimated date of the peak for each state and the United States is shown in Table 1 . The first peak date among states varied between April 4, 2020 for Alaska, and June 18, 2020 for Arkansas. Thirty-one states have a clear bimodal curve, with However, as control measures were relaxed, many states had a second surge; for example, California, Texas, and Florida are currently experiencing their second major outbreak. The proportion of two mixtures, π 1 and π 2 = 1 − π 1 , are also shown in Table 1 . Most of the states are characterized by the second outbreak, dominating the curve. For example, the majority of deaths through September 30, 2020 for California (π 2 = 77% versus π 1 = 23%), Florida (π 2 = 77% versus π 1 = 23%), and Texas (π 2 = 85% versus π 1 = 15%) are projected to occur during the second surge. For other states, the majority of deaths occurred during the first peak, with no new major surge projected through September 30, 2020, that is, New York, (π 1 = 52% and π 2 = 36%, versus π 3 = 12%), New Jersey (π 1 = 65% versus π 2 = 35%), Massachusetts (π 1 = 45% and π 2 = 37% versus π 3 = 18%), and Michigan (π 1 = 48% and π 2 = 33% versus π 3 = 19%). The majority of deaths for the entire country are projected during the second surge (π 2 = 56% versus π 1 = 44%); however, the first peak was more severe-more than 2,250 deaths/day-whereas the second surge has a lower peak-around 1,200 deaths/day-but is of longer duration. Next, we evaluate the performance of the proposed Bayesian mixture model by comparing the projections based on the proposed Bayesian model to projections based on the widely used IHME hybrid model 18 (http://www.healthdata.org/ covid/data-downloads) updated on August 27, 2020, representing the last update in August. Our projections are derived on August 31, 2020; however, following the revision of our article, the mortality data as of September 30, 2020 have become available; therefore, we include these data in Tables 1 and 2 . Consequently, we also evaluate the performance of each projection based on the Bayesian model and IHME model by calculating the bias and the mean square error (mean square error ½MSE % bias 2 + ½fupper bound À lower boundg=4 2 ). The results from both models are similar, with the Bayesian model having lower bias and MSE in 35 of 52 projections for September 15, 2020, and lower bias in 35 and lower MSE in 38 of 52 projections for September 30, 2020. The average (median, IQR) bias for September 15 and September 30 projections were 2% (median = 0.4%, IQR = 1-2.25%) and 5.8% (median = 1.45%, IQR = 2.8-7.96%), respectively, based on the Bayesian model versus 5.5% (median = 0.8%, IQR = 1.65-6.0%) and 10.6% (median = 1.75%, IQR = 4.5-15.5%) based on the IHME hybrid model. The novel coronavirus, SARS-CoV-2, has caused an unprecedented global public health crisis, with the pandemic spreading to virtually every country worldwide in less than a year and accompanied by overwhelming levels of related morbidity and mortality. Predictive models continue to have a fundamental role to play in estimating the future burden of disease and in informing the allocation of critical laboratory, medical, and public health resources needed to successfully interrupt and eventually control the pandemic. We propose a Bayesian mixture model, which can capture multiple surges or sub-epidemics attributed to a number of different underlying factors, including the introduction and phasing out of control measures. As of August 31, 2020, a combination of two or three subcurves provided a parsimonious good fit for modeling daily mortality curve among all states in the United States through September 30, 2020. The results showed a second surge for some states and a prolonged recovery for others. For many states (e.g., Arizona, California, Florida, and Texas), most of the cases occurred in the second or third peak characterized by a major surge starting in late July. Other states experienced only a single major peak, but the distribution of mortality was skewed with a long tail end of the distribution (e.g., New York, New Jersey, and Michigan). Importantly, the mixture modeling approach accommodates the fit of both multimodal and unimodal skewed distribution as shown in Supplemental Figure 1 . The shapes of the mortality curves reveal that even for states that have successfully lowered mortality relative to its peak, it remains consistently greater than zero with a long tail or is even increasing. This is an indication of how challenging it will be to eradicate the pandemic or to reduce the risk of new surges if control measures are phased out too quickly. There is limited information to inform how a post-peak world will appear. At this point, we lack sufficient data on numerous parameters, including duration of immunity, the degree of public compliance with social distancing over time, and the political and governmental response to COVID-19, among others. 8 A gradual and data-driven relaxation of restrictions accompanied by continuous monitoring is necessary to avert an exponential increase in the cumulative number of cases. 19 Alterations in social mixing patterns and increased contact among susceptible individuals will clearly result in ongoing challenges to achieving control of the pandemic. 20 Our monthly predictions run through September 30, 2020, at which point the number of projected deaths is low for many states. However, as children return to school, lockdown orders expire, social distancing behaviors are relaxed, and individuals engage in greater social mixing including traveling during the holidays; there is likely to be a prolongation of transmission potentially accompanied by new surges and an overall increase in COVID-19 mortality. This also serves to underscore the importance of regularly updating model projections using an appropriate number of mixtures to capture new surges as they occur. Mathematical models can play a key role in better understanding the course of the pandemic. However, it is also important to be familiar with their underlying assumptions, strengths, and limitations. Given the dynamic and rapidly changing nature of the pandemic, any long-term projections will be sensitive to unforeseen changes. As such, these models are most reliable at shorter term monthly projections, and for monitoring trends, which inform planning for optimal management and distribution of resources, and evaluating the impact of control measures on the pandemic. Conversely, long-term projections for number of cases or deaths are sensitive to even small daily changes as these can translate into larger cumulative changes. This does not necessarily speak to model shortcomings as much as it confirms the dynamic nature of the disease transmission and changes in factors related to it. Specifically, in the last several months (after this article was submitted), two vaccines against COVID-19 were developed by Pfizer and Moderna. Although both vaccines are highly effective, there are many logistic challenges to achieve a high rate of vaccination. In addition, new strains of the virus are occurring, and it is hard to know what the new strains will look like in months from now or how resistant they will be to the vaccines. The proposed Bayesian mixture model is an effective tool for monitoring the pandemic over time and consequently provides monthly projections. Such model can and should be used in a rolling bases as new data come in. Whereas updating estimates using additional data is helpful, constant changes of the model used for prediction introduce the risk of overfitting the observed data, and potentially give rise to inconsistent projections. Any update of a model should be guided by theory that may include using different numbers of mixtures based on the data or the occurrence of new factors affecting the pandemic that may result in new surges. The quality of the model prediction will also depend on the availability of data. In the early stages of an epidemic, or even the early post-peak phases, data are often limited, with a weak data signal relative to the noise. Consequently, any model projections will be more sensitive to initial assumptions or prior information. Our Bayesian approach can accommodate different levels of prior knowledge and uncertainty into the model, such as information from other countries by introducing informative prior distributions. In general, using weakly informative priors is preferred, 16 as they have low impact in early projections that quickly fade away while more data become available, and in return, they improve model convergence. For the current modeling, we did not use informative priors in any of the model parameters. In summary, Bayesian mixture models are useful for monitoring and predicting COVID-19-related mortality in the United States or globally. These models are particularly helpful for identifying multiple surges and forecasting trajectories of skewed and multimodal curves. The results for the United States based on data as of August 31, 2020 showed that many states are experiencing a second surge, which for many is of greater magnitude than the first. Our model was able to more accurately characterize the actual bimodal shape of the pandemic mortality curves through September 30 for many states or unimodal but skewed curves reflecting the prolonged recovery for other states like New York. We are running our model regularly using the most updated data; the model performs well and is able to capture the new surge (after August 31, 2020) by increasing the number of mixtures. Identifying and monitoring the dynamic or multiple surges is important to understanding why such sub-epidemics occurred, and to inform future policy and practice decisions to more effectively prevent them. Moreover, providing regular pandemic forecasts is needed to guide the introduction or phasing out of programmatic interventions intended to control transmission in addition to providing an evidence-based decision-making for optimal resource allocation to address feature health needs. Coronavirus Disease (COVID-19) Pandemic Infographic: What Share of the World Population is Alreadyon COVID-19 Lockdown? An Interactive Map of the US Cities and States Still Under Lockdown -and Those that Are Reopening Association of public health interventions with the epidemiology of the COVID-19 outbreak in Wuhan, China Coronavirus disease 2019 (COVID-19): a literature review Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2) Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study Predictive mathematical models of the COVID-19 pandemic underlying principles and value of projections Modeling epidemics with compartmental On behalf of the Imperial College COVID-19 response Team Imperial College COVID-19 Response Team Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic IHME COVID-19 health service utilization forecasting team, Murray CJL, 2020. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months Models overestimate Ebola cases Bayesian Data Analysis The prior can often only Be understood in the context of the likelihood Skewness and mixture of normal distributions Base SAS 9.4 Utilities COVID-19 scenarios for the United States First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment Changes in contact patterns shape the dynamics of the COVID-19 outbreak in China Acknowledgments: Publication charges for this article were waived due to the ongoing pandemic of COVID-19.Authors' addresses: Niko A. Kaciroti, Carey Lumeng, Vikas Parekh, and Matthew L. Boulton, University of Michigan, Ann Arbor, MI, E-mails: nicola@med.umich.edu, clumeng@umich.edu, viparekh@ med.umich.edu, and mboulton@umich.edu. This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.