key: cord-0536855-6rusimb5
authors: Boukai, Ben; Wang, Jiayue
title: Bayesian Modeling of COVID-19 Positivity Rate -- the Indiana experience
date: 2020-07-09
journal: nan
DOI: nan
sha: e2141086ad35d67ce212a71c75af3331d069d6f9
doc_id: 536855
cord_uid: 6rusimb5

In this short technical report we model, within the Bayesian framework, the rate of positive tests reported by the the State of Indiana, accounting also for the substantial variability (and overdispeartion) in the daily count of the tests performed. The approach we take, results with a simple procedure for prediction, a posteriori, of this rate of 'positivity' and allows for an easy and a straightforward adaptation by any agency tracking daily results of COVID-19 tests. The numerical results provided herein were obtained via an updatable R Markdown document.

The Indiana State Department of Health, (ISDH), as any other similar entity across the nation and world-wide, is closely monitoring nowadays a pandemic of the 2019 novel Coronavirus, AKA: COVID-19. According to the World Health Organizing, (WHO), this highly contagious respiratory virus was first identified and reported in the city of Wuhan in China in early January 2020. Since then, this virus continues to spread and infect people around the world, including the United States. On March 11, 2020, the WHO published an assessment that COVID-19 can be characterized as a Pandemic. As of the date of this report (July 7, 2020), the John Hopkins University's Coronavirus Resource Center, reported over 11,500,000 infected people and 538,000 deaths due to the Coronavirus, globally, of which over 2,938,000 infections and 130,000 deaths are in the USA alone.

The state of Indiana was not spared of the contagious impact of the virus. On March 6, ISDH confirmed the first case of COVID-19 in a Hoosier with recent travel. On March 16, the ISDH reported the first death in Indiana due to . It subsequently worked with federal and local partners, including the Centers for Disease Control and Prevention, (CDC), to respond to this pandemic and the grave public health situation. Among other responses, the ISDH created a dashboard and a data depository for tracking and reporting the daily number of COVID-19 deaths in the State as well as the daily number of tests performed and the daily number of positive cases. While there is substantial (and urgent) effort being made, worldwide, for modeling of the the death rate (or the infection fatality rate, IFR) of COVID-19, (see for example Basu (2020) ), there has been very little attention given to modeling the rate of reported positive tests (often referred to rate of 'positivity').

In this report we model, within the Bayesian framework, the rate of positive tests reported by the the State of Indiana, accounting also for the substantial variability of the daily number test performed. The Bayesian approach has been used successfully in other COVID-19 related studies, e.g.: Dana, at. el. (2020), or Bayes, at. el. (2020). However, to the best of our knowledge, none of the available studies (to date) have direct relevancy to the modeling of the rate of positivity. The approach we take, provides a method for a valid prediction of this rate and allows for an easy and a straightforward adaptation by any agency tracking daily results of COVID-19 tests.

The Indiana COVID-19 data are available for retrieval through the ISDH Data Hub [10] as are reported for the state in the file covid report date.xlsx. It includes the daily records of the number of COVID-19 deaths cases, the total (daily) number of testings performed and the count of the positive cases, (see Appendix B for additional information).

For the purpose of this report, we focus attention only on two of the quantities reported; the daily reported number of test performed, COVID TEST and the daily reported number of positive tests, COVID COUNT as were reported for the (successive) 127 daily records to date. Their ratio, the daily Percentage of Positive Tests (PPT), is used for tracking and monitoring the pandemic's progression by the state and localities. The PPT is also an important indicator of the scope and extent of the state's testing enterprise; its high value suggests that testing is conducted primarily of the sickest patients and less so of the mild or the asymptomatic cases. A lower PPT may suggest that testing has been extended to cover patients with milder or no symptoms at all. According to the CDC Guidelines [7] , among other stipulations, a PPT ≤ 15% serves as a threshold for entering Phase II of the reopening plans for the State, whereas achieving a PPT ≤ 10% serves a threshold of entering Phase III of the reopening plans. At present, the 7-day average of this rate (of positive tests) for Indiana is 7.61% and cumulatively, the Indiana PPT stands at 9.18% (i.e. the total number of positive tests relative to the total number of tests performed to date, see also the ISDH Dashboard, which enabled the state to enter Phase III of the reopening plans. Thus, the importance of the appropriate modeling and tracking this percentage of reported positive tests cannot be overstated, as it has tremendous economic implications for the State and its people. In particular, constructing a valid predictive model, which predicts reasonably well the 'next-day' PPT, would provide the State with an early sign of a looming surge in positive cases, and would allow it to implement corrective measures and policies. Since the available records for the latest few days is still updating, we included records up to the last 3 days in the series. Thus, the data series includes m = 109 data points out of the available 127 records. Also marked in the figure are two horizontal lines indicating the 10% and 5% thresholds. In Appendix A below, we provide basic descriptive statistics concerning the daily reported number of tests and the daily reported number of positive tests as well as of their ratio, the PPT. In Figure 2 above, we present the cumulative PPT rate for Indiana over these days. We also super-imposed on the daily chart the 0.95 (or 95%) posterior predictive interval of [0.07981, 0.1001] for the current Indiana (cumulative) PPT rate, as resulted from the Bayesian predictive modeling we developed and present in this report (see Illustrations 1 and 2 below). As can be seen, the calculated posterior prediction interval contains the CDC's 10% threshold, which could serve as an early warning flag to the State and may prompt it to initiate some mitigating measures (such as mandating masks in public) in an effort to maintain the rate of positivity below it.

Having retrieved the data as described above, we labeled them as y i ≡COVID COUNT and k i ≡COVID TESTS for i = 1, . . . , m where m = 109 is the total number of observations included in the analysis. Thus, we denote the given data as

Ordinarily, when modeling the count of positive test results y i recorded out of the k i tests performed (assuming independence, risk homogeneity of the tested population and no lagging information), the binomial model would be appropriate. So that conditional on the number of tests performed, and a given p ∈ (0, 1)

Accordingly, given k 1 , k 2 , . . . , k m , the daily counts of the positive tests, y 1 , y 2 , . . . , y m are conditionally independent binomial random variables with p = Pr(T est = +). In a similar fashion, we model the reported number of tests performed k 1 , k 2 , . . . , k m as (independent) Negative Binomial random variables. That is, for given (fixed) integer r > 0, and θ ∈ (0, 1),

As we can see from Appendix A, the observed marginal distribution of the reported daily number of test performed is exhibiting over-dispersion features that are characteristic to mixed-Poisson or to Negative Binomial counts. Thus, combining (1)- (2), we obtain that given (p, θ), the joint probability (mass) function of the m pairs, (y i , k i ), i = 1, . . . , m, (i.e. the likelihood function), is

where, ∝ indicates proportionality of terms, up to a constant, and where N m = m j=1 k j and X m = m j=1 y j are the cumulative reported number of tests performed and the cumulative reported positive tests. We note in passing that the cumulative PPT rate mentioned in the Introduction is merely the ratio,p m := X m /N m .

The standard Bayesian predictive model in the case of Binomial-Negative Binomial counts (as in (1)-(2)), is that with the conjugate Beta-Beta joint prior distribution on (p, θ) (see for example Gelman et. al. (2014) ). That is, in the case of the likelihood function f (D m |p, θ) given in (3) above, we consider the Bayesian model that assumes that for given p, and N m ,

for some a > 0 and b > 0 and where, given θ > 0,

for some fixed integer r > 0, and some c > 0, and d > 0. Thus, the joint prior pd f of (p, θ) is,

Accordingly, and since the joint posterior pd f for (p, θ) **given** the data D m , is

we immediately obtain from (3)-(6) (due to the conjugacy) that **given** the data, X m and N m , the (marginal) posterior distribution of p, denoted as π(p |X m , N m ), is also a Beta distribution and the (marginal) posterior distribution of θ, denoted as π(θ |N m ), is also a Beta distribution. Specifically, it follows that given X m and N m ,

where these four updated parameters are given by:

Hence, the Bayes estimates of θ and p given the data (X m , N m ), are the respective posterior means:p

Remark: The choice in (2) for using the Negative Binomial distribution to model the reported daily number of tests k i , could be seen as specific to the Indiana COVID-19 testing data, which might reflect testing capacity limitation and daily variability unique to that state. Other models that account for the observed overdispersion characteristics of the data (See Appendix A) could be used instead. For instance, one may alternatively consider the related mixed-Poisson distribution in (2). 

Having observed (X m , N m ), the posterior predictive distribution , under the Bayesian model (3)-(6) and given (X m , N m ), of a 'new' (or 'future') observation on the number of positive cases, Y * , out of a given K * = k * new tests is the the beta-binomial distribution,

for y * = 0, 1, . . . , k * . We denote this (predictive) distribution for Y * , given K * = k * , as

with a m and b m as in (7). The corresponding posterior predictive mean and variance of Y * are given by

In a similar manner we obtain the posterior predictive distribution under this Bayesian model and given (X m , N m ), of a 'new' (or 'future') number of tests K * is the Beta-Negative Binomial distribution. In fact, with K * | θ ∼ NegBinom(r, θ), we have

for k * = 0, 1, 2, . . . . We denote this (posterior predictive) distribution for K * as

with c m and d m as in (7). The corresponding posterior predictive mean and variance of K * are given by

It is straightforward to see that the joint posterior predictive probability (mass) function of (Y * , K * ), can easily be obtained from expressions (7) and (8) is

Clearly, along with the value of Y * as the number of positive tests predicted out of the predicted number of tests, K * , one may obtain their ratio, p * := Y * /K * , as the rate of positive tests predicted next, given the data. While an explicit expression for the posterior predictive distribution of p * is not readily available, it may be estimated, quite accurately, through Monte-Carlo simulations. Towards that end, we denote by Q * m (·) the posterior predictive cd f of p * , given the data D m . That is, for any t ∈ R,

and recall that the α th percentile (α ∈ (0, 1)), of this distribution, is defined as

Thus, when available, the interval, [t * 1−α , t * α ] serves as a 1 − 2α posterior prediction interval for p * given the data D m , so that

As was mentioned in the previous section, while explicit expression for Q * m (·), the posterior predictive cd f of p * , given the data D m , is not available, it may be estimated via Monte-Carlo simulations which exploit the explicit expression, in (9) , for the joint posterior predictive distribution of (Y * , K * ). Given the data, D m and with parameters a m , b m , c m and d m and with r as above, generate a random sample of a large size B (B = 5, 000, say) from Q * m (·) as follows:

1) Given N m , draw K * ∼ BetaNegBinom(r, c m , d m );

2) Given X m , N m and K * = k * , draw Y * ∼ BetaBinom(k * , a m , b m ) to obtain the pair (y * , k * );

3) Calculate the simulated predicted PPT as p * = y * /k * .

B times so as to form p * 1 , p * 2 , . . . , p * B as a random sample from Q * m (·).

Having obtained the random sample p * 1 , p * 2 , . . . , p * B , we estimate Q * m (·) by its empirical ver-sionQ *

Accordingly, and in similarity to (10), we estimate the α st percentile of Q * m (·) bŷ

A simple R script (see Appendix C) which utilizes the built-in functions rbnbinom() (for the BetaNegBinom(·) distribution) and rbbinom() (for the BetaBinom(·) distribution), of the extraDistr package produces the simulated sample from the (respective) posterior predictive distribution of (Y * , K * ) and the corresponding predicted PPT, p * = Y * /K * .

We continue with the same prior parameterization used in Illustration 1, of a = b = 1, c = d = 1 and r = 3. Recall that the given data D m , yields, m = 109, N m = 5.22946 × 10 5 and X m = 4.6907 × 10 4 . We first simulated B = 5000 sample values for the respective predictive distributions of (Y * , K * ) and of p * and used these simulated values to estimate the posterior predictive distribution of p * , which in turn, was used it to obtain the 95% posterior prediction interval, [0.07981, 0.1001] for the 'next-day' Indiana's PPT as was displayed in Figure 2 . The means and standard deviations of the (estimated) posterior predictive distributions of Y * , K * and p * , along with the corresponding 95% prediction interval are presented in Table 1 Figure 3 : The estimated (Monte-Carlo) posterior predictive distribution of Indiana PPT, p * , along with the marked (in blue) the 95% prediction interval Figure 3 above, displays the Monte-Carlo histogram of that predictive distribution, along with a nonparametric and a normal density (in red) approximations. Also marked are the corresponding bound for the 95% posterior prediction interval for p * . The Monte-Carlo marginal (posterior) distribution of K * is displayed in Figure 4 and that of Y * in Figure 5 . We conclude this illustration with Figure 6 , where we display the posterior prediction intervals for the PPT as were calculated for each report day in the series. That is, based on the given data on the n th day, D n , we calculated the 95% posterior predicted interval for the PPT on the (n + 1) th , day (the "next" day), for each n = 2, 3, . . . , m of the m = 109 days available in the data set. As can be seen, each of the daily calculated PPT (as in Figure 2 ), fell well within the corresponding prediction interval (marked in red in Figure 6 ) as was calculated based on the previous' days data, thus providing also a partial validation for the applicability of this Bayesian approach (with its underlying assumptions) to these COVID-19 count data. 

The Indiana Covid-19 data are available for retrieval through the ISDH data hub[10] as are reported for the state in the file covid report date.xlsx. It includes the daily records (as columns) on:

• DATE: Date of the event, it is equal to the investigation starting date for positive cases; the date of death for deaths; the coalesce of specimen date and report date for testings (if specimen collected date is unknown, use report date), respectively

• COVID TEST: Total number of testings (i.e. number of new people tested on the date. Indiana residents only)

• DAILY DELTA TESTS: The number of most recent (i.e latest report) new testings that are reported into the testing pool. The date of specimen collected is typically earlier than the report date

• DAILY BASE TESTS: The number of tests from the last report. Records might be removed due to information correctness

• COVID COUNT: Total number of positive cases (i.e. number of patients who started investigation for their positive report) on the date

• DAILY DELTA CASES: The number of most recent(i.e latest report) new positive cases that are reported into the positive case pool. The investigation starting date could be earlier than the report date due to necessary process

• DAILY BASE CASES: The number of positive cases from the last report. Records might be removed due to information correctness

• COVID DEATHS Total: number of deaths on the date

• DAILY DELTA DEATHS The number of most recent (i.e latest report) new death cases that are reported into the death case pool. The date of death could be earlier than the report date due to necessary process and confirmation

• DAILY BASE DEATHS: The number of deaths from the last report. Records might be removed due to information correctness

• COVID COUNT CUMSUM: The cumulative number of positive cases as of the report date

• COVID DEATHS CUMSUM: The cumulative number of deaths as of the report date

• COVID TEST CUMSUM: The cumulative number of tests as of the report date

A simple R script to obtain the Monte-Carlo sample from Q m (·), 

Bayesian Data Analysis Chapman & Hall/CRC Texts in Statistical Science

Estimating The Infection Fatality Rate Among Symptomatic COVID-19 Cases In The United States HEALTH AFFAIRS 39, NO. 7. Project HOPEThe People-to-People Health Foundation

Coronavirus COVID-19 Global Cases

COVID-19 positive cases, evidence on the time evolution of the epidemic or an indicator of local testing capabilities? A case study in the United States Available

Modelling death rates due to COVID-19: A Bayesian approach Available online at

Brazilian Modeling of COVID-19(BRAM-COD): a Bayesian Monte Carlo approach for COVID-19 spread in a limited data set context Available online at medRxiv preprint doi

CDC Guidelines-CDC Activities and Initiatives Supporting the COVID-19 Response and the Presidents Plan for Opening America Up Again -Centers for Disease Control and Prevention Available online as

Indiana COVID-19 Data Report-Indiana State Department of Health Available online