key: cord-0597743-uaqfmlzu authors: Hawryluk, Iwona; Mellan, Thomas A.; Hoeltgebaum, Henrique H.; Mishra, Swapnil; Schnekenberg, Ricardo P.; Whittaker, Charles; Zhu, Harrison; Gandy, Axel; Flaxman, Seth; Bhatt, Samir title: Inference of COVID-19 epidemiological distributions from Brazilian hospital data date: 2020-07-15 journal: nan DOI: nan sha: d610f641a6c0313714c169e55634250794b2e081 doc_id: 597743 cord_uid: uaqfmlzu Knowing COVID-19 epidemiological distributions, such as the time from patient admission to death, is directly relevant to effective primary and secondary care planning, and moreover, the mathematical modelling of the pandemic generally. Here we determine epidemiological distributions for patients hospitalised with COVID-19 using a large dataset (range $N=21{,}000-157{,}000$) from the Brazilian SIVEP-Gripe (Sistema de Informac{c}~ao de Vigil^ancia Epidemiol'ogica da Gripe) database. We fit a set of probability distribution functions and estimate a symptom-onset-to-death mean of $15.2$ days for Brazil, which is lower than earlier estimates of 17.8 days based on early Chinese data. A joint Bayesian subnational model is used to simultaneously describe the $26$ states and one federal district of Brazil, and shows significant variation in the mean of the symptom-onset-to-death time, with ranges between $11.2-17.8$ days across the different states. We find strong evidence in favour of specific probability distribution function choices: for example, the gamma distribution gives the best fit for onset-to-death and the generalised log-normal for onset-to-hospital-discharge. Our results show that epidemiological distributions have considerable geographical variation, and provide the first estimates of these distributions in a low and middle-income setting. At the subnational level, variation in COVID-19 outcome timings are found to be correlated with poverty, deprivation and segregation levels, and weaker correlation is observed for mean age, wealth and urbanicity. Surveillance of COVID-19 has progressed from initial reports on 31st-Dec-2019 of pneumonia with unknown etiology in Wuhan, China, 4 to the confirmation of 9, 826 cases of SARS-CoV-2 across 20 countries one month later, 5 to the current pandemic of greater than 12 million confirmed cases and 500, 000 deaths globally to date. 6 Early estimates of epidemiological distributions provided critical input that enabled modelling to identify the severity and infectiousness of the disease. The onsetto-death distribution, 7,8 characterising the range of times observed between the onset of first symptoms in a patient and their death, has for example proved crucial in early estimates of the Infection Fatality Ratio (IFR), 3 and was similarly integral to recent approaches to modelling the transmission dynamics of SARS-CoV-2. 1, [9] [10] [11] [12] [13] Initial estimates of COVID-19 epidemiological distributions necessarily relied on few data points, with the events comprising these distributions occurring a period of time that was short compared to the temporal pathologies of the disease progression, resulting in large credible intervals and a sensitivity to timeseries censoring effects. 3 Global surveillance of the disease over the past 197 days has provided more data to re-evaluate the time-delay distributions of the dis-ease. In particular, public availability of a large number of patient-level hospital records -currently over 390, 000 in total -from the SIVEP-Gripe (Sistema de Informao de Vigilncia Epidemiolgica da Gripe) database published by Brazil's Ministry of Health (MoH), 2 provides an opportunity to make robust statistical estimates of the onset-to-death and other time-delay distributions such as onset-to-diagnosis, length of ICU stay, onset-tohospital-admission, onset-to-hospital-discharge, onsetto-ICU-admission, and hospital-admission-to-death. In this work we fit and present an analysis of these epidemiological distributions, with the paper set out as follows. Section II describes the data used from the SIVEP-Gripe database, 2 and the methodological approach applied to fit distributions using a hierarchical Bayesian model. Section III provides a description of the results from this study from fitting epidemiological distributions at national and subnational level to a range of probability distribution functions. The results are discussed in Section IV, including associations with socioeconomic factors, such as education, segregation, and poverty, and conclusions are given in Section V. The SIVEP-Gripe database provides detailed patientlevel records for all individuals hospitalised with severe acute respiratory syndrome, including all suspected or confirmed cases of severe COVID-19. 2 The records include the date of admission, date of onset of symptoms, state where the patient lives, state where they are being treated, and date of outcome (death or discharge), among other diagnosis related variables. We extracted the data for confirmed COVID-19 records starting on 25th February and considered records in our analysis ending on 7th July. The dataset was filtered to obtain rows for length of ICU stay, onset-to-death, onset-to-diagnosis, onset-toadmission, onset-to-ICU, admission-to-death and onsetto-hospital-discharge distributions. Onset-to-diagnosis was split into the diagnosis confirmed by PCR and those confirmed by other methods, such as rapid antibody and antigen tests, called non-PCR throughout this manuscript. Entries resulting in distribution times greater than 133 days were considered a typing error and removed, as the first recorded COVID-19 case in Brazil was on 25th February. 14 Additional filtering to the data was applied for onset-to-ICU-admission, onset-to-hospital-admission and onset-to-death in order to eliminate bias introduced by potentially erroneous entries identified in the data for these distributions. We removed the rows where admission to the hospital or ICU or death happened on the same day as onset of symptoms, assuming that these were actually incorrectly inputted entries. The decision to test removing the first day is motivated firstly by the observation of a number of conspicuous data entry errors in the database, and secondly by anomalous spikes corresponding to same-day events observed in these distributions. An example of the anomalous spikes in the onset-to-death distribution is shown in Appendix Figure 4 for selected states. Sensitivity analyses on data inclusion, regarding the removal of anomalous spikes in first-day data indicative of reporting errors (e.g. in onset to hospital admission), and regarding the sensitivity of the dataset to time-series censoring effects, are set out in the Results section III C. A summary of the data, including number and a range of samples per variable from the SIVEP-Gripe dataset is given in Table I . A breakdown of the number of data samples per state is provided in Appendix Table IX. Basic exploratory analysis to explain geographic variation observed in time-delay distributions adopts GeoSES (ndice Socioeconmico do Contexto Geogrfico para Estudos em Sade), 15 which measures Brazilian socioeconomic characteristics through an index composed of education, mobility, poverty, wealth, deprivation, and segregation. We investigate correlations between the GeoSES indicators and the time-delay means that we estimate at the state level. Additionally, we consider correlations with the mean age of the population of the state and the percentage of people living in urban areas, data we obtained from Instituto Brasileiro de Geografia e Estatstica (IBGE ). 16 Gamma, Weibull, log-normal, generalised lognormal, 17 and generalised gamma 18 probability distribution functions (PDFs) are fitted to several epidemiological distributions, with the specific parameterisations provided in the Appendix Section IX A. The parameters of each distribution are fitted in a joint Bayesian hierarchical model, using data from the 26 states and one federal district of Brazil, extracted and filtered to identify specific epidemiological distributions such onset-to-death, ICU-stay, and so on. As an example consider fitting a gamma PDF for the onset-to-death distribution. The gamma distribution for the i th state is given by where shape and scale parameters are assumed to be positively constrained, normally distributed random variables and The national level parameters α Brazil and β Brazil denote the national level estimates, and where N + (·) is a truncated normal. In this case, parameters α Brazil and β Brazil are estimated by fitting a gamma distribution to the fully pooled data, that is including the observation for all states. Prior probabilities for the national level parameters for each of the considered PDFs are chosen to be N + (0, 1), except for the generalised gamma distribution where we used: µ Brazil ∼ N + (2, 0.5), σ Brazil ∼ N + (0.5, 0.5) and s Brazil ∼ N + (1.5, 0.5). Posterior samples of the parameters in the model are generated using Hamiltonian Monte Carlo (HMC) with Stan. 19, 20 For each fit we use 4 chains and 2, 000 iterations, with half of the iterations dedicated to warm-up. The preference for one fitted model over another is characterised in terms of the Bayesian support, with the model evidence calculated to see how well a given model fits the data, and comparison between two models using Bayes Factors. The details of how to estimate the model evidence and calculate the Bayes Factors for each pair of models are given in the Appendix Section IX A. Five trial probability distribution functions (PDFs) -Weibull, gamma, log-normal, generalised log-normal and generalised gamma -were fitted to the epidemiological data shown in Figure 1 . All of the models' fits were tested by using the Bayes Factors based on the Laplace approximation and corrected using thermodynamic integration, 21-23 as described in the Appendix IX A. The thermodynamic integration contribution was negligible suggesting the posterior distributions are satisfactorily approximated as multivariate normal. The conclusions on the preferred PDF were not sensitive to the choice of priors, that is the preferred model was still the favoured one even when more informative priors were applied for all PDFs. The Bayes Factors used for model selection are shown in the Appendix, Table V . The gamma distribution provided the best fit to the onset-to-death, ICU-stay and hospital-admissionto-death data. For the remaining distributionsonset-to-diagnosis (non-PCR), onset-to-diagnosis (PCR), onset-to-hospital-discharge, onset-to-hospital-admission and onset-to-ICU-admission -the generalised log-normal distribution was the preferred model. The list of preferred PDFs for each distribution, together with the estimated mean, variance and PDFs' parameter values for the national fit are given in Table II . The 95% credible intervals (CI) for parameters of each of the preferred PDFs was less than 0.1 wide, therefore in the Table II we show only point estimates. Additionally, in Figure 1 , in each instance the cumulative probability distribution is given for the best model fit, revealing that out of patients for whom COVID-19 is terminal, almost 70% die within 20 days of symptom onset. Out of patients who die in the hospital, almost 60% die within the first 10 days since admission. The estimated mean number of days for each distribution for Brazil is compared in Table III with values found in the literature for China, US and France. The majority of the data obtained through searching the literature pertained to the early stages of the epidemic in China, and no data was found for low-and middle-income countries. The mean onset-to-death time of 15.2 (95% CI 15.1 − 15.3) days, from a best-fitting gamma PDF, is lower than the 17.8 (95% CI 16.919.2) days estimate, 3 and 20.2 (95% CI 15.1 − 29.5) days estimate (14.5 days without truncation) from Linton et al. 13 In both cases, estimates were based on a small sample size from the beginning of the epidemic in China. The mean number of days for hospital-admission-to-death for Brazil of 10.8 (95% CI 10.7 − 10.9) matches closely the 10 days estimated by Salje et al. 24 The onset-to-death distribution, and other time-delay distributions such as onset-to-diagnosis, length of ICU stay, onset-to-hospital-admission, onset-to-hospitaldischarge, onset-to-ICU-admission, and hospitaladmission-to-death, have been fitted in a joint model across the 26 states and one federal district of Brazil. The mean number of days, plotted in Figure 2 , shows substantial subnational variability -e.g. the mean onset-to-hospital-admission for Amazonas state was estimated to be 9.9 days (95% CI 9.7-10.1), whereas in Mato Grosso do Sul 6.7 (95% CI 6.4-7.1) days and Rio de Janeiro -7.2 days (95% CI 7.1-7.3). Amazonas state had the longest average time from onset to hospital and ICU admission. The state with the shortest average onset to death time was Acre. Santa Catarina state on the other hand had a longest average onset-to-death and hospital-admission-to-death time, as well as longest average ICU stay. For a visualisation of the uncertainty in our mean estimates for each state, see the posterior density plots in Appendix Figures 5 and 6. Additional national and state-level results for the onset-to-death gamma PDF, including the posterior plots for mean and variance, are shown in Figure 7 in the Appendix. We also observe discrepancies between the five geographical regions of Brazil, for example states belonging to the southern part of the country (Paran, Rio Grande do Sul and Santa Catarina) had a higher average ICU stay and hospital admission to death time as compared to the states in the North region. Full results, including detailed estimates of mean, variance, and estimates for each of the distribution's parameters for Brazil and Brazilian states can be accessed at https://github. com/mrc-ide/Brazil_COVID19_distributions/blob/ master/results/results_full_table.csv. In order to remove the potential bias towards shorter outcomes from left-and right-censoring, we tested the scenario in which the data to fit the models was truncated. For example, based on a 95% quartile of 35 days for the hospital-admission-to-death distribution, entries with the starting date (hospital admission) after 2nd June 2020 and those with an end-date (death) before 1st April were truncated, and the models were refitted. With censored parts of the data removed, the mean time from . Histograms for ICU-stay, onset-to-diagnosis (PCR), onset-to-diagnosis (non-PCR), onset-to-ICU-admission, onsetto-hospital-admission, onset-to-hospital-discharge and hospital-admission-to-death distributions show data for Brazil extracted from the SIVEP-Gripe database. 2 For each distribution solid lines are for fitted PDFs and the dashed line shows the cumulative distribution function of the best-fitting PDF. The left hand side y-axis gives the probability for the PDFs and the right hand side y-axis shows the cumulative probability. All values on the x-axes are given in days. State-level fits are shown in Figure 2 and Appendix Figures 5 and 6. start to outcome increased for every distribution, e.g. for hospital-admission-to-death it increased from 10.0 (95% CI 9.9-10.0) to 10.8 (95 % CI 10.7-10.9), and for onsetto-death it changed from 15.2 days (95% CI 15.1-15.3) to 16.0 days (95% CI 15.9-16.1). The effect truncation on censored data is given in Appendix Figure 8 . To test the impact of keeping or removing entries identified as potentially resulting from erroneous data transcription (see the Methods Section II), we fitted the PDFs to some of the distributions on a national level with and without those rows. For onset-to-hospital-admission, onset-to-ICU and onset-to-death we find that generalised gamma PDF was preferred when the first day of the distribution was included, and gamma (for onset-to-death) and generalised log-normal PDFs if the first day was removed. For hospital-admission-to-death, a gamma distribution fitted most accurately when the first day was included, and Weibull when it was excluded. The effect of removing the first day results in means shifting to the right by approximately 1 day for both onset-tohospital and ICU admission, and by 0.5 days for hospitaladmission-to-death (see Appendix Figure 8 ). Sensitivity analysis regarding the model selection approach is detailed in Appendix IX A. We have fitted multiple probability distribution functions to a number of epidemiological datasets, such as onset-to-death or onset-to-diagnosis, from the Brazilian SIVEP Gripe database, 2 using Bayesian hierarchical models. Our findings provide the first reliable estimates of the various epidemiological distributions for the COVID-19 epidemic in Brazil and highlight a need to consider a wider set of specific parametric distributions. Instead of relying on the ubiquitous gamma or log-normal distributions, we show that often these PDFs do not best capture the behaviour of the data -for instance, the generalised log-normal is preferable for several of the epidemiological distributions in Table II . These results can inform modelling of the epidemic in Brazil, 30 and other low-middle-income countries, 31 but we expect they also have some relevance more generally. In terms of modelling the epidemic in Brazil, the variation observed at subnational level -see Figure 2 -can be shown to be important to accurately estimating disease progression. Making use of the state-level custom-fitted onset-to-death distributions reported here, we have estimated the number of active infections on 23rd June 2020 across ten states spanning the five regions of Brazil, using a Bayesian hierarchical renewal-type model. 1, 32, 33 The relative change in the number of active infections from modelling the cases using heterogeneous state-specific onset-to-death distributions, compared to using a single common Brazil one is shown in Figure 3 to be quite substantial. The relative changes observed, up to 18% more active infections, suggest assumptions of onset-to-death homogeneity are unreliable and closer attention needs to be paid when fitting COVID-19 models in large countries. On the origin of the geographic variation displayed in Figure 2 for the average distribution times across states, there are multiple potential factors that could generate the observed variability and in this work we present an elementary exploratory analysis. We examine the correlation between socioeconomic factors, such as educa-tion, poverty, income, etc., using a number of socioeconomic state-level indicators obtained from Barrozo et al.(2020) 15 and additional datasets containing the mean age per state and percentage of people living in the urban areas (urbanicity). 16 The Pearson correlation coefficients, shown in Table VII , suggest that segregation, poverty and deprivation elements were most strongly correlated with the analysed onset-time datasets. E.g. poverty was strongly negatively correlated with hospital-admissionto-death (-0.68), whereas income and segregation had a high positive correlation coefficient for the same distribution (+0.60, +0.62 respectively). The strongest correlation was observed for hospital-admission-to-death and deprivation indicator, which measure access to sanitation, electricity and other material and non-material goods. 15 Interestingly, the indicators measuring economical situation were more correlated with average hospitalisation times than mean age per state, which suggests that although the low-and middle-income countries typically have younger populations, they are more likely to observe incapability of the healthcare system to deal with COVID-19 epidemic. More detailed analysis is necessary to fully appreciate the impact of the economic components on the COVID-19 epidemic response. In the work presented we acknowledge numerous limitations. The database from which distributions have been extracted, though extensive, contains transcription errors, and the degree to which these bias our estimates is largely unknown. Secondly, the PDFs fitted are based on observational hospital data, and therefore should be cautiously interpreted for other settings. Thirdly, though we have fitted PDFs at subnational as well as national level, this partition is largely arbitrary and further work is required to understand the likely substantial effect of age, sex, ethnic variation, 34 co-morbidities, and other factors. Mean Figure 3 . This figure shows the percentage change in active infections, estimated on the 23rd-Jun-2020, that results from using state-specific onset-to-death distributions (see Appendix Table VI ) compared to a single national-level one. The effect for each state is coloured according to the mean of the state's onset-to-death gamma distribution. The mean onsetto-death for Brazil is 15.2 days. We provide the first estimates of common epidemiological distributions for the COVID-19 epidemic in Brazil, based on the SIVEP-Gripe hospitalisation data. 2 Extensive heterogeneity in the distributions between different states is reported. Quantifying the time-delay for COVID-19 onset and hospitalisation data provides useful input parameters for many COVID-19 epidemiological models, especially those modelling the healthcare response in low-and middle-income countries. We thank Microsoft for providing Azure credits which were used to run the analysis. IH was funded by Imperial College MRC Centre, as part of the MRes in Biomedical Research, Epidemiology, Evolution and Control of Infectious Disease programme. Python, R and Stan code used to analyse the data and fit the distribution is available at https://github. com/mrc-ide/Brazil_COVID19_distributions, along with estimated parameters for each state and probability distribution functions considered at https://github. com/mrc-ide/Brazil_COVID19_distributions/blob/ master/results/results_full_table.csv. The SIVEP-Gripe database, 2 is available to download from Brazil Ministry of Health website https:// opendatasus.saude.gov.br/dataset/bd-srag-2020. To characterise which model (gamma, log-normal, etc.) best fits the data, the Bayesian model evidence z = z(y|M i ) is evaluated. Here and throughout this section y denotes the data and M i denotes the i th model from the analysed model set. As determining the model evidence requires an integral over the model parameters (θ) which is generally intractable, we approximate it with z 0 = z 0 (y|M i ), which is based on a second-order Laplace approximation, 35 q 0 = q 0 (M i , θ|y), to the true un-normalised posterior density q = q(M i , θ|y). The second-order approximated density is estimated as: Here q(θ) denotes the value of the un-normalised posterior evaluated using the mean estimates of the model's parametersθ, and Σ the covariance matrix built from MCMC samples of the posterior. From this expression, a second-order approximation to the model evidence, z 0 , is given by z 0 = q(θ) det(2πΣ −1 ), where det(·) denotes the determinant of the matrix. For each model pair, Bayes factors were computed from the marginal likelihoods. Considering two models M i and M j , the Bayes Factor (BF) is where z(y|M i ) is the evidence of model M i given y. If B ij > 1, the evidence is in favour of model M i . Here, for readability we will report the Bayes Factors as 2 log(B ij ) following Kass and Raftery notation. 36 The sensitivity of our model evidence is tested with respect to the choice of hyperprior distribution, and secondly with respect to the use of the approximate secondorder density q 0 . In the latter instance this is done by performing thermodynamic integration 21-23 between q 0 and the true density q in order to obtain an asymptotically exact estimate of the marginal model evidence, The right hand term corrects the z 0 approximation to the exact Bayesian evidence by a path integral evaluated with respect to a sampling distribution that interpolates between the two densities as q(θ; λ) = q (1−λ) q λ 0 in terms of the auxiliary coordinate λ. Table IV . Probability distribution functions (PDFs) with analytical formulae for mean and variance. y denotes the data, Γ(·) is a gamma function. GG -generalised gamma, GLN -generalised log-normal. Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe SRAG 2020 -Banco de Dados de Sndrome Respiratria Aguda Grave Estimates of the severity of COVID-19 disease World Health Organisation, WHO, Coronavirus disease 2019 (COVID-19) Situation Report 1 World Health Organisation, WHO, Coronavirus disease 2019 (COVID-19) Situation Report 11 World Health Organisation, WHO, Coronavirus disease 2019 (COVID-19) Epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in Hong Kong Assessing the severity of the novel influenza A/H1N1 pandemic First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and secondwave scenario planning: a modelling impact assessment Brazilian Modeling of COVID-19 (bram-cod): a Bayesian Monte Carlo approach for COVID-19 spread in a limited data set context Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China Inferring the number of COVID-19 cases from recently reported deaths Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data COVID-19 in Brazil: "so what? GeoSES: A socioeconomic index for health and social research in Brazil Brazilian Institute of Geography and Statistics, Ibge Projees da Populao A generalized log-normal distribution and its goodness of fit to censored data A generalization of the gamma distribution Stan: A probabilistic programming language The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo Simulating normalising constants using referenced themodynamic integration Simulating ratios of normalizing constants via a simple identity: a theoretical exploration Simulating normalizing constants: From importance sampling to bridge sampling to path sampling Transmission characteristics of the COVID-19 outbreak in China: a study driven by data Temporal estimates of case-fatality rate for COVID-19 outbreaks in Canada and the United States Clinical characteristics of 113 deceased patients with coronavirus disease 2019: retrospective study Estimating the burden of SARS-CoV-2 in France Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study The impact of COVID-19 and strategies for mitigation and suppression Estimating COVID-19 cases and reproduction number in Brazil, medRxiv On the derivation of the renewal equation from an agedependent branching process: an epidemic modelling perspective Ethnic and regional variations in hospital mortality from COVID-19 in Brazil: a cross-sectional observational study Accurate approximations for posterior moments and marginal densities Rondnia (RO), Roraima (RR) 898 793 124 3174 1952 1168 773 3490 RJ 1490 9750 1446 18019 7438 7165 9068 21159 RN 337 876 544 1878 664 693 821 1517 RO 180 254 293 554 180 284 238 488 RR 53 270 92 98 51 56 265 200 RS 971 790 477 3565 2328 1277 770 4144 SC 291 408 343 1600 777 599 389 1634 SE 193 303 117 938 181 306 295 1116 SP 8515 16348 4769 55735 32937 17642 15808 63184 TO 44 213 53 515 213 87 177 549