key: cord-0715984-23nz6dys
authors: Hebling Vieira, B.; Hiar, N. H.; Cardoso, G. C.
title: Uncertainty reduction in logistic regressions: a COVID-19 case-study using surrogate locations' asymptotic values
date: 2020-12-16
journal: nan
DOI: 10.1101/2020.12.14.20248184
sha: 39d9c79d4e1e50747f0850cc54e1482e83b71eaf
doc_id: 715984
cord_uid: 23nz6dys

Logistic regressions are subject to high uncertainty when the data are not past the inflection point. For example, for logistic regressions estimated with data up to or before the inflection point, uncertainties in the upper asymptotic value $K$ can be of the same order of magnitude as the population under analysis. This paper presents a method for uncertainty reduction in logistic regression using data from a surrogate logistic process. We illustrate the procedure using Richards' growth function (Generalized Logistic Function, GLF) to make predictions for COVID-19 evolution in Brazilian cities at stages before and during their epidemic inflection points. We constrain the logistic function regression with $K$ calculated from selected surrogate international cities where the epidemic is clearly past its inflection point. Information gained with this constraint stabilizes the logistic regression, reducing the uncertainty in the curves' parameters, including the rate of growth at the inflection point. The uncertainty is reduced even when the actual surrogate $K$ is used just as an anchor to simulate different epidemic scenarios. Results predicted for COVID-19 trajectories within Brazil agree with actual data. These results suggest that, in the absence of big data, a simple logistic regression may provide low uncertainty if surrogate cities have been identified for estimates of $K$, even if the specifics of the evolution in the surrogate cities are different. The method may be used for other logistic models and for logistic processes in other areas, such as economics and biology, if surrogate processes can be identified.

ulation growth. In epidemics, K is the maximum number of accumulated cases expected for the population under study. It depends on the pathogen transmission mechanism, the social network structure and the underlying contagion dynamics [1]. The key benefit of logistic models is their ability to make predictions with little insight into the underlying specifics of the process, as is often the case in real epidemic surges.

In epidemic problems, the characterization of the logistic model around its inflection point enables estimation of its time derivative. Its peak, on the other hand, informs the time of maximum epidemic growth and the corresponding maximum rate of new cases per day. Logistic models accommodate different choices for the logistic function, and Verhulst's is the most traditional one. It has been used to predict the evolution of COVID-19 cases, C(t), in countries, by finding the number of cases $C(t_{\mathrm{critical}}) = K/2$ at its inflection point, which is fed into a machine learning algorithm to predict further evolution of the epidemic curve [13]. The procedure works as long as the inflection point has been reached in the data. Likewise, the GLF, also known as Richards' growth model, was used to successfully predict the evolution of COVID-19 in various localities, as long as the inflection point has been identified [7, 12].

However, logistic models overfit the data if regression is done before its [...] because one of the reasons for modeling is to estimate both the time and the number of cases at the peak of the epidemic, either for preparedness or for the study of different strategies to deal with the disease.
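As a worked illustration of why the inflection point carries this information, consider Verhulst's function in one common parameterization. The symbols $r$ (growth rate) and $t_c$ (inflection time) are assumptions introduced here for illustration and are not taken from the text above.

```latex
\begin{align}
C(t) &= \frac{K}{1 + e^{-r (t - t_c)}}, \\
\frac{dC}{dt} &= r\, C(t) \left(1 - \frac{C(t)}{K}\right), \\
C(t_c) &= \frac{K}{2}, \qquad
\left.\frac{dC}{dt}\right|_{t = t_c} = \frac{rK}{4}.
\end{align}
```

Under this form the derivative is largest where $C = K/2$, and the peak daily rate $rK/4$ scales linearly with $K$, so an error in $K$ carries over directly to the predicted peak.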

Here, we use independent estimates of the logistic model's upper asymptote K to constrain the GLF, allowing for stable non-linear regression on data series that end before or around the inflection point. We use surrogate cities from the United States, Italy, the United Kingdom, and Spain, where the inflection point of the first COVID-19 wave has been identified, to estimate plausible values for K, which we use to constrain the GLF to predict scenarios in Brazilian cities. We need at least one surrogate city as an anchoring point.
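The idea of holding K fixed during the regression can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Richards parameterization in `glf`, the helper name `fit_constrained`, the parameter bounds, and the synthetic series are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def glf(t, K, r, alpha, t_c):
    """Richards' function in one common parameterization (assumed form):
    K / (1 + alpha * exp(-alpha * r * (t - t_c))) ** (1 / alpha),
    evaluated in log space to avoid overflow for extreme parameters."""
    t = np.asarray(t, dtype=float)
    log_den = np.logaddexp(0.0, np.log(alpha) - alpha * r * (t - t_c))
    return K * np.exp(-log_den / alpha)

def fit_constrained(t, y, K_fixed, p0=(0.1, 1.0, 50.0)):
    """Least-squares fit of (r, alpha, t_c) with K held at a surrogate value."""
    model = lambda t, r, alpha, t_c: glf(t, K_fixed, r, alpha, t_c)
    bounds = ([1e-3, 0.05, 0.0], [1.0, 5.0, 300.0])  # illustrative bounds
    return curve_fit(model, t, y, p0=p0, bounds=bounds, max_nfev=5000)

# Synthetic cumulative series that stops before the inflection point (t_c = 60).
true = dict(K=1000.0, r=0.08, alpha=0.5, t_c=60.0)
t = np.arange(0, 50, dtype=float)
rng = np.random.default_rng(0)
y = glf(t, **true) * (1.0 + 0.02 * rng.standard_normal(t.size))

popt, pcov = fit_constrained(t, y, K_fixed=true["K"])
perr = np.sqrt(np.diag(pcov))  # approximate standard errors of (r, alpha, t_c)
print("r, alpha, t_c:", popt)
print("std. errors  :", perr)
```

Rerunning the fit with several plausible values of `K_fixed` gives one curve per scenario, which is how a surrogate K can act as an anchor rather than as a single point prediction.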

The choice of the GLF is for convenience, since it has been shown in the literature to work well for [...]. To increase the reliability of the data used, we use the accumulated fatalities count curve D(t), and estimate C(t) using the infection fatality rate (IFR). Our predictions use least squares curve-fitting of the constrained GLF on the cumulative fatalities curve data [...] countries for any process that obeys logistic growth, such as epidemics. We have used the GLF to model the evolution of fatalities [8, 7]:

where K is the asymptotic value, r is the initial growth rate, α depends on [...]

All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted December 16, 2020. https://doi.org/10.1101/2020.12.14.20248184

To stabilize the logistic regressions of Equation 1 on data from cities of interest where the inflection point has not yet been reached, we will use K values derived from surrogate cities. The asymptote K for a city of interest could, in principle, be estimated from an appropriate epidemic model. For a system that obeys the Susceptible-Infected-Recovered (SIR) model, the relationship between $R_0$ and K is given by [16].
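For orientation, the textbook SIR final-size relation for a fully susceptible population is z = 1 - exp(-R_0 z), where z is the fraction of the population eventually infected. The sketch below solves it numerically. It assumes that K is expressed as a fraction of the population and that this textbook form applies; the expression actually taken from [16] may differ. The function name `final_size` is hypothetical.

```python
import math
from scipy.optimize import brentq

def final_size(R0):
    """Attack fraction z solving z = 1 - exp(-R0 * z) (textbook SIR form)."""
    if R0 <= 1.0:
        return 0.0  # below the epidemic threshold only the trivial root exists
    f = lambda z: z + math.expm1(-R0 * z)  # equals z - 1 + exp(-R0 * z)
    # f is negative at (R0 - 1) / R0**2 and positive at 1, so the
    # nontrivial root is bracketed between these two points.
    lo = (R0 - 1.0) / R0 ** 2
    return brentq(f, lo, 1.0)

for R0 in (1.2, 1.5, 2.0, 3.0):
    print(f"R0 = {R0:.1f}  ->  K / N = {final_size(R0):.3f}")
```

Because the root is very sensitive to R_0 near the threshold, small errors in R_0 give large errors in K, which is one reason an empirical K from a surrogate city can be a useful alternative.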

(https://www.ibge.gov.br/cidades-e-estados; https://bigdata-api.

The uncertainty of $\Delta Y_{\mathrm{peak}}$ can be approximated by the variance formula, resulting in Equation 4.

where the uncertainty $\sigma^2_Y(x)$ can be obtained by the formula for prediction intervals, using the conventional Delta method for uncertainty propagation.
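A first-order Delta method propagates the parameter covariance of a fit to the fitted curve: the variance at x is approximately J(x) Σ J(x)^T, where J is the gradient of the model with respect to its parameters and Σ is the parameter covariance. For a prediction interval on a new observation, the residual variance is added. The sketch below is a generic illustration under these assumptions; the function name `delta_method_variance` and the finite-difference gradient are choices made here, not details from the paper.

```python
import numpy as np

def delta_method_variance(model, x, popt, pcov, resid_var=0.0, eps=1e-6):
    """First-order (Delta method) variance of model(x, *popt).

    J holds the gradient of the model with respect to each parameter,
    estimated by central finite differences. The returned array is
    diag(J @ pcov @ J.T) + resid_var, one value per element of x.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    popt = np.asarray(popt, dtype=float)
    pcov = np.asarray(pcov, dtype=float)
    J = np.empty((x.size, popt.size))
    for j in range(popt.size):
        step = np.zeros_like(popt)
        step[j] = eps * max(1.0, abs(popt[j]))
        upper = model(x, *(popt + step))
        lower = model(x, *(popt - step))
        J[:, j] = (upper - lower) / (2.0 * step[j])
    return np.einsum("ij,jk,ik->i", J, pcov, J) + resid_var

# Usage with a straight line, where the gradient is known to be [1, x].
line = lambda x, a, b: a + b * x
cov = np.array([[1.0, 0.0], [0.0, 4.0]])
var = delta_method_variance(line, [0.0, 2.0], (1.0, 2.0), cov)
print(var)  # variance of the fitted line at x = 0 and x = 2
```

The same function can take the `popt` and `pcov` returned by a least-squares fit of a logistic curve; the approximation is only as good as the local linearization, which tends to be poor when the fit is ill-conditioned.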

Results for K values for surrogate cities are shown in Table 1 for our data,

Here we discuss the results of using the GLF to predict the evolution of cases [...] carrying capacity K, which are estimated from surrogate cities where the epidemic is beyond the inflection point. This alternative approach seeks to mitigate the over-fitting problem at the expense of increasing the uncertainty about the actual scenario that will develop in the city being forecast. Nevertheless, this strategy substantially decreases the range of predicted peak weeks and predicted peak number of cases per day, and decreases their statistical uncertainties compared to unconstrained fits.

Using surrogate values for K, we performed GLF regression on Brazilian cities' fatalities curves from the onset of COVID-19 until June 21, 2020. Comparison with the actual observed curve by October 2020 shows that the method gives ranges of expectation for the inflection point up to 100 times narrower than when K is left as a free parameter.

Here we will discuss the major results. In surrogate cities, where the [...] (July 2, 14 fatalities/day) [19], that is compatible with K 12%. Notice

The GLF regression for the city of São Paulo shows K = 9.2% when K is one of the free parameters (Figure 3f). Yet, the uncertainty in the inflection [...] Paulo. The peak number of fatalities found is also compatible with 123 deaths/day for the 7-day moving average reported in the official data [19]. We can also observe in Figure 3f worse-case scenarios, where K would grow to 25%. In such a scenario the epidemic peak is predicted to happen a month later but with at least 50% more casualties; this could happen if the population that followed an initial trend relaxes its prophylactic measures.

It is important to have in mind the possibility of overloading the [...] using generalized exponentials [20, 21].

In conclusion, the proposed method gives a complementary prescription to the traditional growth prediction methods. It can apply to any early logistic process for which there is an earlier surrogate process for estimation of the carrying capacity. This strategy overcomes the intrinsic inability of logistic models to predict the inflection point. It can be used not only for epidemics but also for commerce, economics, and viral information dissemination in a population. Our method lowers the uncertainty in the prediction for optimistic and pessimistic epidemic scenarios without the need for sophisticated models or big-data resources, and gives decision-makers anchor points specific to their situation.

Data and code availability statement

Code and data are available online. See: https://github.com/bhvieira/CovidRichards/

[3] Hébert-Dufresne L, Althouse BM, Scarpino SV, Allard A. Beyond R0:

Epidemic processes in complex networks

2: Data from representative Brazilian cities. Population and fatalities statistics for Brazil

We thank Professor Alexandre S. Martinez for critical reading of the manuscript.