title: COVID-19 epidemic outcome predictions based on logistic fitting and estimation of its reliability
authors: Tátrai, Dávid; Várallyay, Zoltán
date: 2020-03-31

Since the first outbreak of the COVID-19 epidemic at the end of 2019, data on the number of infections, deaths and recoveries have been made available for all countries of the world, and these data can be used for statistical analysis. The primary interest of this paper is how well the logistic equation can predict the outcome of the COVID-19 epidemic in any region of the world, assuming that the testing methodology (that is, the data collection method) and social behavior do not change over time. Besides its social relevance, this study has two scientific purposes: we investigate whether a simple saturation model can describe the trend of the COVID-19 epidemic and, if so, we determine from which point during the epidemic the fitting parameters provide reliable predictions. We also give estimates for the outcome of this epidemic in several countries based on the logistic model and the data available on 27 March 2020. Based on the saturated cases in China, we have found some criteria to judge the reliability of the predictions.

The outbreak of COVID-19 is the third time in the past two decades that a zoonotic coronavirus has crossed species to infect humans, after SARS-CoV and MERS-CoV, both of which have a rather high lethality rate [1]. This coronavirus first appeared and became an epidemic in Wuhan, Hubei, China, and quickly spread over the world with an initially estimated reproductive number of 2.2 [2]. From a data analysis point of view, the speed of spreading is encoded in the available data points even if the epidemic is still ongoing and has not yet started to saturate.
Of course, at the very beginning of the process the accuracy of the fitting and of the predictions can be questioned. For an epidemic that has ended, the curve of the cumulative number of infected cases mostly follows logistic growth, as shown in Figure 1 for the Henan region, China. The outbreak starts with an exponentially increasing regime. This emerging segment is followed by an inflection point, after which the number of daily infections starts to decrease. Finally, the cumulative number of cases saturates, and no additional infections appear at the end of the process.

The idea of using this approach to describe and predict epidemic growth comes from Ref. [3], where the number of deaths during the epidemic in Italy is analyzed in this way. We similarly use the solution of the logistic equation, but in a different form (on a different mathematical basis) than in Ref. [3], to find the optimum parameters for the investigated dataset and to calculate the expected duration and the expected maximum number of infections. In contrast to Ref. [3], we apply it to the number of registered infections instead of the number of deaths. Similar analyses using a logistic fit have recently been used in many other works to predict the outcome of COVID-19 in selected regions [4] [5] [6] [7] [8] [9] [10]. We intend to extend this method to as many countries as possible and to describe the error of any prediction made with the logistic model; for that, we introduce a variable that measures the reliability of the logistic fit. This parameter is evaluated from the latest data point and the maximum number of cases given by the fit.

In this study, we analyze the trends of the spread of the COVID-19 epidemic by applying a nonlinear least-squares fitting method to obtain the best fit of the logistic curve to the available data.
Our main purpose is to predict the evolution of the epidemic in several locations, but we note again that this work assumes that the testing methodology does not change over time and that social behavior is also constant. For this analysis, we have used a publicly available database on the number of infected people by location [11].

The logistic growth model, or Verhulst model, named after the Belgian mathematician Pierre Verhulst, is used to describe biological systems in connection with population growth under various restrictions (limited resources for growth). We use the solution of the logistic growth equation [12] and fit its parameters to the investigated dataset. The solution can also be given, among others, in the following form [13]:

$$N(t) = N_{\max} + \frac{N_0 - N_{\max}}{1 + (t/t_0)^{p}} \qquad (1)$$

where $N$ is the number of infectious cases in this particular problem, $t$ is the time, $N_{\max}$ is the cumulative number of infections at the end of the process, sometimes referred to as the carrying capacity of the population, $N_0$ is the initial number of cases, which we take as 0 in every analyzed dataset, $t_0$ is the center of the curve and $p$ is the power parameter. Three of these variables have to be fitted to find an optimum fit to a particular dataset: $N_{\max}$, $t_0$ and $p$, which is the minimum number of parameters required for a saturation model.

The nonlinear fitting method we use to obtain the best-fit parameters is the Levenberg-Marquardt (LM) method, a nonlinear least-squares fitting method [14, 15]. The results presented here were calculated with custom-made code implemented in LabVIEW, calling its LM fit routine.

Learning from the Chinese epidemic processes, we introduce some additional quantities that we investigate throughout this paper. First, the cumulative number of registered infections (CNRI), which is the number of positively tested cases since the outbreak, namely the measured value of $N(t)$ at time $t$.
Second, the final value of the cumulative number of registered infections (FVCNRI), which is known only for saturated epidemics, like the ones in China. Ideally, this would be equal to the fitted final-value parameter $N_{\max}$ of the model if the fit were excellent. Third, the estimated final value of the cumulative number of registered infections (EFVCNRI), which is the value obtained numerically by fitting to CNRI, namely $N_{\max}$ in Eq. (1).

We introduce the following ratio, which we have found to be an essential parameter during the reliability investigations:

$$R = \frac{N_{\max}}{N(t_{\mathrm{last}})} \qquad (2)$$

where $N_{\max}$ is EFVCNRI and $N(t_{\mathrm{last}})$ is CNRI at the latest measured data point we have. This R value tells us where we are now in the logistic process, provided the fitted curve is well established. Using ended epidemics from different regions of China, we have found that the error of the logistic fit depends on which part of the logistic curve has been reached. The R value indicates the stability and reliability of the fit, and we have found that the prediction becomes reliable if R<3. We discuss this in detail in the "Reliability of fit parameters" section.

We have fitted the logistic model for all countries and regions. Based on the obtained parameters, we have calculated the estimated date for 50% infections ($N_{\max}/2$) and the estimated date for 95% infections ($0.95 \cdot N_{\max}$). For these, only the obtained parameters and the model description were used. The results, including the R parameter, are presented in Table 1 for all countries. As an example of a non-saturated epidemic process, the data for Luxembourg are shown in Figure 2. The logistic model fits the dataset very well, and according to the calculations the inflection point of the epidemic is at day 27.

Because the epidemic is in its first half in most locations, the reliability of the fitting parameters, and therefore of the derived parameters, is questionable.
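The fitting step, the R ratio and the 50% and 95% dates can be sketched in Python as follows. This is a minimal illustration and not the authors' LabVIEW code: it assumes the four-parameter saturation form $N(t) = N_{\max} + (N_0 - N_{\max})/(1 + (t/t_0)^p)$ with $N_0 = 0$, uses SciPy's `curve_fit` with `method="lm"` as a stand-in for the LM routine, and the function names, the starting-value heuristic and the synthetic data are all hypothetical. With $N_0 = 0$ the model gives $N/N_{\max} = x/(1+x)$ for $x = (t/t_0)^p$, so a fraction $f$ of the final value is reached at $t = t_0\,(f/(1-f))^{1/p}$; this is $t_0$ for 50% and $t_0 \cdot 19^{1/p}$ for 95%.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(t, n_max, t0, p, n0=0.0):
    """Assumed saturation model: n0 at t = 0, (n0 + n_max) / 2 at t = t0."""
    return n_max + (n0 - n_max) / (1.0 + (t / t0) ** p)


def fit_logistic(days, cases):
    """Fit (n_max, t0, p) with Levenberg-Marquardt; n0 stays fixed at 0."""
    days = np.asarray(days, dtype=float)
    cases = np.asarray(cases, dtype=float)
    # Hypothetical starting values: latest count as n_max, first day at or
    # above half of the latest count as t0, and an arbitrary power of 3.
    t0_guess = days[np.argmax(cases >= 0.5 * cases[-1])]
    guess = (cases[-1], t0_guess, 3.0)
    params, _ = curve_fit(logistic, days, cases, p0=guess,
                          method="lm", maxfev=10000)
    return params


def reliability_ratio(n_max, latest_cases):
    """R = EFVCNRI / latest CNRI."""
    return n_max / latest_cases


def day_reaching(fraction, t0, p):
    """Day at which the n0 = 0 curve reaches fraction * n_max."""
    return t0 * (fraction / (1.0 - fraction)) ** (1.0 / p)


if __name__ == "__main__":
    # Noise-free synthetic data, for illustration only (not from the paper).
    days = np.arange(1, 31, dtype=float)
    cases = logistic(days, 1000.0, 20.0, 5.0)
    n_max, t0, p = fit_logistic(days, cases)
    print("EFVCNRI:", n_max)
    print("R:", reliability_ratio(n_max, cases[-1]))
    print("50% day:", day_reaching(0.50, t0, p))
    print("95% day:", day_reaching(0.95, t0, p))
```

Days are counted from 1 in the sketch so that `(t / t0) ** p` stays finite even if the optimizer tries a negative power during its iterations.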
Fortunately, for some regions in China the epidemic seems to be saturated, so FVCNRI is known for those cases, and those datasets can be used to derive reliability criteria for the parameters found for other locations too. For the saturated datasets, we have performed the logistic fit on the first 10, 11, 12, …, M-1, M days of data (M = length of the dataset after the first registered infection) to estimate the behavior and convergence of the fit parameters. Here we focus only on EFVCNRI, as the other parameters still need additional investigation. We have observed two important facts:

• At the beginning of the dataset (after about 10 days), or a few days later, the EFVCNRI values are usually 10-50 times larger than FVCNRI, and from that time they start to converge smoothly (see Figure 3). Furthermore, starting from the extremely overestimated values, convergence is relatively fast, on the order of 10-20 days. An example can be seen in Figure 4.

• A unified criterion has to be established to decide the reliability of the found parameters without knowing FVCNRI. We have previously introduced the R parameter, which can be calculated from the fitted EFVCNRI ($N_{\max}$) and the most recent CNRI ($N(t_{\mathrm{last}})$) values. We have calculated this R parameter from day 10 to all available days in the datasets for China. We have found that the R parameter converges to 1 (the ideal value at the end of the epidemic) in a similar way as EFVCNRI converges to FVCNRI, and the two convergence curves are well correlated. Examples for the Hainan and Heilongjiang regions of China can be found in Figure 3.

After investigating all saturated datasets from China, we have found the following set of criteria to judge the results:

1. If the dataset is past the "huge overestimation" region and R>3, the reliability of the fit parameters is very poor, but in general we can state that EFVCNRI is overestimated, and the overestimation factor can be 5-30 or even more.
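The truncated-dataset test described above can be sketched as a loop that refits the model on the first m days and records EFVCNRI and R for each m. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the model form with $N_0 = 0$, the function name `convergence_scan`, the starting-value heuristic and the synthetic data are hypothetical, and SciPy's `curve_fit` with `method="lm"` stands in for the LabVIEW LM routine.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(t, n_max, t0, p):
    """Assumed saturation model with zero initial cases."""
    return n_max - n_max / (1.0 + (t / t0) ** p)


def convergence_scan(days, cases, first_length=10):
    """Refit on the first m points for m = first_length..M.

    Returns a list of (m, EFVCNRI, R) tuples; a fit that does not
    converge is recorded as (m, nan, nan).
    """
    days = np.asarray(days, dtype=float)
    cases = np.asarray(cases, dtype=float)
    rows = []
    for m in range(first_length, len(days) + 1):
        d, c = days[:m], cases[:m]
        # Hypothetical starting values, recomputed for every truncation.
        guess = (c[-1], d[np.argmax(c >= 0.5 * c[-1])], 3.0)
        try:
            (n_max, _, _), _ = curve_fit(logistic, d, c, p0=guess,
                                         method="lm", maxfev=20000)
        except (RuntimeError, ValueError):
            rows.append((m, np.nan, np.nan))
            continue
        rows.append((m, n_max, n_max / c[-1]))
    return rows


if __name__ == "__main__":
    # Noise-free synthetic "saturated" dataset, for illustration only.
    days = np.arange(1, 61, dtype=float)
    cases = logistic(days, 1000.0, 20.0, 5.0)
    for m, efvcnri, r in convergence_scan(days, cases):
        print(m, efvcnri, r)
```

On real, noisy case counts the early fits are poorly constrained, which is where the overestimation described above would show up; the noise-free synthetic curve used here is only meant to exercise the loop.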
Based on these observations, we have calculated the R parameter for all countries and regions and included it in Table 1. Furthermore, in all cases where R>3, we have highlighted the corresponding row in red to emphasize the non-reliability of those data. Table 1 presents pure calculation results; no correction based on the R value was applied.

It is important to emphasize that such behavior seems to be a standard feature. Furthermore, this increase of the R parameter is an indicator that the dataset and the prediction of the logistic model are expected to become reliable soon. Data are for Hungary.

While fitting the logistic model to the datasets for the various regions and countries listed in the database, we found that in many cases the number of registered infections was too small or too "noisy" to allow any realistic fit. The data for those places are not published in this study and are denoted with NaN. In some cases, the trend of the epidemic does not follow logistic behavior, as in Figure 5 and Figure 6. Unfortunately, in some regions of China (Figure 6), a second wave of the epidemic can be observed after a saturated period. Those regions will be studied later, because such second waves can happen in the future in any other location too. In all such cases, we provide NaN (Not a Number) as the result in Table 1.

As the world is in the middle or in the first half of the epidemic, the authors plan to continue their work, possibly with weekly updates and reanalysis of the updated dataset. Based on the present fit results, several countries are around the inflection point of the epidemic, and consequently their EFVCNRI is already in the converging period, with R~2. During the next week, EFVCNRI should continue converging, and in those regions signs of saturation might be observable.
Furthermore, we expect that countries with R>10 will also enter the converging regime soon, so that the outcome of their epidemic process can be predicted more accurately.

A novel coronavirus (COVID-19) outbreak: a call for action
Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
Predicting the ultimate outcome of the covid-19 outbreak in Italy
Logistic approximations used to describe new outbreaks in the 2020 COVID-19 pandemic
A simplified model for expected development of the SARS-CoV-2 (Corona) spread in Germany and US after social distancing
Short-term predictions of country-specific Covid-19 infection rates based on power law scaling exponents
Prediction of number of cases expected and estimation of the final size of coronavirus epidemic in India using the logistic model and genetic algorithm
Data analysis and modeling of the evolution of COVID-19 in Brazil
On the Evolution of Covid-19 in Italy: a Follow up Note
Solvable delay model for epidemic spreading: the case of Covid-19 in Italy
Analysis of logistic growth models
A Method for the Solution of Certain Non-Linear Problems in Least Squares
An Algorithm for Least-Squares Estimation of Nonlinear Parameters

The authors would like to express their appreciation for everybody fighting against the COVID-19 pandemic.