key: cord-0859495-dpawfnhw
authors: Ivorra, B.; Ferrández, M. R.; Vela-Pérez, M.; Ramos, A. M.
title: Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections. The case of China
date: 2020-04-30
journal: Commun Nonlinear Sci Numer Simul
DOI: 10.1016/j.cnsns.2020.105303
sha: 096b4e8b2eb39917f0017276339be437650cc797
doc_id: 859495
cord_uid: dpawfnhw

Abstract In this paper we develop a mathematical model for the spread of the coronavirus disease 2019 (COVID-19). It is a new θ-SEIHRD model (not a SIR, SEIR or other general purpose model), which takes into account the known special characteristics of this disease, as the existence of infectious undetected cases and the different sanitary and infectiousness conditions of hospitalized people. In particular, it includes a novel approach that considers the fraction θ of detected cases over the real total infected cases, which allows to study the importance of this ratio on the impact of COVID-19. The model is also able to estimate the needs of beds in hospitals. It is complex enough to capture the most important effects, but also simple enough to allow an affordable identification of its parameters, using the data that authorities report on this pandemic. We study the particular case of China (including Chinese Mainland, Macao, Hong-Kong and Taiwan, as done by the World Health Organization in its reports on COVID-19), the country spreading the disease, and use its reported data to identify the model parameters, which can be of interest for estimating the spread of COVID-19 in other countries. We show a good agreement between the reported data and the estimations given by our model. We also study the behavior of the outputs returned by our model when considering incomplete reported data (by truncating them at some dates before and after the peak of daily reported cases). By comparing those results, we can estimate the error produced by the model when identifying the parameters at early stages of the pandemic. Finally, taking into account the advantages of the novelties introduced by our model, we study different scenarios to show how different values of the percentage of detected cases would have changed the global magnitude of COVID-19 in China, which can be of interest for policy makers.

Modeling and simulation are important decision tools that can be useful to control human and animal diseases [1, 13, 22, 42] . However, since each disease exhibits its own particular biological characteristics, the models need to be adapted to each specific case in order to be able to tackle real situations [3, 48] .

Coronavirus disease 2019 (COVID- 19) is an infectious disease emerging in China in December 2019 that has rapidly spread around China and many other countries (see [45] ). On 11 February 2020, the World Health Organization (WHO) renamed the epidemic disease caused by 2019-nCoV as strain severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (see [12, 31] ). This is a new virus and a completely new situation [46] . On 30 January 2020, WHO declared it to be a Public Health Emergency of International Concern [35] . As of 11 March 2020, the disease was confirmed in more than 118,000 cases reported globally in 114 countries, more than 90 percent of cases are in just four countries (two of those -China and the Republic of Korea -have significantly declining epidemics) and WHO declared it to be a pandemic, the first one caused by a coronavirus [30] . On 1 April 2020 there are 872481 and 43275 official reported cases and deaths, respectively, and there is no vaccine specifically designed for this virus, with proven effectiveness.

There are some mathematical models in the literature that try to describe the dynamics of the evolution of COVID-19. Three phenomenological models are presented in [38] , which were validated with outbreaks of other diseases different from COVID-19, trying to generate and assess short-term forecasts of the cumulative number of reported cases. Other works (see e.g. [16] ), propose SEIR type models with little variations and some of them incorporate stochastic components. COVID-19 is a disease caused by a new virus, which is generating a worldwide emergency situation and needs a model taking into account its known specific characteristics. In particular, it would be convenient to develop a model which incorporates the following:

• the effect of undetected infected people (see [18] ), being able to show the dependence of the impact of COVID- 19 on the percentage of detected cases over the real total infected cases. Here, we denote by θ the fraction of detected cases over the real total infected cases. We note that there exist in the literature various works that propose estimations of θ in several countries (including China) for the COVID-19 pandemic (see, e.g., [18, 40] ).

• the effect of different sanitary and infectiousness conditions of hospitalized people (differentiating those with mild and severe conditions that will recover from those who will finally die),

• the estimation of the needs of beds in hospitals (which is one of the major problems for policy makers addressing .

The main goal of this paper is to develop a mathematical model well adapted to COVID-19 (not a SIR, SEIR or other general purpose epidemic models), taking into account the special characteristics mentioned previously. The model should be able to estimate, considering different scenarios, the number of cases, deaths and needs of beds in hospitals, in territories where COVID-19 is (or may be) a very serious health problem. It needs to be complex enough to capture the most important effects, but also simple enough to allow an affordable identification of its parameters, using the data that authorities report on this pandemic.

The θ-SEIHRD model developed here is based on the Be-CoDiS model (see [14] ), designed to be able to study the spread of human diseases worldwide. That model was initially used for the 2014-2016 ebola outbreak (see [14] ) and has also been used for the 2018-2020 ebola outbreak in the Democratic Republic of the Congo (see [6, 7] ), in both cases with very successful forecasts. Other works of our group related with the mathematical modeling of epidemics can be seen on the website https://www.ucm.es/momat/epidemics. In this paper, we have adapted that model to the specific case of COVID-19. Only within-country disease spread is considered in this paper for territories with a relevant number of people infected by COVID-19, where local transmission is the major cause of the disease spread. We point out that, when available, the parameters of the model are taken from the literature (see, e.g. [18, 19] ). Undocumented parameters or specific parameters to our approach are identified by using a multi-objective optimization approach, developed in a previous work (see [7] ). We point out that we use a novel approach that relates the disease fatality rate with the percentage of detected cases over the real total infected cases, which allows to study the importance of this percentage on the impact of COVID-19.

Next, we validate our approach by studying the specific case of China (including the Chinese Mainland, Macao, Hong-Kong and Taiwan, as done by the World Health Organization in its reports on COVID-19; see [29] ), which is the country with more data available to date. A good identification of the model parameters for this study case may be of interest for the application of the model to other territories. With that aim, we discuss and compare the results returned by our model with data reported by authorities on COVID-19 (see, e.g. [25, 27, 28] ). In particular, we describe the evolution of the effective reproduction number R e and compare the values with those found in the literature (see e.g. [20] ).

We also study the behavior of the outputs returned by our model when considering incomplete data (truncated at some dates before and after the peak of daily reported cases). By comparing those results with real observation data we can estimate the error produced by the model when identifying the parameters at early stages of the epidemic.

Finally, we study different scenarios to show how different values of the percentage of detected cases would have changed the impact of COVID-19 in China, which can be of interest for policy makers.

In this section, we describe the θ-SEIHRD model. First of all, we present the epidemiological characteristics of COVID-19. Then, we introduce a general and a detailed description of our approach. Finally, we detail some of the model outputs that will be used for the numerical experiments performed later.

According to the known characteristics of the COVID-19 pandemic, we assume that each person is in one of the following compartments (see [46] ):

• Susceptible (denoted by S): The person is not infected by the disease pathogen.

• Exposed (denoted by E): The person is in the incubation period after being infected by the disease pathogen, and has no visible clinical signs. The individual could infect other people but with a lower probability than people in the infectious compartments. After the incubation period, the person passes to the Infectious compartment I.

• Infectious (denoted by I): After the incubation period, it is the first compartment of the infectious period, where nobody is expected to be detected yet. The person has finished the incubation period, may infect other people and starts developing clinical signs. After this period, people in this compartment can be, either taken in charge by sanitary authorities (and we classify them as hospitalized), or not detected by authorities and continue as infectious (but in other compartment, I u ).

• Infectious but undetected (denoted by I u ): After being in the compartment I, the person can still infect other people, have clinical signs but is not be detected and reported by authorities. We assume that only people with low or medium symptoms can reach this compartment, not the people who will die. After this period, people in this compartment pass to the Recovered compartment R u .

• Hospitalized or in quarantine at home (but detected and reported by the authorities) that will recover (denoted by H R ): The person is in hospital (or in quarantine at home) and can still infect other people. At the end of this state, a person passes to the Recovered compartment R d .

• Hospitalized that will die (denoted by H D ): The person is hospitalized and can still infect other people. At the end of this state, a person passes to the Dead compartment.

• Dead by COVID-19 (denoted by D): The person has not survived the disease.

• Recovered after being previously detected as infectious (denoted by R d ):

The person was previously detected as infectious, survived the disease, is no longer infectious and has developed a natural immunity to the virus. When a person enters this compartment he/she remains in hospital for a convalescence period of C o days (average).

• Recovered after being previously infectious but undetected (denoted by R u ): The person was not previously detected as infectious, survived the disease, is no longer infectious and has developed a natural immunity to the virus.

The authorities may apply various control measures in order to control the COVID-19 spread (see [4] ):

• Isolation: Infected people are isolated from contact with other people. Only sanitary professionals are in contact with them. However, contamination of those professionals also occurs (see [32] ). Isolated patients receive an adequate medical treatment that reduces the COVID-19 fatality rate.

• Quarantine: Movement of people in the area of origin of an infected person is restricted and controlled (e.g. quick sanitary check-points at the airports) to avoid that possible infected people spread the disease.

• Tracing: The objective of tracing is to identify potential infectious contacts which may have infected a person or spread COVID-19 to other people. Increase the number of tests in order to increase the percentage of detected infected people.

• Increase of sanitary resources: The number of operational beds and sanitary personal available to detect and treat affected people is increased, producing a decrease in the infectious period for the compartment I.

The model is used to evaluate the spread of a human disease within some territories during a fixed time interval. At the beginning of the simulation, the model parameters are set by the user (for instance, the values considered for COVID-19 are described in Section 3). We consider as time t = 0 the 1 December 2019 (7 days before the date that appears in the literature as the most probable date for the index case in China. Here we took into account that, according to the World Health Organisation's website, the first confirmed COVID-19 case in China was on December 8 (see [34] ). Furthermore, according to [46] , the earliest symptom onset of confirmed patients can be traced back to 7 December 2019. We set then t = 0 that day at 10AM CET (this is a technical adjustment, since each day the WHO provides data as reported by national authorities by that hour; see [29] ).

We can start our simulation at any initial time t 0 ≥ 0, considering that only susceptible people live in the territories that are free of the disease, whereas the number of people in compartments S, E, I, I u , H R , H D , R d , R u , and D of the infected territories are set to their corresponding values. Then, during the time interval [t 0 , t 0 + T max ], with T max ∈ IN being the maximum number of simulation days, the within-country daily spread procedures (described in Section 2.3) are applied. If at the end of a simulation day t there is less than 1 person in each compartment E, I, I u , H R and H D in all the territories considered, the simulation is stopped. Else, the simulation is stopped when t = t 0 + T max . Furthermore, the control measures are also implemented and they can be activated or deactivated, when starting the model, in order to quantify their effectiveness to reduce the magnitude and duration of a COVID-19 epidemic.

A diagram summarizing the main structure of our full model is presented in Figure 1 . After describing the full model, we will present a simplified version (the one we will use for our simulations), with an also simpler diagram (see Figure 2 ).

The choice of using a deterministic model instead of a stochastic one is done as a first approach, since such kind of models present some advantages, such as: a low computational complexity allowing a better calibration of the model parameters or the possibility of using the theory of ordinary differential equations for suitably analyzing and interpreting the model. Furthermore, according to [5] , deterministic models should be the first tool to be used when modelling a new problem with few data. The authors of that work also note that the stochastic models are not suitable when it is difficult or impossible to determine the distribution probability, are difficult to analyze and require more data for the calibration of the model.

As said above, in this work we will only consider the within-country disease spread of territories where, starting from suitable values of t 0 , the COVID-19 pandemic is already spreading by its own, with a relatively negligible dependence of the international movement of people. The between-country disease spread will be considered in another work.

The dynamic disease spread within a particular contaminated territory i is modeled by using a deterministic compartmental model (see, for instance, [3] ).

We assume that people in a territory are characterized to be in one of those compartments, described in Section 2.1: S, E, I, I u , H R , H D , R d , R u or D. For the sake of simplicity we assume that, at each time, the population inside a territory is homogeneously distributed (this can be improved by dividing some territories into a set of smaller regions with similar characteristics). Thus, the spatial distribution of the epidemic inside a territory is omitted. We also assume that new births are susceptible people. We do not consider here movement of people between territories.

Under those assumptions, the evolution of the compartments mentioned above is modeled by the following system of ordinary differential equations (which is simplified below):

where:

• i ∈ {1, . . . , N C }, with N C ∈ N being the number of countries/territories/areas considered.

• N (i) is the number of people in territory i before the start of the pandemic.

• µ (i) n ∈ [0, 1] is the natality rate (day −1 ) in territory i: the number of births per day and per capita.

• µ (i) m ∈ [0, 1] is the mortality rate (day −1 ) in territory i: the number of deaths per day and per capita (or, equivalently, the inverse of the mean life expectancy (day) of a person).

• ω (i) (t) ∈ [ω (i) , ω (i) ] ⊂ [0, 1] is the case fatality rate in territory i at time t: the proportion of deaths compared to the total number of infectious people (detected or not detected). Here, ω (i) and ω (i) are the minimum and maximum case fatality rates for territory i, respectively. 

Iu I Figure 1 : Diagram summarizing the model for COVID-19 given by system (1) .

is the fraction of infected people that are detected and documented by the authorities in territory i at time t. We are assuming, for the sake of simplicity, that all deaths due to COVID-19 are detected and reported, so that θ (i) (t) ≥ ω (i) . Estimations of θ for the COVID-19 pandemic in several countries (including China) can be found in the literature (see, e.g., [18, 40] ).

• β

HD ∈ R + are the disease contact rates (day −1 ) of a person in the corresponding compartments, in territory i (without taking into account the control measures).

• β (i) Iu (θ) ∈ R + is the disease contact rate (day −1 ) of a person in the compartment I u in territory i (without taking into account the control measures), where the fraction of infected people that are detected are θ.

• γ E ∈ (0, +∞) is the transition rate (day −1 ) from compartment E to compartment I. It is the same for all the territories.

• γ (i)

HD (t) ∈ (0, +∞) denote the transition rates (day −1 ) from compartments I u , H R and H D to compartments R u , R d and D, respectively, in territory i at time t.

• m (i) • τ (i) 1 (t) is the people infected that arrives to territory i from other territories per day. τ (i) 2 (t) is the people infected that leaves territory i per day. Both can be modeled following the between-country spread part of the Be-CoDiS model, see [14] ).

We point out that the 9th equation of system (1) is not coupled with the other equations. Thus, we can solve the first eight equations of that system and the solution of the last one can be computed as follows:

From a modeling point of view, the term ω (i) (t) θ (i) (t) in the 5th equation of system (1) corresponds to the apparent fatality rate of the disease (obtained by considering only detected cases) in territory i at time t, and ω (i) (t) is the real fatality rate of the COVID-19 (taking into account detected and undetected cases).

We could have also considered sub-compartments organized by ages, etc., but that would complexify the model and would hinder the identification of parameters. For instance, we could use in each compartment a different natality rate for the people infected (according to their age distribution). Actually, since the natality and mortality (not from COVID-19) do not seem to be important factors for COVID-19 (at least for relatively short periods of time) one may consider the simplified model with µ m = µ n = 0. We will consider this case in the rest of this paper. The corresponding new diagram summarizing the main structure of the simplified model can be seen in Figure 2 and the resulting system equations, after removing the index i denoting different territories (for the sake of simplicity), is given by the system (3) 

We remark that, with this simplification the 7th, 8th and 9th equations of system (3) are not coupled with the other equations. Thus, we can solve the first six equations of that system and the solution of the last three equations can be computed by using (2) and

For the numerical simulations presented in Section 4, the first six equations of system (3) were numerically solved with the classic 4 th order Runge-Kutta method with four stages (RK4) and with four hours as time step (which was tested to see that it is suitable to get stable results). 

Here, we present the outputs of the mathematical model (for territory i; we continue avoiding the use of that index in the notation, for the sake of simplicity), used to analyze the results of the simulations performed in Section 4:

• c m (t): The model cumulative number of COVID-19 cases (in country i) at day t, which is given by

and can be also computed as follows:

The model cumulative number of deaths (due to , at day t (in territory i), which is given by D(t).

• R 0 and R e : The basic reproduction number and the effective reproduction number of COVID-19 (for territory i). The basic reproduction number is defined as the number of cases one infected person generates on average over the course of its infectious period, in an otherwise uninfected population and without special control measures.

It depends on the considered population (therefore, it may be different for different territories), but does not change during the spread of the disease.

The effective reproduction number is defined as the number of cases one infected person generates on average over the course of its infectious period. Part of the population can be already infected and/or special control measures may have been implemented. It depends on the considered population and changes during the spread of the disease. Furthermore, R e (0) = R 0 . Typically, the spread of the disease slows down when R e (i, t) < 1.

We apply the Next Generation Matrix method (see [43] ) to system (3) and obtain that:

where, for the sake of simplicity of notation, all previous coefficients correspond to their particular values at time

and, again in order to simplify the notation, all previous coefficients correspond to their particular values at time t.

• Hos(t): The number of people in hospital is estimated as follows:

where p(t) is the fraction, at time t, of people in compartment H R that are hospitalized and C o days is the period of convalescence (i.e. the time a person is still hospitalized after recovering from . This function can help to estimate and plan the number of clinical beds needed to treat all the COVID-19 cases.

• MHos: The maximum number of hospitalized people at the same time (in territory i) during the time interval [t 0 , T ]. It is computed as:

This number can help to estimate and plan the number of clinical beds needed to treat all the COVID-19 cases.

• Γ E (t), Γ Iu (t) and Γ H (t): The number of people infected during the time interval [t 0 , T ], by contact with people in compartment E, I u and H = H R + H D , respectively. They are given by:

We point out that c m (t) and d m (t) can be compared with the corresponding values reported by WHO (see [32] )

Some of the parameters used in the simulations presented in Section 4 have been found in the literature. However, despite the effort to use the maximum amount of robust parameters as possible, and due to lack of information of the behavior of the SARS-CoV-2, some of them have been estimated using empirical assumptions or techniques for parameter identification. This part should be improved as soon as new information is available. We now detail each kind of parameter by its category.

We have built a database where we have recorded the following data, that we use to estimate the parameters of the model and to compare with the outputs of the numerical simulations presented in Section 4 We obtain the countries' population (N) for year 2018 from the World Data Bank, downloaded on 5 March 2020 (see [2] ).

There are some studies about the relevance of humidity and temperature for the spread of COVID-19. In [21] , it is shown that the observed patterns of COVID-19 are not completely consistent with the hypothesis that high absolute humidity may limit the survival and transmission of this virus. Furthermore, in [47] it has been found that the lower is the temperature, the greater is the survival period of the SARS-CoV-2 outside the host. Since there is no clear scientific evidence of the effect of the humidity and the temperature on the SARS-CoV-2, we have not included these two factors in our model (this would need to be revised in case of appearing new findings regarding this topic).

Furthermore, from the WHO Coronavirus disease (COVID-2019) situation reports (see [29] ), we have obtained the reported cases per day (i.e. people who enter each day in compartments H R or H D ) and the reported COVID-19 deaths per day. For each day t, they are denoted by c r (t) and d r (t), respectively. We have those data starting from 21 January 2020. Data from 12 to 21 January 2020 were obtained from the following website:

https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/ We assume that the index case started his incubation period in China on 1 December 2019, as the earliest symptom onset of confirmed patients can be traced back to 7 December 2019 (see [46] ). In order to complete the missing information for the days without data between 8 December 2019 and 10 January 2020, we use a cubic hermite interpolation method considering the fact that before 8 December 2019 no cases were reported.

Data given in Table 1 were calibrated for the cases in China, the first big source of COVID-19. However, due to the spread of this disease, new studies should be performed to analyze the behavior of COVID-19 in other sanitary, population and climatic conditions. Currently, very few studies accepted for the scientific community are available. 

Focusing on the application of the control measures, we multiply the disease contact rates (i.e. β E , β I , β Iu , β HR and β HD ) by decreasing functions simulating the reduction of these rates as the control measures efficiency is increased.

Here, we have considered the functions (see [17] ):

where, for every j ∈ {0, ..., q}, m j ∈ [0, 1] measures the intensity of the control measures (greater value implies lower value of disease contact rates), κ j ∈ [0, +∞) (day −1 ) simulates the efficiency of the control measures (greater value implies lower value of disease contact rates) and λ j ∈ [t 0 , ∞] denotes the first day of application of each control strategy (λ 0 ∈ [0, ∞] is the first day of application of a control strategy that was being used before t 0 , if any). Here, q ∈ N is the number of changes of control strategy.

The values of λ j are typically taken from the literature (using the dates when the territories implement special control measures). Some of the values of m j can be also sometimes known. The rest of the parameters need to be calibrated as explained in Section 3.7.

The case fatality rate ω(t) depends on the territory i and time t. We recall that ω is the proportion of deaths compared to the total number of infectious people (i.e. including detected and undetected people). As observed in other epidemics (see, e.g. [14] ), it can be affected by the application of control measures (such as: earlier detection, better sanitary conditions, etc.). Thus, we propose to consider the following definition:

where ω ∈ [0, 1] is the case fatality rate when no control measures are applied (i.e. m I (t) ≡ 1); and ω ∈ [0, 1] is the case fatality rate when the implemented control measures are fully applied (i.e. m I (t) ≡ m q for t ∈ [λ i−q , λ q )).

Constants ω and ω are estimated with the calibration process mentioned in Section 3.7. We note that ω ≥ ω, thus, we consider ω = ω + δ ω , with δ ω ∈ [0, 1 − ω].

We denote by d I , d Iu , d HR and d HD the (average) duration in days of a person in compartment I, I u , H R and H D , respectively, without the application of control measures. Additionally, we consider that:

• According to [18] , the transition rate from E to I depends only on the disease and, therefore, is considered constant.

• According to [18, 19] , the value of γ I (t) can be increased due to the application of control measures (i.e. people with symptoms are detected earlier). As a consequence, the values of γ Iu (t), γ HR (t) and γ HD (t) can be decreased (e.g. people with symptoms stay under observation during more time).

• According to [18] , d HR is lower than d HD . Thus, we consider d HD = d HR + δ R , where δ R > 0.

• We assume, for the sake of simplicity, that the infectious period for undetected people is the same than that of the hospitalized people that survive the disease. Thus, d Iu = d HR . Furthermore, we assume that the addition of the times a person is in the compartments I and I u is constant. Therefore, following [14] , we use the following functions:

where g(t) = d g (1−m I (t)) represents the decrease of the duration of d I due to the application of control measures, at time t; and d g is the maximum number of days that d I can be decreased due to the control measures.

Regarding β Iu , we may consider

where β I and β I are suitable lower and upper bounds, respectively. For the sake of simplicity, here we take β I = β I . Furthermore, following the idea proposed in [14, 18] , we assume that there exists a relationship between the contact rates β I , β E , β HR , β HD and β I .

More precisely, we assume that people in compartments E, H R , H D and I u are less infectious than people in compartment I (due to their lower virus load or isolation measures, see [18] ). Thus, we consider that β E = C E β I , β HR (t) = β HD (t) = C H (t)β I and β I = C u β I , where C E , C H (t) and C u ∈ [0, 1]. In Section 4.1, we show how to compute the function C H (t) for the case of China (something similar can be done for other territories).

First, we determine the set of unknown model parameters to be identified by the approach presented here. We denote this set by Ω. The specific parameters considered to be included in Ω will depend on the available information about the disease in the considered territory. For the case of China considered in this work, we list the parameters in Ω at the end of Section 4.1.

Once Ω is defined, we follow the methodology proposed in [7] . More precisely, we define a multi-objective optimization problem based on the minimization of the relative errors between the cumulative number of cases and deaths reported by the authorities (i.e. c r and d r ) and the cumulative number of cases and deaths estimated by the model when considering a particular set of values for Ω (denoted by c Ω m and d Ω m , respectively). To do so, we consider the following multi-objective optimization problem:

where S is the feasible region (i.e. upper and lower bounds of each parameter), n is the number of parameters in Ω, and f c and f d are the objective functions defined as:

where g L 2 (t0,t0+Tmax) = t0+T max t0 g(t) 2 dt 1/2 denotes the L 2 norm of function g in [t 0 , t 0 + T max ]. We note that the choice of this norm reduces the risk of over-fitting to the reported data [37] . Next, to solve this multi-objective optimization problem, the algorithm called Weighting Achievement Scalarizing Function Genetic Algorithm (WASF-GA) is applied [8, 9, 39] . We choose this particular multi-objective algorithm as it has been applied with success in a previous work to identify some parameters of an epidemiological model simulating the evolution of Ebola Virus Disease outbreaks in Democratic Republic of the Congo [9] .

From a general point of view, WASF-GA is an evolutionary algorithm that manages a population of possible solutions (called individuals), trying to improve them iteratively towards the optimal set (called Pareto optimal set). To do that, it uses two procedures at each iteration: reproduction and replacement. Firstly, the reproduction method applies some classical genetic operators (here: selection, mutation, and crossover) to the population of the previous iteration to obtain a new offspring population. Secondly, the replacement procedure decides which solutions from the previous iteration and the offspring populations should compose the population for the next iteration. The main feature of WASF-GA is that this decision is based on preferences, which are defined as reference values for the objectives. Indeed, it classifies the individuals using an achievement scalarizing function (see [39] ) characterized by the L ∞ distance between the objective function values and the reference values with different weights. The considered sample of weight vectors is built following [39] to ensure they are well distributed. As a result, WASF-GA returns a set of solutions covering a region of the Pareto optimal front.

In this work, we set the number of iterations to 15000, the population size to 50 individuals, the number of weight vectors to 50, and the reference values for the objectives to zero, corresponding to the ideal situation of a total matching between the model outputs and the reported data. The remaining input parameters of WASF-GA are those reported in [39] .

Since WASF-GA is a stochastic algorithm, it may return different Pareto front approximations for different executions of the same instance. Then, to guarantee that results do not depend on the stochasticity of the algorithm, we perform 60 runs of each experiment. Among all the 60 sets of solutions, we extract the most balanced solution, meaning the one whose objective values are the closest to the reference values (in our case, zero) when considering the Euclidean distance.

We now propose some numerical experiments to illustrates the efficiency of our approach by considering the particular case of the COVID-19 in China. First, we detail the considered parameters related to this scenario. Then, we present and analyze the obtained results.

We performed several numerical simulations for the case of China. For all the cases below we have used N = 1.400.812.636 people and p(t) ≡ 1, as it was decided to hospitalize all cases to reduce onward transmission (see [44] ). Following [18, 19] , we set d E = 5.5 days (i.e. γ E = 1/5.5 days −1 ), d I = 6.7 days, d g = 6 days, d Iu =14−d I = 7.3 days (i.e. γ Iu = 1/7.3 days −1 ).

For β Iu , we use the linear relationship

Additionally, according to [33] , on 20 February 2020, 2055 healthcare workers were infected in China. At this date, 74651 cases were reported in China. From this data, we assume that the infection due to contact with people in compartment H R or H D should represent around 100 2055 74651 % = 2.75% of the number of cases. Thus, we compute β HR (t) = β HD (t) = C H (t)β I such that

This implies that,

In the previous expression, the coefficients depend on territory i and time t.

Following data reported by the WHO, we have aggregated data from the Chinese Mainland, Macao, Hong-Kong and Taiwan. Additionally, we assume that the movement of people in and out of China is negligible in order to estimate the number of cases and deaths in China. Therefore, for the sake of simplicity, we assume in this section that τ 1 = τ 2 ≡ 0.

Reported data for China have some inconveniences for modeling purposes. On 13 February 2020, the Hubei (the Chinese province with the largest number of cases and deaths) authorities decided to report as infected not only people who was positive in laboratory tests, but also clinically diagnosed cases in that province (see [25] ). This supposed a sudden increase of around 15000 cases in one day (see [15] ).

In addition, up to 16 February 2020, the WHO decided to continue reporting only cases confirmed with laboratory tests in its situation reports (see [24] ). On 17 February 2020, the WHO changed its protocol to count cases and decided (see [27] ) to include in its reports the data provided by the Hubei province. This supposed a sudden increase of around 20000 cases in one day (see [25] ).

On 21 February 2020 China (and WHO) informed that they have "revised their guidance on case classification for COVID-19, removing the classification of 'clinically diagnosed' previously used for Hubei province, and retaining only 'suspected' and 'confirmed' for all areas, the latter requiring laboratory confirmation. Some previously reported 'clinically diagnosed' cases are thus expected to be discarded over the coming days as laboratory testing is conducted and some are found to be COVID-19-negative" (see [28] ).

Thus, different criteria have been used in the official reports, with even different data in Chinese and WHO reports from 13 February 2020 up to 16 February 2020. This hinders the calibration of our mathematical model and makes very difficult an accurate forecast. Therefore we have filtered the data to get more reliable values.

In order to smoothly distribute the sudden increase of 17410 cases in the number of cases reported before the 17 February 2020, we distributed those 17410 cases to previous dates according to the following formula:

where c ar (i, t) denotes the adjusted number of reported cases in country i at time t, t =12 January 2020 is the date of the first available data and t a =16 February 2020. This filtered function is the one used for the identification of parameters mentioned in Section 3.7.

This filtering procedure highlights the fact that reported data could be inconsistent. In this case, since our model reproduces the biology of the pandemic, if reported data are incorrect (with the exception of small noise) and are not consistent with the biology of the pandemic, the model would not be able to fit the data in a reasonable way. Thus, the use of this model could asses the fact that some data are reported incorrectly.

We show in Figure 3 , the evolution of c r and c ar in China from 12 January 2020 up to 28 March 2020. According to [18] , 86% of all infections were undocumented prior to 23 January 2020, when travel restrictions where applied in Wuhan. Furthermore, after Wuhan's closure measures (later extended to much of China), the dynamics of the epidemic changes. Between 24 January 2020 and 8 February 2020, it is estimated that the percentage of invisible infected people drops to 35% (see [18] ). Therefore, we consider here that the numbers of deaths that we have in our database, reported by authorities, are very close to the real ones, but the real total cases are underestimated. Actually, according to the percentages given above, we take 

Thus, even if the reported deaths d r (t) can be considered a good estimation of the real numbers, neither the reported cases c r (t) nor the adjusted reported cases c ar (t) described above would be a good estimation of the real number of infected cases c real (t), which can be computed as

. In order to effectively prevent further exportation of infected individuals to the rest of the country, the Chinese government decided to impose a cordon sanitaire (i.e. all public transportation and all outbound trains and flights were halted) around Wuhan and neighboring municipalities on 23 January 2020 ( [29] ). Therefore, we considered the implementation of control measures after this date, in order to represent the real situation of the measures imposed:

where λ 1 =23 January 2020 and the rest of parameters need to be identified. Using the multiobjective optimization process mentioned in Section 3.7, we determine the following parameters: β I , C E , C u , δ R , δ ω , ω and κ 1 . All those parameters are bounded according to values found in the literature (see [18] ). In Table 2 , we recall the meaning of those parameters and their range of values considered during the optimization process for the particular case of COVID-19 in China.

From a numerical point of view, as explained above, we use the classic 4 th order Runge-Kutta method with four stages (RK4) and with four hours as time step in order to approximate the solution of system (3). The numerical model and the algorithm WASF-GA are implemented in Java. Experiments are performed in a computer with an I7-4790 3.6Ghz CPU and 32Gb of Ram. One execution of the model takes around 0.01 seconds and the full parameter identification process takes around 660 seconds.

We present and discuss the results returned by our model, considering the parameters that were fixed for the case of China and the parameters to be identified (presented in Section 4.1).

In particular, to show the robustness of the proposed approach, we have run the experiment by considering data c ar (t) and d r (t), from the index case up to different values of the final date t f . We set t f to the following particular • t f =29 March 2020: this date corresponds to that of the last data available before running the numerical experiments. We denote this experiment by EXP 29M .

• t f =25 February 2020: this date is a representative date for which the epidemic in China was decreasing (i.e. the daily number of reported cases was decreasing during several days). We denote this experiment by EXP 25F .

• t f =8 February 2020: this date is considered as the inflection point for which the daily number of reported cases starts to decrease, we denote this experiment by EXP 08F .

• t f =29 January 2020: this date is a representative date for which the epidemic in China was growing (i.e. the daily number of reported cases was increasing during several days), we denote this experiment by EXP 29J .

The objective is to show how the model and the parameter estimation methodology deal with incomplete data. This study also illustrates the robustness of our approach to the over-fitting of the model parameters with respect to the reported data.

In Table 3 , we show the values of the optimized parameters obtained at the end of each experiment. From those results, on the one hand we observe that the values of β I , C E , C u , ω, δ ω and κ 1 show a progressive evolution to the value obtained with EXP 29M , when increasing the amount of available data for fitting the model to observed data. There is a noticeable difference between the values returned at the end of EXP 29M and EXP 29J . This indicates that a reasonable amount of data should be available to estimate well some of the model parameters. Estimating those parameters at an early stage of the epidemic could produce poorly consistent results.

Additionally, for parameters C u , ω, δ ω and κ 1 , we observe that the results returned by EXP 29M and EXP 25F are quite similar.

Additionally, focusing on δ R , we observe that the value generated with EXP 29M and EXP 08F clearly differs from the one obtained with EXP 25F and EXP 29J . We note that this particular parameter was already reported in the literature (see, e.g. [18] ) as being difficult to be estimated (considering a range of one month). Again, this shows that the estimation of some key parameter could change dramatically during an epidemic and re-calibration should be performed periodically to capture the recent behavior of the epidemic.

In Figure 4 , we show the evolution of the number of cases and deaths in China from 1 December 2019 to 29 March 2020 for the reported data (i.e. c ar (t) and d(t)) and for the results obtained at the end of experiments EXP 29M , EXP 25F , EXP 25F and EXP 29J .

Focusing on the number of cases, we observe that results obtained with EXP 29M , EXP 25F reproduce quite accurately the evolution of the number of cases. In particular, we found a final number of cases around 80980 and 83800, considering EXP 29M and EXP 25F , respectively (corresponding to an error of less than 1.8% with respect to the reported values). The final number of cases obtained with EXP 08F is around 65950, which correspond to an underestimation of -20% with respect to real reported data. Finally, we observe that EXP 29J forecasts poorly and Focusing on the number of deaths, it seems that all results do not capture efficiently the drastic decrease occurring at the end of February. However, the result returned by EXP 29M shows a slope similar to that of the reported deaths. In all cases, the obtained estimations are in a reasonable range compared to the reported final value (with an error of ±24%).

We now focus on the results returned at the end of Experiment EXP 29M to analyze the main key factor that may explain the dynamics of the COVID-19 epidemic in China up to 29 March 2020.

In Figure 5 , we show the evolution of the adjusted reported cases and the detected, undetected and total (i.e. detected plus undetected) cases estimated by our model. We observe that the final number of undetected cases (around 87100 cases) estimated by our model represents around 52% of the total number of cases (around 168100 cases). Furthermore, it seems that control measures have not contained their growing as fast as for the detected cases. These results are interesting as they indicate that, despite the relative control of the epidemic in China, there may still exist an undetected source of infected people that could cause an increase of the epidemic in a near future, if the control measures are significantly relaxed.

In Figure 6 , we show the evolution of Hospitalized and Recovered people estimated by our model and reported data. The model estimates that the peak of the number of people hospitalized at the same time has been reached around 17 February 2020, with 64704 hospitalized patients. According to data reported in https://www.worldometers.info/coronavirus/country/china/ the peak was reached on 17 February 2020 with around 58000 hospitalized people at the same time. The model estimates quite reasonably the number of hospitalized people. However, the obtained results slightly overestimate the real observations by around 11.6%. Focusing on Recovered people, our model returns a final number of 75100 people, which is very close to the real observation (around 75130 recovered people on 29 March 2020). This shows that the proposed methodology is a reasonable decision tool to estimate the number of beds in hospitals during an epidemic or pandemic.

Additionally, in Figure 7 , we show the evolution of Hospitalized people in China considering the adjusted reported cases (i.e. c ar (t) minus the number of reported deaths and recovered people at time t) and the results returned at the end of experiments EXP 29M , EXP 25F , EXP 08F and EXP 29J . We observe that in all cases, the estimated date for the peak of people in Hospital is close to the real observed date (17 February 2020) . This seems to indicate that our approach is able to detect early a reasonable approximated date for this peak. Focusing on the magnitude of this peak, EXP 29M , EXP 25F and EXP 08F give acceptable estimations (with error of ±13.8%). However, as expected, the magnitude of the peak returned with EXP 29J underestimates the reported data (with an error of -24%).

In Figure 8 , we present the evolution of the daily number of new people in the exposed compartment, new cases and new deaths in China from 1 December 2019 to 29 March 2020 considering the results returned at the end of Cumulative number of people 10 4 Rep. Cases (a.r.c.) Pred. Detected Cases Pred. Undetected Cases Total Pred. Cases of new reported deaths. These outputs are of interest to estimate the behavior of the epidemic curve (see [10] ), and they can be used to study the efficiency of control measures to flatten this curve.

Here, we observe that considering the daily number of new cases and new deaths, the predicted values fit reasonably well the reported ones (taking into account that there were several errors and approximations in the reported number of deaths, see, e.g. [26] ). Focusing on the number of new daily people in the exposed compartment E, we observe that the peak was reached on 27 January 2020, suggesting that the isolation measures taken in China on 23 January 2020 were efficient to quickly reduce the magnitude of the epidemic. The peak for the new daily cases and deaths (which is the visible part for the society) is delayed due to the transition rates between each compartment of the disease. Therefore, despite the effects of the control measures are immediate, this effect is visible for the society after a non negligible delay (in this case around two weeks).

In Figure 9 , we show the evolution of the effective reproduction number R e (t). We observe that this value decreases since the application of the control measures and it remains below 1 after 1 February 2020. These results are consistent with the fact that a decrease in the daily number of reported cases was observed at the beginning of February 2020.

In Table 4 , we present some important parameters and threshold values related to this epidemic. In particular, we observe that in our case the basic reproduction ratio R 0 is 4.2732, which is bigger than other values reported in the literature (between 2.5 and 3.0, see [18] ). This could be explained by the fact that we have taken into account the undetected people.

Additionally, we observe that people in compartment E represent around 26% of the infections. People in compartment I u caused around 37% of infections. Those results suggest that people with no or few symptoms were the major source of infection. However, it seems that our model underestimates the infection due to hospitalized people with only 1% of the infections (it should be around 3%). A more precise approach should be proposed to estimate better the value of C H .

Finally, we analyze the impact of the detection parameter θ(t) on the global magnitude of the epidemic. With that aim, considering the model parameters obtained at the end of EXP 29M , we replace θ(t) by a constant function with Number of people 10 4 Rep. Recovered Pred. Recovered 

of affected people has a major impact on the global magnitude of the epidemic. The model estimates a reduction on the total amount of cases from 206500, when detecting only 14% of cases, to less than 2000 total cases, when detecting 100% of cases (which is an ideal case). Those results suggest that a detection campaign should be applied to reduce the impact of current COVID-19 epidemics. 

In this work, we developed a new θ-SEIHRD model adapted to the characteristics of COVID-19. This model considers some important characteristics of the COVID-19, such as the undetected infectious people and the effect of the control measures. In particular, it includes a novel approach that considers the fraction θ of detected cases over the real total infected cases, which allows to study the importance of this ratio on the impact of COVID-19. We also introduced a method to estimate some of the unknown parameters of this model. Additionally, we proposed the use of a filtered version of the data reported by the WHO, in order to smoothly distribute along the previous dates the sudden increase of 17414 cases reported on 17 February 2020. From a general point of view, the outputs returned by the simulation fit quite well the data reported by the WHO. In particular, they estimated reasonably the date and magnitude of the peaks corresponding to the number of new cases, new deaths and amount of hospitalized people. This indicates that the proposed methodology can be used as an useful decision tool for policy makers. However, considering an estimation performed at an early stage of the epidemic could produce poor results. Focusing on the study of the peaks of some relevant curves of the epidemic (e.g. new exposed people, new cases, new deaths and hospitalized people), results show that the control measures in China were efficient to quickly reduce the magnitude of the epidemic. However, due to the transition rates between the disease compartments, the visible impact for the society has a delay of two weeks after the beginning of those control measures.

From the model and simulations considered in this work, we estimate that the value of the basic reproduction number R 0 , for COVID-19 in China, is 4.2732. We note that this value is higher than the one usually reported in the literature (between 2.5 and 3 see, e.g. [18] ). This could be due to the fact that our model also estimates the amount of undetected cases, which increases the initial growth of the epidemic. Additionally, the effective reproduction number R e decreased, mainly due to the application of control measures, and reached values lower than 1 after 1 February 2020.

Considering the total cases, the model estimates that, including undetected cases, around 168100 people could have been infected in China. Those undetected cases could represent around 52% of the total number of cases. Furthermore, they may have caused around 37% of the total infections. These results are consistent with other conclusions found in the literature (see [36] ). Other studies found that asymptomatic or mild cases represent about 40% − 50% of all infections (see [23] ) whereas in [41] it is estimated that the percentage of symptomatic COVID-19 cases reported is 33%.

Finally, we also proposed a study of the impact of the percentage of detection of cases and obtained that the magnitude of the epidemic can be drastically reduced when increasing this percentage. This results could be used as a recommendation to countries currently actively affected by COVID-19.

However, we also note some limitations in the proposed methodology. We assume that the population inside the considered countries or territories is homogeneously distributed (this could be improved by dividing some of them into a set of smaller territories with similar characteristics). Thus, the spatial distribution of the epidemic inside a territory is not taken into account. Additionally, the current model is only March 2020, considering parameters from EXP 29M and the function θ given in (8) , and also scenarios in which θ(t) is given the constant values of 0.14, 0.25, 0.5, 0.75 and 1.

suitable for countries or territories with a relevant number of people infected by COVID-19, where local transmission is the major cause of the disease spread. The between-country spread has not been modeled in this work. Moreover, since there is no clear scientific evidence of the effect of the humidity and the temperature on SARS-CoV-2, we have not included these two factors in our model (this would need to be revised in case of new findings regarding this topic).

In this work we also point out the fact that we worked with official reported data, whose quality has been affected by: i) changes in guidelines to count cases in China (as mentioned above); ii) uncertainty about the number of undocumented infections; and iii) uncertainty about some of the characteristics of SARS-CoV-2.

[ https://www.who.int/news-room/detail/ 30-01-2020-statement-on-the-second-meeting-of-the-international-health-regulations-(2005) -emergency-committee-regarding-the-outbreak-of-novel-coronavirus-(2019-ncov), January 2020.

Population biology of infectious diseases: Part 1

World Bank Open Data

COVID-19 control in China during mass population movements at New Year. The Lancet

Understanding Infectious Disease Dynamics

Application of the Be-CoDis model to the 2018-19 Ebola Virus Disease outbreak in the Democratic Republic of Congo

A multi-objective approach to estimate parameters of compartmental epidemiological models. Application to Ebola Virus Disease epidemics. researchgate.net

High-performance computing for the optimization of high-pressure thermal treatments in food industry

Preference-based multi-objectivization applied to decision support for high-pressure thermal processes in food treatment

Using an epi curve to determine mode of spread

European Centre for Disease Prevention and Control. Discharge criteria for confirmed COVID-19 cases -When is it safe to discharge COVID-19 cases from the hospital or end home isolation?

Severe acute respiratory syndrome-related coronavirus: the species and its viruses -a statement of the

Mathematical formulation and validation of the Be-FAST model for classical swine fever virus spread between and within farms

Be-CoDiS: A mathematical model to predict the risk of human diseases spread between countries. Validation and application to the 2014 Ebola Virus Disease epidemic

Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE)

Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases

Statistical inference in a stochastic epidemic seir model with control intervention: Ebola as a case study

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2)

Transmission dynamics of 2019 novel coronavirus (2019-ncov). bioRxiv

The reproductive number of COVID-19 is higher compared to SARS coronavirus

The role of absolute humidity on transmission rates of the COVID-19 outbreak

A novel spatial and stochastic model to evaluate the within-and between-farm transmission of classical swine fever virus. I. General concepts and description of the model

Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship

BNO news. Tracking coronavirus: Map, data and timeline

World Health Organization. Coronavirus disease 2019 (COVID-19) situation report -24

World Health Organization

World Health Organization

World Health Organization. Coronavirus disease 2019 (COVID-19) situation report -31

World Health Organization. Coronavirus disease (COVID-2019) Situation reports

Director-general's opening remarks at the media briefing on COVID-19 -11

Naming the coronavirus disease (COVID-19) and the virus that causes it

Report of the WHO-China Joint Mission on Coronavirus Disease

Report of the who-china joint mission on coronavirus disease

Rolling updates in coronavirus disease COVID-19

Covert coronavirus infections could be seeding new outbreaks

Machine Learning with R, the tidyverse, and mlr

Real-time forecasts of the COVID-19 epidemic in china from february 5th to february 24th

A preference-based evolutionary algorithm for multiobjective optimization: the weighting achievement scalarizing function genetic algorithm

Using a delay-adjusted case fatality ratio to estimate under-reporting. Report from the Centre for Mathematical Modelling of Infectious Diseases

Using a delay-adjusted case fatality ratio to estimate under-reporting. Centre for Mathematical Modeling of Infectious Diseases Repository

Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission

Estimates of the severity of COVID-19 disease. medRxiv

A novel coronavirus outbreak of global health concern

Unique epidemiological and clinical features of the emerging 2019 novel coronavirus pneumonia (covid-19) implicate special control measures

Temperature Significantly Change COVID-19 transmission in 429 cities

The global dynamics for an age-structured tuberculosis transmission model with the exponential progression rate

This research was partially supported by the Spanish Government under projects MTM2015-64865-P and MTM2017-89423-P; and the research group MOMAT (Ref. 910480) supported by the Complutense University of Madrid. Also thanks to M. Ramos-Rubio for his help with the manuscript.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

All authors contributed equally to the work.