key: cord-0259430-jmrq2xlt authors: Wei, H.-L.; Billings, S. A. title: Modelling COVID-19 Pandemic Dynamics Using Transparent, Interpretable, Parsimonious and Simulatable (TIPS) Machine Learning Models: A Case Study from Systems Thinking and System Identification Perspectives date: 2021-11-01 journal: nan DOI: 10.1101/2021.11.01.21265653 sha: 1a933c87fba3e70a00cfb5ad4e959d07c3939aa5 doc_id: 259430 cord_uid: jmrq2xlt Since the outbreak of COVID-19, an astronomical number of publications on the pandemic dynamics appeared in the literature, of which many use the susceptible infected removed (SIR) and susceptible exposed infected removed (SEIR) models, or their variants, to simulate and study the spread of the coronavirus. SIR and SEIR are continuous-time models which are a class of initial value problems (IVPs) of ordinary differential equations (ODEs). Discrete-time models such as regression and machine learning have also been applied to analyze COVID-19 pandemic data (e.g. predicting infection cases), but most of these methods use simplified models involving a small number of input variables pre-selected based on a priori knowledge, or use very complicated models (e.g. deep learning), purely focusing on certain prediction purposes and paying little attention to the model interpretability. There have been relatively fewer studies focusing on the investigations of the inherent time-lagged or time-delayed relationships e.g. between the reproduction number (R number), infection cases, and deaths, analyzing the pandemic spread from a systems thinking and dynamic perspective. The present study, for the first time, proposes using systems engineering and system identification approach to build transparent, interpretable, parsimonious and simulatable (TIPS) dynamic machine learning models, establishing links between the R number, the infection cases and deaths caused by COVID-19. The TIPS models are developed based on the well-known NARMAX (Nonlinear AutoRegressive Moving Average with eXogenous inputs) model, which can help better understand the COVID-19 pandemic dynamics. A case study on the UK COVID-19 data is carried out, and new findings are detailed. The proposed method and the associated new findings are useful for better understanding the spread dynamics of the COVID-19 pandemic. The past 18 months have witnessed the devastating spread of the COVID-19, a disastrous global pandemic which has been and still is affecting almost every single person at each corner of the world. The attention paid to COVID-19 over the past 18 months categorically surpasses that to anything else. For example, when searching with the keyword "COVID-19" and the scope of "abstract" in the database of Web of Science, the number of published articles is 94026. With the same keyword and scope, the number of published articles in the Elsevier's abstract and citation database, Scopus, is over 112,000. With the same keyword, but only search with the scope of "in the title of the article", the number of articles given by Google Scholar is over 263,000, and if the scope is changed to "anywhere in the article", the number of publications becomes reaches over 4,340,000. Clearly, the numbers of 3 Let S(t), E(t), I(t), R(t) and D(t) (t can be understood in time unit of day) denote the numbers of susceptible individuals, exposed (infected but not yet infectious to transmit the disease) individuals, infectious or infected actively individuals, those recovered from the disease, and those who have died from the disease, respectively. From the mathematical theory of infectious diseases, the spread dynamics of COVID-19 can be characterized by the following SEIR model [11] (1) with initial values S(t0) = S0, E(t0) = E0, I(t0) = I0, R(t0) = R0, and D(t0) = D0, satisfying S0 + E0 + I0 + R0 + D0 = N, where N is the full size of the susceptible population initially exposed at the initial time t0. The parameters β (average transmission rate) and r (average lethality) change with time t (e.g. day), and the other two parameters δ and γ are assumed to be positive constants, which are defined as δ=1/TALP and γ =1/TATT, where TAL and TAT represent the average latent period and the average transmission time, respectively. Clinical study results show that for COVD-19, TAL is around 5 days on average [12] and TAT is around 14 days [13] . Note that the five variables, S(t), E(t), I(t), R(t) and D(t), in the ordinary differential equation model (1) obey the following conservation law [11] : S(t)+E(t)+I(t)+R(t)+D(t) = N (for any t ≥ t0). Based on the SEIR model (1), Zingano et al. [11] developed the formula for calculating the reproduction number Rt as follow: Predict future behavior with new data Interpretation & explanation is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. ; https://doi.org/10.1101/2021.11.01.21265653 doi: medRxiv preprint 4 NARMAX methods were initially developed for solving control systems engineering modelling problems, especially complex nonlinear system identification tasks, but gradually have been successfully applied to a wide range of multidisciplinary domains including medical [14] , neuroscience [15] , social science [16, 17] , climate and weather [18] , space weather [19] , among others [13] . NARMAX methods employ discrete-time dynamic models [20] . Specifically, assume that a system output y is potentially driven by r input variables, designated by u1, u2, …, ur, the general form of the NARMAX model is: where y(t), u(t) and e(t) are the measured system output, input and noise sequences respectively at time instant t; ny, nu, and ne are the maximum lags for the system output, input and noise; τy and τu are a time delay between the input and output, and usually τy = τu =1; f[•] is some non-linear function that needs to be estimated from measured or observed input and out data. Note that the noise e(t) is unmeasurable but can be replaced by the model prediction error in system identification procedure. The noise terms are included to accommodate the effects of measurement noise, modelling errors, and/or unmeasured disturbances. In practice, NARMAX models can be implemented using different approaches, such as recurrent neural networks [21] , radial basis function (RBF) networks [22, 23] , wavelet neural networks [24, 25] , along with others [13] . More than often, the polynomial representation, due to its attractive interpretation properties, is employed to implement NARMAX models [26] [27] [28] . In this study, the power-form polynomial basis is considered. NARMAX model identification usually starts from a specified dictionary or library, consisting of a sufficiently large number of candidate model elements (e.g. model terms or regressors), each of which is formed by the lagged system input and output variables, such as y(t-1), u1(t-2), u2(t-1), u1(t-1)u2(t-1). A model construction algorithm (or a combination or ensemble of a several algorithms) can then be performed on the dictionary together with a training dataset, to construct sparse or parsimonious models. For example, for a single-input single-output (SISO) system, set ny = 1, nu =1, ne = 0, τy = τu = 1, the candidate dictionary with the nonlinear degree 3  is: The final identified model may be of the form: An efficient and commonly used algorithm for NARMAX model construction is the well-knows forward regression with orthogonal least squares (FROLS) [13, 29] and its variants, e.g., minimization error minimization [30] , iFROLS (iterative FROLS) [31] , uFROLS (ultra-FROLS) [32] . Recently, the LASSO algorithm [33] has also been applied to system identification [34] and feature selection for classification tasks [35] but it turned out that LASSO did not outperform FROLS, because "the lasso is not a very satisfactory variable selection method in the p n case" where n is the number of observations and p is the number of predictors [36] . However, p n is a common case in many real is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. ; https://doi.org/10.1101/2021.11.01.21265653 doi: medRxiv preprint 5 applications. It was theoretically proved by Johnson et al. [37] that the "optimal" L1-norm solutions are often inferior to L0-norm solutions found using stepwise regression; they also compared algorithms for solving these two problems and showed that although L1-norm solutions can be efficient, the "optimal" L1-norm solutions are often inferior to L0-norm solutions found using greedy classic stepwise regression. The FROLS algorithm uses a simple but efficient index, called error reduction ratio (ERR) [13, 29] , to measure the importance or significance of each candidate model term (element) included in the specified dictionary or library, and determine which ones should be included in the model in order of their importance, e.g., in terms of the contributions the can make to explaining the variation of the target signal (system output). The model construction procedure usually leads to transparent, interpretable, parsimonious and simple/simulatable (TIPS) models. A rigid model validation approach [38, 39] can guarantee the validity of the final identified model to sufficiently represent the input-output relationship hidden in the data. Taking the case study, on the UK Understanding Society (UKUS) data, presented in [17] as an example, the identified models using the NARMAX methods show that, on the collective national level of the UK, the factors that appear to have significantly positive impact on happiness are: 'income (living comfortably),' 'income (doing alright),' 'income (just about getting by),' 'retired,' 'health (excellent),' 'health (every good).'; the combination of the two variables of 'retirement' and 'above 65' is also an important model term, which can be explained that that the retired people who are over 65 years old are more likely to be happy. The model also revealed that marriage could enhance the positive relationship between good health status and happiness, while smoke could enhance the negative effect of low income on happiness. More detailed descriptions of NARMAX methods can be found in [13] . This work uses a 10 fold-cross validation scheme and the FROLS algorithm to build TIPS models. In doing so, the entire dataset is first split two parts (for training and testing, respectively), and the training data are then split two parts (around 70%:30%) which are used for model structure selection and performance validation. The standard version of the FROLS algorithm is described in detail in the Appendix. The section provides case studies on the UK COVID-19 data modelling and analysis using the TIPS modelling approach based on the NARMAX methods. The two main objectives are: 1) To establish the relationships between the R number, the daily infection cases and deaths; 2) To make predictions of the daily infection cases and deaths in advance of more than 10 days. The data were extracted from the Johns Hopkins Coronavirus Resource Center (https://coronavirus.jhu.edu/about/how-to-use-our-data). For all the case studies, a total of 529 daily data (infection cases and deaths), from 4 March 2020 to 15 August 2021, are considered for the analysis purpose here, of which the first 361 data (4 March 2020 -28 Feb 2021) are used for model training, estimation and validation, and the remaining 168 data (1 March 2021 -15 August 2021) are used for model performance testing. To examine and investigate the impact of the daily R number values on the daily infection cases in later days, the daily R number is treated as an input and the daily infection cases as the output (response), and is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. ; https://doi.org/10.1101/2021.11.01.21265653 doi: medRxiv preprint 6 the settings for the NARMAX model (1) are: nu = 42 (maximum time lag in input), ny = 0 (without including lagged autoregressive term), ne = 0 (without including noise term), τu = 1 (time delay). The identified model is: 4 3 4 ( ) 3.5551 10 ( 12) 6.2117 10 ( 40) where y = 'daily infection cases' and u = 'daily R number'. The model was simulated driven by the 513 daily data. A comparison between the model predicted values and the real data, over the 361 training data points (4 March 2020 -28 Feb 2021) and 168 test data (1 March 2021 -15 August 2021), is shown in Figure 2 . Model (6) indicates that the R number value of the current day may impact the daily infection cases of 12 days later, lasting until 40 days. This is also reflected in Figure 2 . This important finding has not been noticed in any previous study. Using the information given in by model (6) setting nu = 42 (maximum time lag in input), ny = 42 (maximum lag in output), ne = 0 (without including noise term), τy = τu = 12 (time delay) and the nonlinear degree =2, a best NARMAX model was identified, which is shown in Table 2 . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint In this third case, the settings as follows. The R number and daily infection cases are used as two inputs, and the number of daily deaths is considered to be the output. The other coefficients of the NARMAX (1) are chosen as: nu = 42 (for both inputs), ny = 0 (no autoregressive variable is included in the model), ne = 0 (without including noise term), τu =12 (time delay) and the nonlinear degree =2. The identified NARMAX model is shown in Table 2 . Note that the model identification algorithm did not find any nonlinear models that outperforms the linear model shown in Table 1 , this suggests or implies that there is no or very weak nonlinear relation along the input and output variables; the relationship is dominated by linearity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint It can be observed from Table 2 that the values of the R number (input u1) can potential affect the mortality after 12 days lasting until 41 days, whereas the daily infection cases (u2) can potential affect the mortality after 13 days lasting until 27 days. From Table it can be noticed that the daily infection cases of current day (with ERR = 75.9%) potentially highly affect the number of deaths 13 days later. The model was simulated over the entire data (4 March 2020 to 15 August 2021). A comparison between the model predicted values and the corresponding records, over the training and test data, is shown in Figure 3 , where it can be seen that the model performs excellent on the training data and most part of the test data (e.g. until 21 June 2021), with the value of R-square is close to 0.8). However, it can be clearly seen that the model fails to predict the number of daily deaths in most recent days (e.g. around 21 June 2021 and onward), although the daily infection cases is still very high as shown in Figure 3 . This is reasonable and may probably be explained that more and more people have received a second vaccine which has helped significantly reduce the death rate, and this confirms the effectiveness of the UK government's vaccination policy. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. ; https://doi.org/10.1101/2021.11.01.21265653 doi: medRxiv preprint 9 It is worth mentioning the following two points: 1) Under the settings of nu = 42, ny = 0, ne = 0, =2, the best variables that the model training algorithm identified are u2(t-13), u1(t-14), etc. This suggests that the inclusion of any other variables cannot enhance the explanation of the variation of the daily death cases using the daily infection cases and the daily value of R number, and therefore cannot help improve the prediction performance. 2) It is straightforward to detect the periodical change of the daily death cases (with a period of 7 days), the correlation between the lagged variable y(t-7) and the original signal is as high as 0.93. The main purpose of this section is to reveal the inherent dynamics that project daily infection cases and R-number to daily death cases many days later. To avoid the impact of the strong autocorrelation on the analysis of the underlying dynamics between the input and output variables, the lagged autoregressive variables, such as y(t-1), y(t-2), etc. were deliberately not considered when constructing the model in this case study. The prediction of the COVID-19 pandemic is important and challenging. However, a complicated blackbox that lacks interpretability (e.g. without explicitly providing information on the inherent dynamics) may become less useful or powerful for applications where there is a need to know the relationship of the inherent dynamics. The main attention of this work was paid to developing a glass-box modelling approach. The main contributions of the work can be summarized as follows: Firstly, it proposed a TIPS-ML framework based on the NARMAX methods, and applied the proposed approach to modelling the spread dynamics based on the UK COVID-19 data. In comparison with other complicated machine learning methods, the proposed method has several highly attractive properties, such as transparency, interpretability, parsimony, and simplicity/simulatability. These properties are very important for investigating and understanding the spread dynamics of the pandemic, which may not be able to obtained by using other machine learning methods (e.g. those complicated black-box neural network models). Secondly, some important new findings have been obtained from the identified TIPS models. For example, the R number of the current day may significantly impact the daily infection cases 12 days and last as long as 42 days; the combinational effect of R number and infection cases of current day can be potentially very high on the infection cases 12 days later; the number of daily deaths is highly dependent on R and daily infection cases (DIC) but lags R from 14 to 41 days and lags DIC from 13 to 27 days. These new findings, which have not been observed before, are useful for better understanding the spread dynamics of the pandemic. The case studies carried out in this work focused on the UK COVID-19 data. In future, more data of different countries will be considered and analyzed using the proposed method will be applied, to investigate and compare the pandemic spread dynamic patterns, from which to acquired information that may be useful for healthcare and infectious disease studies. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. ; https://doi.org/10.1101/2021.11.01.21265653 doi: medRxiv preprint The TIPS models are built using a 10 fold-cross validation scheme and based on the FROLS algorithm. Taking a single-input, single-output as an example, an initial TIPS model can easily be converted into a linear-in-the-parameters form: The initial regression model (A1) often involves a large number of candidate model terms. Experience suggests that most of the candidate model terms can be removed from the model, and that only a small number of significant model terms are needed to provide a satisfactory representation for most nonlinear dynamical systems. The FROLS algorithms (Billings, 2013) can be used to select significant model terms. Consider the term selection problem for the linear-in-the-parameters model (A1). Let A non-centralised squared correlation coefficient will be used to measure the dependency between two associated random vectors. The non-centralised squared correlation coefficient between two vectors x and y of size N is defined as The squared correlation coefficient is closely related to the error reduction ratio (ERR) criterion (a very useful index in respect to the significance of model terms), defined in the standard orthogonal least squares (OLS) algorithm for model structure selection (Chen et al., 1989; Billings, 2013) . The model structure selection procedure starts from equation (A1). Let y r  0 , and )} , . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the correlation coefficient defined by (A2). The first significant basis can thus be selected as 1 1  φ α  , and the first associated orthogonal basis can be chosen as . The model residual, related to the first step search, is given as is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. The model residual n r will be used to form a criterion for model selection, and the search procedure will be terminated when the norm 2 || || n r satisfies some specified conditions. Note that the quantity ) , ( ERR Finally, a mean square error (MSE) based algorithm, e.g. Akaine's information criterion (AIC), Bayesian information criterion, generalized cross-validation (GCV) and adjustable prediction error sum of squares (APRESS) can be used to determine the model size [40] . It is easy to verify that the relationship between the selected original bases 1 ,, n αα , and the associated orthogonal bases 1 ,, n qq , is given by is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted November 1, 2021. ; https://doi.org/10.1101/2021.11.01.21265653 doi: medRxiv preprint Analytical features of the SIR model and their applications to COVID-19 SEIR modeling of the COVID-19 and its dynamics 2020) A modified SEIR model with confinement and lockdown of COVID-19 for Costa Rica. medRxiv A modified SEIR model to predict the COVID-19 outbreak in Spain and Italy: simulating control scenarios and multi-scale epidemics Time series forecasting of new cases and new deaths rate for COVID-19 using deep learning methods Prediction of COVID-19 confirmed cases combining deep learning methods and Bayesian optimization Time series prediction for the epidemic trends of COVID-19 using the improved LSTM deep learning method: Case studies in Russia, Peru and Iran Sparse, Interpretable and transparent predictive model identification for healthcare data analysis NARMAX model as a sparse, interpretable and transparent machine learning approach for big medical and healthcare data analysis Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains Herd immunity for COVID-19 in homogeneous populations Reproduction number (R) and growth rate (r) of the COVID-19 epidemic in the UK: methods of estimation, data sources, causes of heterogeneity, and use as a guide in policy formulation Transmission of SARS-CoV-2: implications for infection prevention precautions: Scientific Brief The prediction of in-flight hypoxaemia using non-linear equations Nonlinear modeling of cortical responses to mechanical wrist perturbations using the NARMAX method The dominance of food supply in changing demographic factors across Africa: A model using a systems identification approach Significant indicators and determinants of happiness: Evidence from a UK survey and revealed by a data-driven systems modelling approach Complex systems modelling for statistical forecasting of winter North Atlantic atmospheric variability: A new approach to North Atlantic seasonal forecasting System identification and data-driven forecasting of AE index and prediction uncertainty analysis using a new cloud-NARX model Discrete-Time Dynamic Models Nonlinear Identification and Control: A Neural Network Approach Orthogonal forward selection for constructing the radial basis function network with tunable nodes Generalized multiscale radial basis function networks A new ridge basis function neural network for data-driven modeling and prediction Boosting wavelet neural networks using evolutionary algorithms for short-term wind speed time series forecasting Improved structure selection for nonlinear models based on term clustering Retrieving dynamical invariants from chaotic data using NARMAX models A Bird's Eye View of Nonlinear System Identification Orthogonal least squares methods and their application to non-linear system identification An identification algorithm for polynomial NARX models based on simulation error minimization An iterative orthogonal forward regression algorithm Ultra-orthogonal forward regression algorithms for the identification of non-linear dynamic systems Regression shrinkage and selection via the lasso A novel LOO based two-stage method for automatic model identification of a class of nonlinear dynamic systems Sparse representation-based classification: Orthogonal least squares or orthogonal matching pursuit? Regularization and variable selection via the elastic net A risk ratio comparison of L0 and L1 penalized regression Nonlinear model validation using correlation tests Fast orthogonal identification of nonlinear stochastic models and radial basis function neural networks Nonlinear predictive model selection and model averaging using information criteria