key: cord-0160097-gaousjt0 authors: Lima, Alexandre Barbosa de title: An exploratory time series analysis of total deaths per month in Brazil since 2015 date: 2020-08-26 journal: nan DOI: nan sha: c18aa63af13585390e59ca05c3f79feb8782008b doc_id: 160097 cord_uid: gaousjt0 In this article, we investigate the historical series of the total number of deaths per month in Brazil since 2015 using time series analysis techniques, in order to assess whether the COVID-19 pandemic caused any change in the series' generating mechanism. The results obtained so far indicate that there was no statistical significant impact. In broad terms, a time series consists in a set of numbers corresponding to the observation of a certain phenomenon. Figure 1 shows the Nile River mimima for the years 622 AD to 1284 AD 1 [9] . By nature, such numbers are realizations of random variables. In general, a collection of random variables, {x t }, indexed by t, is referred to as a stochastic process [10] . In this paper, t will be discrete and vary over the integers t = 0, ±1, ±2, . . .. Box and Jenkis [11] introduced the class of stationary ARMA(p, q) (autoregressive moving average) models 2 where µ = E{x t } is the mean of x t , {φ 1 , φ 2 , . . . , φ p } and {θ 1 , θ 2 , . . . , θ q } are parameters of the model, and w t is a wide-sense stationary white noise process with zero mean and power σ 2 , i. e., w t ∼ (0, σ 2 ). In a more compact form, we have where x t = x t − µ, B is the backward shift operator (Bx t = x t−1 ), φ(B) is the autoregressive (AR) operator of order p and θ(B) denotes the moving average (MA) operator of order q In the rest of this paper, we will assume µ = 0 without loss of generality. The process x t can be viewed as the output of a digital filter (ARMA filter) whose input is w t , with system function where H(z) denotes the z-transform of the impulse response h t of the ARMA filter. An ARMA(p, q) process x t is said to be wide-sense stationary (or non-explosive) if the poles of H(z) in (5) lie inside the complex unit circle (|z| = 1), and it is invertible if the zeros of H(z) in (5) lie inside the unit circle. The autocorrelation function (ACF) 3 ρ h of an ARMA(p, q) process shows exponentially decay to zero, i. e., at lag h converges rapidly to zero as h → ∞ (short memory property) [11] . A random process x t is wide sense (or weakly) stationary if its mean is constant [12, p. 298 ] and if its ACF depends only on the lag τ = t 2 − t 1 : In the literature, it is common to use the terms time series and stochastic process interchangeably [10] , [13] , [14] , [15] . From now on, we will only use the term time series. The context will indicate to the reader whether it is a process or a realization of a process. The modeling of a time series x t consists on estimating an invertible function h(.), called model of x t , such that in which w t ∼ Independent and Identically Distributed -IID and in which g(.) = h −1 (.). The process w t is the innovation at instant t and represents the new information about the series that is obtained at instant t. In practice, the adjusted model is causal, i. e., The model construction methodology is based on the iterative cycle illustrated by Fig. 2 [11] : (a) a general class of models is considered for the analysis (specification); (b) there is the identification of a model, based on statistical criteria; (c) it follows the estimation phase, in which the model's parameters are obtained. In practice, it is important that the model is parsimonious 4 ; and (d) at last, there is the diagnostic of the adjusted model by means of a statistical analysis of the residual series w t (is w t compatible with a white noise process?) The process x t of (10) is linear when it corresponds to the convolution of a process w t ∼ IID and a deterministic sequence h t [16] [p. 377] 3 We assume that the autocorrelation function is given by ρ h = γ h γ 0 , where γ h corresponds to the autocovariance of xt at lag h. 4 We say that a model is parsimonious when it uses few parameters. The use of an excessive number of parameters is undesirable because the uncertainty degree of the statistical inference procedure increases with the number of parameters. Eq. (11) is also known as the infinite order moving average (MA(∞)) representation [17] . As the models in practice are invertible, the model of x t can be rewritten in an infinite order autoregressive (AR(∞)) form: An order p AR model satisfies the equation in which φ(B) is an order p polynomial. In practice, the order p of an AR series is unknown and must be empirically specified. In this paper, we use an information criterion function [14] as will be explained below. The basic idea of an ARMA model selection criterion is to choose the orders k and l that minimize the quantity in whichσ 2 k,l is a residual variance estimate obtained by adjusting an ARMA(k, l) model to the N series observations, and C(N ) is a function of the series size. The quantity (k + l) C(N ) N is called penalty term and it increases when the number of parameters increases, whileσ 2 k,l decreases. Akaike proposed the information criterium [18] , [19] AIC(k, l) = lnσ 2 k,l + known as AIC, in whichσ 2 k,l is the maximum likelihood estimator of σ 2 w for an ARMA(k, l) model. Upper bounds K and L for k and l must be specified. Eq. (15) has to be evaluated for all possible (k, l) combinations with 0 ≤ k ≤ K 0 ≤ l ≤ L. In general, K and L are functions of N , for example, K = L = ln N . For the case of AR(p) models, (15) reduces to Having identified the AR model's order p, we can go to the parameters estimation phase. The methods of moments, Least Squares and Maximum Likelihood may be used [20] , [21] . As, in general, the moments estimators are not good [21] , statistical packages as R and S-PLUS use some Least Squares or Maximum Likelihood estimator. If a process which corresponds to the difference of order d = 1, 2, . . . of x t is stationary, then y t can be represented by an ARMA(p, q) model In this case, is an ARIMA(p, d, q) model and we say that x t is an "integral" of y t [21] because The ARIMA(p, d, q) model is marginally stable [22] , as it has d roots on the unit circle. Also, x t of (19) is a homogeneous non-stationary process (meaning non-explosive) or having unit roots [13] , [14] , [21] . Observe that [21] [p.139]: (a) d = 1 corresponds to homogeneous non-stationary series with respect to the level (they oscillate around a mean level during a certain time and then jump to another temporary level); (b) d = 2 corresponds to homogeneous non-stationary series with respect to the trend (they oscillate along a direction for a certain time and then change to another temporary direction). The ARIMA model (19) may be represented in three ways: (a) ARMA(p + d, q) (similar to Eq. (1)) (b) AR(∞) (inverted format), given by (12) or (c) MA(∞), according to (11). Consider the model y t ∼ I(1) in which x t is a stationary process. If we assume the initial condition y 0 , (23) can be rewritten as an integrated sum The integrated sum t j=1 x j is called stochastic trend and it is denoted by T S t . Observe that in which T S 0 = 0. (23), then y t is known as random walk. Including a constant in the right side of (23), we have a random walk with drift, Given the initial condition y 0 , we can write The mean, variance, autocovariance and ACF of y t are given by [21] µ t = y 0 + tθ 0 (28) Observe that ρ k (t) ≈ 1 when t >> k and the literature states that the random walk has "strong memory" [14] . The random walk's Sample Autocorrelation Function (SACF) decays linearly for large lags. Spectral analysis is a well-established research area [20] . However, the estimation of the power spectrum of a signal is not a trivial matter. There are two classes of spectral analysis techniques currently in use: parametric (or model-based) and nonparametric analysis. Both methods are used in this work. The fundamental idea of parametric spectral analysis is fairly simple. The parametric approach assumes that the signal satisfies a generating model, such as an AR(p), with known functional form and then proceed by estimating the parameters in the assumed model [20] , [23] . The most widely used form of parametric Power Spectral Density (PSD) estimation uses an AR(p) model [20] . Let us now consider the nonparametric (or classical) method. Then, the estimation of PSD of a time series x t can be made using periodogram methods based on the Discrete Fourier Transform (DFT), which can be efficiently calculated by a Fast Fourier Transform (FFT) Algorithm. In the sequence we present the periodogram method. Consider a time series x t with N values or samples, i. e., x t = 0 outside the time interval 0 ≤ t ≤ N − 1. In some cases of interest, we consider that x t has a size N , even if its actual size is M ≤ N (in such cases the series x t must be completed with (N − M ) zeros (zero padding). The equation (32) is the DFT of x t : The PSD of x t can be estimated by calculating the periodogram, given by where X[k] denotes the DFT of x t , and The periodogram is an asymptotically unbiased spectral estimator of the PSD of a random signal [20] , [23] . Its main problem lies in its large variance. In other words, the periodogram is inconsistent (i. e., the dispersion of the estimates is independent of N ). This motivates the use of a "refined periodogram method", like the Daniell method. It is possible to show that the periodogram values P x (f k ) are asymptotically uncorrelated random variables [20] . Thus we may reduce its large variance by weight averaging the periodogram over small intervals centered on the current frequency f k . The practical form of the Daniell estimate can be performed using the FFT. This work uses a Daniell kernel for nonparametric PSD estimation. For further details, please refer to [20] and [23] . The series in the Fig. 3 strongly suggests that we can specify a model of the type [10] [p. 58] where x t are the observations, y t is a stationary process, and µ t denotes a linear trend given by the regression model in which β 0 and β 1 are the intercept and the slope parameters. Table 1 shows the estimated coeffcientes for (35) and its p-values, which are negligible. The goodness of fit is summarized by the R 2 of the regression [13] [p. 169]. Note that the R 2 statistics given by Table 2 indicate that the proposed model for µ t explain approximately 75.8% of the total variability of the data. The great value of the F-statistic and the neglibible model p-value show that we can not reject the null hypothesis of linear regression. Thus, we can consider a linear model for (35) to be statistically significant given the statistical significance level of 0.01. Figure 4 shows the time series with the superimposed linear regression model. To verify that the model (35) is appropriate, it is also necessary to investigate the residuals. This is what is called residual analysis. The residuals of (35 ) correspond to the discrepancies between the observed values (µ) and the values adjusted (μ) by the model. The i-th residual is given byê Figure 5 shows the residuals vs fitted model and the Quantile-Quantile plot (QQ-plot) of the residuals, which suggests that they are normally distributed. Table 3 shows a residual diagnostics using the Jarque-Bera and Shapiro-Wilks tests [13] [p.61] for the null hypothesis of normality of the residuals. The null of normality is not rejected using either tests. At this point, we can conclude that: • there is no change point in the deterministic linear trend of the historical series. The first step in exploratory analysis is to remove the deterministic trend of Eq.(35) [10] [p. 58]. There are two alternatives: remove the line estimated by the regression or take the first difference in the series. As our goal is to coerce the data to (a possible) stationarity, then differencing may be more appropriate [10] [p. 61]. The first two samples of the historical series were discarded so that the series corresponding to the first difference has 64 points, that is, 2 6 points, which facilitates the spectral analysis with FFT. The first difference can be denote as (37) Figure 6 shows the series r t (we also demeaned the series) and its SACF. Figure 7 shows the histogram of r t with a superimposed normal distribution and its Q-Q plot, which suggests that r t follows a normal distribution. Figure 8 shows the smoothed periodogram using the Daniell method and the PSD for the estimated model of r t , which is an AR (11) . We used the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [24] for the null hypothesis that r t is level stationary, i. e., that the series is I(0). We obtained a p-value of 0.1, which means that one can not reject that r t is level stationary. Finally, but not least, an informal analysis of Fig. 6 suggests that the variance of r t has a change point around t = 22 (2017/jan). However, if this really happened, which we cannot guarantee with the techniques employed in this article, ocurred long before the outbreak of COVID-19 in Brazil. This possible change point motivates a time-frequency domain analysis using wavelets, as the localized nature of wavelet coefficients allows one to analyze the evolution of the series variance over time [20] . This will be investigated in future work. In this paper, we presented an exploratory time series analysis of the historical series of the total number of deaths per month in Brazil since 2015. Our preliminary results indicate that: • there is no statistical evidence that COVID-19 affected the deterministic linear trend of the historical series, i. e., on average, the monthly growth in the number of deaths, which is approximately 675 deaths/month, did not change since the first recorded death in Brazil on March 16, 2020; • there is no change point in the deterministic linear trend of the historical series; • there is significant statistical evidence that the first difference time series is stationary; and • there is no statistical evidence that COVID-19 provoked a change in the stochastic process that generates the time series under analysis 5 . These results are thought provoking and not intuitive. COVID-19 has caused many deaths around the world. This is an indisputable fact. However, our results suggest that this disease does not have so far an additive effect on the total number of deaths per month in Brazil. What would be a plausible explanation for this strange result? In any case, further research should be carried out to confirm the results obtained. In future work, we will analyze the historical series using wavelet methods. COVID-19 no Brasil Brazilian Federal Ministry of Health (Ministério da Saúde do Brasil) Coronavirus disease (COVID-19) pandemic COVID-19 Dashboard by the Center for Systems Science and Engineering The R Project for Statistical Computing Code and Data Civil Registry Offices of Brazil (Cartórios de Registro Civil do Brasil) Conselho Nacional de Justiça Mémoire sur l'histoire du Nil, ser. Mémoires de l'Institut d'Egypte: Institut d'Egypte. Le Caire : Imprimerie de l'Institut français d'archéologie orientale Time Series Analysis and Its Applications with R Examples Time Series Analysis: Forecasting and Control Probability, Random Variables, and Stochastic Processes Modeling Financial Time Series with S-PLUS Analysis of Financial Time Series Stable non-Gaussian random processes Introduction to Time Series and Forecasting Information theory and an extension of the maximum likelihood principle A new look at the statistical model identification Spectral Analysis for Physical Applications Análise de Séries Temporais Digital Signal Processing Spectral Analysis of Signals Testing the null hypothesis of stationarity against the alternative of a unit root