key: cord-0004162-09qem2ay authors: Liu, Liu; Yue, Jin; Lai, Xin; Huang, Jianping; Zhang, Jian title: Multivariate nonparametric chart for influenza epidemic monitoring date: 2019-11-25 journal: Sci Rep DOI: 10.1038/s41598-019-53908-6 sha: 0bcd6bf98b395092a903cf17c6099299d5013940 doc_id: 4162 cord_uid: 09qem2ay Control chart methods have been received much attentions in biosurvillance studies. The correlation between charting statistics or regions could be considerably important in monitoring the states of multiple outcomes or regions. In addition, the process variable distribution is unknown in most situations. In this paper, we propose a new nonparametric strategy for multivariate process monitoring when the distribution of a process variable is unknown. We discuss the EWMA control chart based on rank methods for a multivariate process, and the approach is completely nonparametric. A simulation study demonstrates that the proposed method is efficient in detecting shifts for multivariate processes. A real Japanese influenza data example is given to illustrate the performance of the proposed method. Control charts are useful tools for fault detection 1 . Shewhart chart, CUSUM chart and EWMA chart are most popular tools in statistical process control. These control charts are efficient and fruitful for fault diagnosis in practical applications. Most control charts that need observations are univariate and usually assume that the observation follows a known gaussian distribution. In real life, we usually process multivariate or high-dimensional variables rather than univariate variables. The monitoring of high-dimensional data in a timely manner has become increasingly important in quality control. Hotelling 2 proposed a T-squared control chart for multivariate process, which assumes that the dataset distributions are multivariate normal distribution. Both the parameters of mean vector and variance matrix are known. Based on T 2 statistics, Lowry et al. 3 proposed a multivariate CUSUM chart. Furthermore, Sullivan and Woodall 4 provided a change-point chart for detecting a shift of the location parameter, the scale parameter. However, statistical process control is a challenge when the underlying distribution and the magnitude of changes are both totally unknown. For the situation of a multivariate process with an unknown distribution, Yue and Liu 5 , from the point of Mahalanobis data depth, introduced a chart for monitoring processes for multivariate process. Data depth is efficient and totally nonparametric. However, the computational complexity is high as the number of variables grows and may influence the performance of detection of a chart. In addition, the covariance matrix of the data depth method is constant 5 . Therefore, the method may be unsuitable when the covariance matrix in a process is not stable. Zou and Tsung 6 proposed a new multivariate EWMA chart to detect location parameters. The chart is affine-invariant, and its controlled run length distribution is the same for the class of distributions with elliptical directions. Some strictly distribution-free rank-based methods have been developed to increase the efficiency in detecting a nonparametric process [7] [8] [9] . The computation speed of these rank-based methods is fast, and the methods are easy to implement. However, all of these methods focus on a univariate process. In our article, we introduce a new nonparametric multivariate EWMA chart based on rank method, which is combined with the Hotelling T 2 statistic for a multivariate process. This method is completely distribution-free, and it is easy to implement in applications. Moreover, the covariance matrix of observations keeps being updated as new observations arrive. Additionally, the computation load is very light. For multivariate or high-dimensional statistical process control, location parameter shifts sometimes occur in only one or a few characteristics in a process. We want to detect these shifts quickly, accurately and to identify the shifted location parameter components. Consider this issue, fruitful nonparametric control charts have been introduced in the literature. Qiu and Hawkins 10 and Hawkins 11 constructed a new multivariate statistical process control chart and indicated that proposed chart was more efficient than the T 2 control chart when a shift occurred in only one characteristic. However, the shift of a process is usually unknown and may occur in several highly correlated variables. To address this issue, in the context of a process where the location parameter often changes in a few number of variables, Zou and Qiu 12 proposed a useful multivariate statistical process control chart by using the LASSO tool. In addition, inspired by Zou and Tsung, Liang et al. 13 came up with a new multivariate EWMA chart to monitor sparse mean changes. In our paper, the proposed method is designed to detect sparse mean changes, and the results shows that this method performs relatively better in applications. Previous studies showed that the multivariate control chart could be useful for biosurveillance. Rogerson and Yamada 14 proposed a multivariate cumulative sum approach to detect the change in spatial patterns and applied it to a county-level breast cancer datasets. Their results suggested that the proposed chart for multivariate process performed relatively better compared with the univariate method when shifts occurred in many regions. Abdollahian and Hayati Rezvan 15 applied a multivariate EWMA control chart to monitor patient's progress after cardiac surgery, in which the proposed multivariate EWMA chart can detect an out-of-control signal that was missed by the univariate EWMA charts. This is because that the correlations between charting statistics are ignored in univariate chart. Then the univariate chart may give a misleading indication when such correlation is considerably high. The structure of this paper is organized as follows: in Section 2, the rank-based method is given, and a nonparametric chart for online monitoring is provided. A simulation of this control chart is presented in Section 3. Real data are studied to illustrate the performance of the proposed control chart in Section 4. Finally, some conclusions are presented in Section 5. EWMA control chart. The EWMA control chart has good properties for control applications. Lucas and Saccucci 16 studied the performance of EWMA and CUSUM charts. In their paper, the EWMA chart has relatively better performance for small shifts with an appropriate smoothing parameter. The EWMA control chart is first introduced for univariate variables. The EWMA control chart is easy to construct and implement, and it is based on the following statistic: where the starting value is Z 0 = 0, and λ is a smoothing parameter. X i represents the observations in a process. The EWMA chart corresponds to a Shewhart control chart when λ = 1. The weight of the historical data is decided by the magnitude of the smoothing parameter. A process is considered out-of-control (OC) whenever Z i falls outside the range of the control limits. A rank-based method is first given for a one-dimensional process. Liu et al. 9 introduced the rank-based method and assumed that independent observations, X i , follow the model below: where μ 0 is the in-control (IC) location parameter, and μ 1 is the OC location parameter. τ is the unknown change point. F is an unknown continuous distribution function. Let R i denote the i th sequential rank; Liu et al. 9 presented the formula for the rank of X i among X 1 , X 2 , …, X i , …, X n as follws: The standardized sequential rank was defined as www.nature.com/scientificreports www.nature.com/scientificreports/ Therefore, the distribution of R i * is defined in the interval The asymptotic distribution of R i * is U(− 3 , 3 ) as i → ∞. In the context of a multivariate process, it is supposed that there are m independent observations from an unknown multivariate continuous distribution with dimensionality p. That is, There are p characteristics to be examined that we are interested in. For a set of variables, Y j,1 , Y j,2 , …, Y j,m , j = 1, 2, …, p, which represents the j th characteristic with m observations, the rank-based method can be used to construct statistics. When the observations are p-dimensional, the i th observations are * denote the i th standardized sequential rank with the arrival of the j th component Y j,i . Therefore, the vectors , <λ k ≤ 1 represents the smoothing parameter. I represents the p-dimensional identity matrix. If there is no a priori information given, different smoothing parameters are needed for different components; then, λ 1 = λ 2 = ··· = λ k = ··· = λ p are used, and the starting value is Z 0 = (0, 0, …, 0)′. The process is considered to be OC if a manufacturing or business process is in a state of uncontrollable (i.e. Z i where L is the upper control limit. And the covariance matrix of Z i is as follows: λ is a fixed value. Usually, we take the limit form, Σ Zi = λ/(2−λ)Σ. Σ, the covariance matrix of Q i , is estimated from samples in practice. In the art of research, fruitful distribution-free control charts have been introduced. If a chart IC run-length distributions are the same to every continuous distribution 17 , we call this chart is nonparametric or distribution-free. We discuss the choice of parameter by using the multivariate normal distribution. This indicates that the determine of parameters are still valid when a series of observations obey other distributions. Therefore, we consider the i th observation, X i , is collected as time goes by using the following relational model: And α is the probability of a type I error and β is the probability of a type II error. For a fair comparison, we usually fix α and compare β. A small β is considered better. The average run length (ARL) is the number of points that, on average will be plotted on a control chart before an OC signal. If a manufacturing or business process is IC: www.nature.com/scientificreports www.nature.com/scientificreports/ If the process is considered OC: Therefore, we fix IC ARL, ARL 0 and compare OC ARL, ARL 1 . A small ARL 1 is considered better. Meanwhile, inspired by Han and Tsung 18 , we consider the relative mean index (RMI) values to evaluate the average performance of these charts for detecting a range of parameter changes, which are given as following: where m is the number of shifts that we considered. When detecting a certain shift δ X , ARL δX denotes as the OC ARL of these given charts. And MARL δX is the smallest OC ARL among all the OC ARL values of these charts when detecting a certain shift δ X . The RMI calculates the average of all the detection efficiency values 18 . A control chart with a relatively smaller RMI value is regarded as relatively better detection efficiency. We suppose that there are 1000 independent and identically distributed historical (reference) observations. X 1 , X 2 , …, X 1000 are 1000 random observations from N(μ 0 ,Σ 0 ). To make a fair comparison, all of these control charts have the same IC zero-state ARL, which is equal to 500. It should be note that zero-state run lengths refer to the run lengths of control charts initialized at the target value 16 . When the process goes OC, a chart is considered as a better detection efficiency with a small ARL. The ARLs of these EWMA methods with λ = 0.03 for a range of shifts are presented in Table 1 . EWMA 1 represents the rank-based EWMA scheme, and EWMA 2 represents an EWMA control chart based on the Mahalanobis depth method 5 . We also provide simulation studies with the non-diagonal covariance matrix The ARLs of the EWMA scheme with λ = 0.03 for a range of shifts under N(μ 0 ,Σ 1 ) are presented in Table 2 . In addition, the detection performance of these charts under a bivariate Weibull distribution, LBVW(θ 1 , θ 2 ,α,ρ) are Table 6 . ARL comparisons for the EWMA control chart designed to detect a shift under multivariate Poisson(θ 1 + δ X , θ 2 , θ 0 ) with a zero-state ARL = 500, where (θ 1 , θ 2 , θ 0 ) = (0.5, 0.6, 0.2). www.nature.com/scientificreports www.nature.com/scientificreports/ shown in Table 3 . θ 1 and θ 2 are the scale parameter. α is the shape parameters. ρ is the correlation coefficient. When a process is IC, θ θ α ρ X X LBVW ( , ) ( , , , ) when the process is OC. Tables 1-3 provide the ARL of the EWMA 1 and EWMA 2 control charts for a range of shifts δ X . Tables 1-3 show that the EWMA 1 control chart has a relatively better performance for detecting small shifts. EWMA 2 has a better performance for detecting large shifts. On the whole, EWMA 1 has a relatively small RMI. Table 4 presents the simulation results under N(μ 2 ,Σ 2 ), where μ 2 = (0, 0, 0, 0, 0, 0) and Σ 2 is 6 × 6 indentity matrix. Table 4 shows that EWMA 1 still performs better. Sometimes, we encounter the case that observations follow block-diagonal correlation structures. Therefore, we provided ARL comparisons for observations follow a block-diagonal correlation structures, which presented in Table 5 . Where μ 3 = (0, 0, 0, 0) and Table 5 shows the proposed methods performs relatively better. In addition, the proposed control chart based on ranks of data is a nonparametric method without assuming normal or Poisson distribution for the data. To investigate the performance of the proposed method for Poisson data, we conducted an additional simulation study under multivariate Poisson distribution. Results in Table 6 showed that the proposed methods (EWMA 1 ) still had a better performance in terms of the OC ARL and RMI. In addition, we also provide the computing time of the EWMA 1 and EWMA 2 control charts. From Fig. 1 , EWMA 1 has relatively shorter computing time compared to that of EWMA 2 . Therefore, the proposed EWMA control chart is chosen, which is based on rank methods, for monitoring in this paper. Data source. That is the case, with the Japanese influenza data 19 , which cover 6 regions in Japan. These regions include Gunma, Chiba, Tokyo, Ishikawa, Nagano, and Osaka. Influenza data analysis is a very important issue today 20, 21 . Simultaneous monitoring of flu break-outs in multiple regions is an important topic in epidemiology. Influenza is an acute contagious disease caused by a virus 19 . The Japanese influenza data are used to illustrate the proposed control chart. Time-series data of the weekly incidence of influenza in Japan are used from January 2000 through December 2011. To evaluate the incidence data (see "Influenza Dataset" in Supplementary Information), we conduct spectral analysis, which is useful for investigating the periodicities of shorter time series, such as that of the incidence data used in the present study. www.nature.com/scientificreports www.nature.com/scientificreports/ The Japanese influenza data are presented in Fig. 2 . A quantile-quantile (Q-Q) plot of each region that includes 782 historical observations is presented in Fig. 3 . Figure 3 suggests that the normality assumption for the influenza data is invalid. The correlation of six regions as shown in Fig. 4 , for a total of C 6 2 = 15 lines. Figure 4 shows that the cross-correlation is not stable. Therefore, we update the covariance matrix with the arrival of new observations. It should be noted that the covariance matrix Σ is updated, as presented in section 2.2. In this section, a multivariate control chart is used to monitor the incidence of influenza in six regions which may have a certain correlation. Ignoring the correlation and using several univariate charts could lead to biased conclusions. For example, the univariate chart statistic may result in unnecessarily frequent www.nature.com/scientificreports www.nature.com/scientificreports/ out-of-control signals when the process is actually in control and may not detect the change when the process becomes out of control 3 . In the past few decades, many researchers have studied spectral analysis 22 . In addition to the obvious annual cycle of influenza epidemics, the longer-term incidence patterns are important for interpreting the mechanism of influenza epidemics. The method proposed by Sawada et al. 23 is a combination of spectral analysis and non-linear least squares fitting (LSF) for fitting analysis. Spectral analysis is a useful tool to investigate the periodicities of a short time series, and the formulations of the LSF curve are related to the research of Sawada et al. Spectral analysis is used identify the interepidemic period of influenza epidemics in Japan (see "Computing Code" in Supplementary Information). Based on spectral analysis, the trend of the incidence data is determined. The procedure comprises the following 3 steps. In step I, the influenza data (standardized datasets) are preprocessed. In step II, the temporal behavior of the interepidemic period is investigated. Then, LSF is used for the fitting analysis. This trend is then removed by subtracting the LSF curve from the data, thereby yielding the residual time-series data. In step III, the obtained residual time-series datasets are analysed. www.nature.com/scientificreports www.nature.com/scientificreports/ The vertical coordinates of Fig. 5 represents the power spectral density (PSD). Figure 5 indicates that the numbers of the maximum entropy method (MEM) spectral periods. From Fig. 5 and the processed data, we find that the power has a large magnitude at a frequency of 0.035 (1/week), and there is a second peak at a frequency of 0.019 (1/week). A large magnitude indicates that a large portion of the amplitude of the incidence data is expressed as a wave that repeats itself every year. Spectral analysis has enabled us to identify multiple periodicities for the interepidemic period of influenza epidemics (1-and 0.5-year periods). The residual time-series data are relevant. For residuals data, Table 7 presents the results of Shapiro-Wilk test and Kolmogorov-Smirnov test for normality. The p-values are smaller than 0.05, indicating that the data are non-normally distributed. Therefore, a nonparametric control chart could be more appropriate than those based on normality assumption. Moreover, a first order autoregressive model (AR (1)) is used to analyze the sequence correlation. Table 8 shows that sequences are highly correlated. Thus, the first order difference is employed to reduce the sequence correlation (see results in Table 9 ). Then the differential data can be used to illustrate the proposed method. The EWMA 1 control chart of the residual data series is presented in Fig. 6 . Figure 6 shows that EWMA statis- We provide the performance of EWMA 2 by using Japanese influenza data (Fig. 7) . It can be observed that the EWMA 2 chart shows an inconsistent trend with the result in practice (the charting statistics indicate that the six regions are almost at the epidemic level after 32 cases). This may be caused by the constant covariance setting in EWMA 2 . Hence, updating the covariance between the six regions could be important in correctly detecting an epidemic of influenza. www.nature.com/scientificreports www.nature.com/scientificreports/ We also presented six single univariate control charts for Japanese influenza data in Fig. 8 . The univariate chart statistic gave unnecessarily frequent out-of-control signals when the process is actually in control. Specifically, the first out-of-control signal of six regions occurred approximately at the 30th case, the 61th case, the 42th case, the 24th case, the 27th case, and the 17th case, respectively. However the multivariate chart may suggest a in-control state, indicating that ignoring the correlation between regions in biosurveillance may give an unexpected high rate of false alarm. This paper provides a new EWMA control chart based on rank methods for a multivariate process. The performance of an EWMA control chart based on rank methods and Mahalanobis depth are compared. The EWMA control chart based on rank methods has a relatively better performance for detecting small shifts. Finally, Japanese influenza data are also provided to illustrate the proposed control chart. Spectral analysis is first conducted to investigate the periodicities of shorter time series, and then non-linear least squares fitting is used for fitting analysis. The residual data series are obtained, and the residual data series are monitored. The Japanese influenza data example shows that the proposed control chart has relatively better performance for detecting process changes. The datasets analyzed during the current study are available from the corresponding author on reasonable request. Identifica-tion of hot and cold spots in genome of Mycobacterium tuberculosis using Multivariate quality control-illustrated by air testing of sample bombsights A multivariate exponentially weighted moving average control chart Change-point detection of mean vector or covariance matrix shifts using multivariate individual observations Multivariate nonparametric control chart with variable sampling interval A multivariate sign EWMA control chart A Sequential Rank-Based Nonparametric Adaptive EWMA Control Chart Adaptive phase II nonparametric EWMA control chart with variable sampling interval. Quality and Reliability Engineering International Dual Nonparametric CUSUM Control Chart Based on Ranks. Communica-tions in Statistics-Simulation and Computation A rank-based multivariate CUSUM procedure Multivariate quality control based on regression-adjusted variables Multivariate Statistical Process Control Using LASSO A Robust Multivariate EWMA Control Chart for Detecting Sparse Mean Shifts Monitoring change in spatial patterns of disease: comparing univariate and multivariate cumulative sum approaches Multivariate exponentially weighted moving average chart for monitoring patients progress after cardiac surgery Exponentially weighted moving average control schemes: properties and enhancements Nonparametric Control Charts: An Overview and Some Results A reference-free cuscore chart for dynamic mean change detection and a unified framework for charting performance comparison Time Series Analysis of Incidence Data of Inuenza in Japan Comparing the similarity and difference of three inuenza surveillance systems in China Simultaneous detection of eight avian inuenza A virus subtypes by multiplex reverse transcription-PCR using a GeXP analyser Maximum entropy spectral analysis of time-series data from combustion MHD lasma New technique for time series analysis combining the maximum entropy method and non-linear least squares method: its value in heart rate variability analysis Liu Liu and Jin Yue designed the study and performed the research; Xin Lai provided the data; Xin Lai, Jianping Huang and Jian Zhang discussed the experiment and the related issues in data analysis parts; Liu Liu and Jin Yue wrote the manuscript. All authors reviewed the manuscript. The authors declare no competing interests. Supplementary information is available for this paper at https://doi.org/10.1038/s41598-019-53908-6.Correspondence and requests for materials should be addressed to X.L.Reprints and permissions information is available at www.nature.com/reprints.Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.