Combining Multiple Signals for Biosurveillance

Andrew W. Moore, Brigham Anderson, Kaustav Das, and Weng-Keen Wong
Handbook of Biosurveillance, 2007. DOI: 10.1016/b978-012369378-5/50017-x

This chapter begins with two simple illustrations of the power of combining multiple data sources for surveillance. We then survey four representative multivariate approaches. The first two (multiple regression and the Hotelling T-squared test) are conventional, time-honored statistical approaches. Next, we describe how a famous probabilistic approach called hidden Markov models can be used for outbreak diagnosis from many data streams. Finally, we discuss WSARE, a system that searches for anomalous subsets of multivariate records.

Multiple independent sources of information can increase both sensitivity and specificity. Let us begin with a drastically oversimplified example. Suppose sensor A has a daily 90% chance of signaling an attack (event SIGA) if one occurs, and a 1% chance of signaling an attack when there is none. Suppose sensor B monitors an independent data source and has a daily 90% chance of signaling an attack (event SIGB) if one occurs, and a 1% chance of signaling an attack when there is none. Writing ATT as the event of an attack, we have:

P(SIGA | ¬ATT) = 0.01, P(SIGA | ATT) = 0.9
P(SIGB | ¬ATT) = 0.01, P(SIGB | ATT) = 0.9

There is then a 99% chance that at least one detector will signal if there is an attack, and only a probability of 1 in 10,000 that both detectors will signal if there is no attack. Thus, in many situations (both the [SIGA and SIGB] case and the [¬SIGA and ¬SIGB] case), the operational decision is much clearer. Even in the case of inconsistent signals, the analysis task can be better informed.

Another compelling example concerns time series analysis. Figure 15.1 shows two time series, for daily sales of two fictional products in a fictional city. Apart from a general upward trend, no serious anomalies stand out. If, however, we look at the same data in a different way, February 20 stands out as somewhat anomalous. The new view is a scatterplot: for each date, it plots one data point, with the x-coordinate denoting sales of Product A and the y-coordinate denoting sales of Product B. There is a general correlation in sales, but February 20 is atypical because B sales are high once A's sales are taken into account. In this chapter, we briefly survey methods that can notice effects revealed only by inspecting more than one time series at a time.

Multiple regression is a statistical technique that can be used to analyze the relationship between a single dependent variable and several independent variables. The objective of multiple regression analysis is to use the independent variables, whose values are known, to predict the value of the single dependent variable. Each predictor value is weighted, the weights denoting its relative contribution to the overall prediction:

$$Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \qquad \text{(Eq. 1)}$$

Here Y is the dependent variable, and $X_1, \ldots, X_n$ are the n independent variables. In calculating the weights $a, b_1, \ldots, b_n$, regression analysis ensures maximal prediction of the dependent variable from the set of independent variables. This is usually done by least squares estimation. This approach can be applied to analyze multivariate time series data when one of the variables depends on a set of other variables: we model the dependent variable Y on the set of independent variables, and at any time instant when we are given the values of the independent variables, we can predict the value of Y from Eq. 1.
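As a concrete illustration of Eq. 1, the sketch below estimates the intercept and weights by least squares on synthetic data and then predicts Y for a new day. This is a minimal sketch: the data, variable names, and true coefficients are invented for illustration and are not values from the chapter.

```python
import numpy as np

# Synthetic illustration of Eq. 1: predict one signal Y from two others.
rng = np.random.default_rng(0)
n_days = 100
X = rng.poisson(lam=[50.0, 30.0], size=(n_days, 2)).astype(float)
noise = rng.normal(scale=2.0, size=n_days)
Y = 10 + 0.5 * X[:, 0] + 1.2 * X[:, 1] + noise   # true a=10, b1=0.5, b2=1.2

# Least squares estimation of the intercept a and the weights b1, b2.
design = np.column_stack([np.ones(n_days), X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("estimated a, b1, b2:", coef.round(2))

# Predict Y for a new day's values of the independent variables (Eq. 1).
x_new = np.array([1.0, 55.0, 28.0])              # [intercept, X1, X2]
print("predicted Y:", x_new @ coef)
```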
In time series analysis, it is also possible to do regression against a set of past values of the variables. This is known as autoregression (AR). Let us consider n variables, with a time series corresponding to each variable. At time t, the vector $Z_t$ represents the values of the n variables. The general autoregressive model of order p assumes that $Z_t$ can be represented as

$$Z_t = A_1 Z_{t-1} + A_2 Z_{t-2} + \cdots + A_p Z_{t-p} + E_t \qquad \text{(Eq. 2)}$$

where each $A_i$ (an n x n matrix) is an autoregression coefficient and $Z_t$ is the column vector of length n denoting the values of the time series variables at time t. The order p of the filter is generally much less than the length of the series. The noise term, or residual, $E_t$ is almost always assumed to be Gaussian white noise. In the more general autoregressive moving average (ARMA) model, past values of the noise terms also enter the model:

$$Z_t = \sum_{i=1}^{p} A_i Z_{t-i} + E_t + \sum_{j=1}^{p} B_j E_{t-j} \qquad \text{(Eq. 3)}$$

Here $B_1, \ldots, B_p$ (each an n x n matrix) are the moving average (MA) coefficients. These coefficients can be determined using the standard Box-Jenkins methodology (Box et al., 1994). AR and ARMA provide a nice way to predict what should be happening to all of the time series simultaneously, and they allow one time series to use other series to increase its accuracy.

The Hotelling T² test provides a statistical test of the deviation of the recent mean of a set of signals from their expected values, accounting for known correlation between the signals. Let us consider a data set with p variables and n samples from the recent past. Let $X_i$, a vector of length p, denote the ith sample. The Hotelling T² statistic is defined as

$$T^2 = n(\bar{X} - \mu_0)^T S^{-1} (\bar{X} - \mu_0) \qquad \text{(Eq. 4)}$$

where $\bar{X}$ is the sample mean vector of the n recent samples, S is the sample variance-covariance matrix, and $\mu_0$ is the expected mean vector. For example, in Figure 15.2, February 20 would be the data point with the largest T² value. It is well known that T² is distributed as $\frac{p(n-1)}{n-p} F(p, n-p)$, where F represents the Fisher F distribution.

The Hotelling T² test can be applied to a multivariate time series to detect whether a new data point deviates substantially from its expected value. The expected value $\mu_0$ and the covariance matrix S can be computed from historical data; for example, we can take the mean and covariance of all the data seen so far. A more sophisticated approach, which accounts for gradual drift in the process, is to model the time series (using AR or MA, as mentioned before) and compute $\mu_0$ and S from this model. At any time t, we consider the past n values of the p variables. Let $\bar{X}$, a vector of size p, be the sample mean of these n values, and let $\mu_0$ be the expected mean vector. We can test the null hypothesis that $\bar{X} = \mu_0$ by computing T² (from Eq. 4) and comparing it with $\frac{p(n-1)}{n-p} F_\alpha(p, n-p)$ for a suitably chosen α. If the T² statistic exceeds this value, we can reject the null hypothesis with confidence (1 - α) and signal an alarm. Several biosurveillance research groups have made significant successful use of Hotelling methods; good examples, within the ESSENCE framework, are described in Burkom et al. (2004, 2005).
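A minimal sketch of this alarm test in Python, assuming $\mu_0$ and S are estimated from a year of history; the two-signal scenario (echoing the product A/B example) and all numbers are illustrative assumptions:

```python
import numpy as np
from scipy.stats import f

def hotelling_t2_alarm(recent, mu0, S, alpha=0.01):
    """Signal an alarm if the recent sample mean deviates from mu0.

    recent : (n, p) array of the n most recent observations of p signals.
    mu0    : (p,) expected mean vector, e.g., estimated from history.
    S      : (p, p) covariance matrix, e.g., estimated from history.
    """
    n, p = recent.shape
    diff = recent.mean(axis=0) - mu0
    t2 = n * diff @ np.linalg.inv(S) @ diff                    # Eq. 4
    threshold = (p * (n - 1) / (n - p)) * f.ppf(1 - alpha, p, n - p)
    return t2, threshold, t2 > threshold

# Two correlated signals; the recent week has elevated B-like sales.
rng = np.random.default_rng(1)
hist = rng.multivariate_normal([100, 80], [[25, 20], [20, 25]], size=365)
mu0, S = hist.mean(axis=0), np.cov(hist, rowvar=False)
recent = rng.multivariate_normal([100, 95], [[25, 20], [20, 25]], size=7)
print(hotelling_t2_alarm(recent, mu0, S))
```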
Hidden Markov models (HMMs) are fundamental tools in fields such as bioinformatics, signal processing, and machine learning. In biosurveillance, Le Strat and Carrat (1999) used HMMs to monitor influenza-like illnesses and poliomyelitis using a Gaussian observation model. Rath et al. (2003) improved their model by, among other things, replacing the Gaussian observation model with an exponential distribution. Other applications include Cooper and Lipsitch (2004), who modeled hospital infections with HMMs. Madigan (2005) reviewed the literature and discussed issues such as model selection and random observation-time HMMs. He also noted the excellent fit between the capabilities of HMMs and the requirements of multivariate time series.

Figure 15.3 shows an HMM. Arrows indicate causality, shaded variables are observed, and the unshaded state variables are unobserved. At time t, the underlying disease state is $S_t$, which has N possible values. The observation $O_t$ is a vector of K observations, which can assume count values such as specific over-the-counter (OTC) sales, ED visits, and school absenteeism. The third variable type is for environmental influences, $E_t$, on the observations. The environment variable is a vector of categorical variables and has no effect on the disease state.

The most common epidemiologic HMM is binary; it has two states: disease and no-disease. It is, however, quite easy to track the stages of multiple diseases by assigning one state to every possible combination of diseases. Thus, if we are tracking four diseases with three stages each, the HMM will have $3^4 + 1 = 82$ states. The number of states is much reduced if we assume that especially rare diseases will not co-occur.

The word "Markov" in "hidden Markov model" indicates the assumption that the state at time t depends only on the state at time t-1. The probability of transitioning from one state to another is thus governed by the transition model $P(S_{t+1} | S_t)$, an N x N matrix derived analytically from estimates of prior probabilities and estimates of disease stage lengths.

At a high level, the observation model describes how we expect the signals to be affected by the disease state and environmental conditions. For example, if the observation vector includes cough/cold sales as one of its components, then the observation model might predict sales of one cough product per household per six months under normal conditions, but one per household per two months during an influenza outbreak. The counts in the observation vectors are driven by more than the disease state. Additional observed information, such as day of week, weather, promotions (in the case of retail data), vacations (for school absenteeism), and other external factors, is expected to influence the observed counts. For example, the HMM could expect, with high probability, to observe zero counts for school absenteeism on public holidays, and a 20% increase in OTC sales at stores with promotional activity.

The following simulation illustrates the benefits of fusing multiple data streams; for details of such an HMM approach, see the references above. The data in the first three panels of Figure 15.4 were generated from a vastly oversimplified HMM with three possible disease states: "none," "influenza," and "allergy." The simulated data streams are ED visits, school absenteeism, and OTC antihistamine sales. In this example, the ED visits have a strong weekend effect, while weekend data are missing for school absenteeism. "Influenza" is assumed to double the rates of all three data streams; "allergy" doubles only OTC antihistamines. The population is in state "allergy" during time steps 10-20, and in "influenza" during time steps 30-40.
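Before turning to the results, the following sketch shows the forward-filtering computation such an HMM performs: each day, the belief over disease states is advanced through the transition model and then reweighted by the likelihood of the observed counts. The three states mirror the simulation above, but the transition matrix, the Poisson observation model, and all rates are illustrative assumptions of ours, not the chapter's parameters.

```python
import numpy as np
from scipy.stats import poisson

states = ["none", "influenza", "allergy"]
# Illustrative transition model P(S_{t+1}|S_t): states are "sticky".
T = np.array([[0.96, 0.02, 0.02],
              [0.10, 0.90, 0.00],
              [0.10, 0.00, 0.90]])
# Illustrative mean daily counts for (ED visits, school absenteeism,
# OTC antihistamine sales) in each state: "influenza" doubles all three
# streams, "allergy" doubles only OTC antihistamine sales.
rates = np.array([[20.0,  50.0, 30.0],
                  [40.0, 100.0, 60.0],
                  [20.0,  50.0, 60.0]])

def filter_states(counts):
    """Return P(S_t | counts up to day t) via the forward algorithm.

    counts : (days, 3) array; np.nan marks a missing stream (e.g., weekend
    school absenteeism), which simply contributes no evidence that day.
    """
    belief = np.array([1.0, 0.0, 0.0])         # start in state "none"
    history = []
    for obs in counts:
        belief = belief @ T                    # advance through transitions
        for k, c in enumerate(obs):            # weigh each observed stream
            if not np.isnan(c):
                belief *= poisson.pmf(c, rates[:, k])
        belief /= belief.sum()                 # normalize to probabilities
        history.append(belief.copy())
    return np.array(history)

counts = np.array([[21, 48, 29], [22, np.nan, 31], [39, 95, 62], [41, 102, 58]])
print(filter_states(counts).round(3))          # last rows favor "influenza"
```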
A three-state HMM was applied to the data, and the inference results are shown in the bottom panel of Figure 15.4. Inference involves computing the disease state probabilities (which are not in the observed data) from the counts and environmental factors (which are in the observed data). Note that the HMM is not told when allergy is present.

[Figure 15.4: Synthetic time series and diagnostic probabilities described in the text.]

In Figure 15.4, note that the model was successful in explaining the jump in OTC sales, and that the missing absenteeism data did not noticeably impair inference. Individually, these data streams would be inconclusive, or worse, misleading, but the proper statistical model effectively fuses them into a robust and highly informative resource. This example is very simplified, but it illustrates several of the components of a realistic multivariate system.

The what's strange about recent events (WSARE) algorithm (Wong et al., 2002; Wong and Moore, 2002) is a rule-based anomaly pattern detector that operates on discrete, multidimensional data sets with a temporal component. The algorithm compares recent data against a baseline distribution with the aim of finding rules that summarize significant patterns of anomalies. Each rule is made up of components of the form $X_i = V_i^j$, where $X_i$ is the ith feature and $V_i^j$ is the jth value of that feature. Multiple components are joined together by a logical AND; for example, a two-component rule might be Gender = Male AND Syndrome = GI. These rules should not be interpreted as rules from a logic-based system in which the rules have an antecedent and a consequent. Rather, they can be thought of as SQL SELECT queries, because they identify the subset of the data having records with attributes that match the components of the rule. WSARE finds the subsets whose proportions have changed the most between recent data and the baseline.

We will overview versions 2.0 and 3.0 of the WSARE algorithm. These two algorithms differ only in how they create the baseline distribution; all other steps in the WSARE framework remain identical. WSARE 2.0 uses raw historical data from selected days as the baseline, while WSARE 3.0 models the baseline distribution using a Bayesian network.

The basic question asked by anomaly detection systems is whether anything strange has occurred in recent events. This question requires defining what it means to be recent and what it means to be strange. The WSARE algorithm (Wong and Moore, 2002; Wong et al., 2002) is one example of an algorithm that attempts to do this. It defines all patient records falling on the current day under evaluation to be recent events. Note that this definition of recent is not restrictive: the approach is fully general, and "recent" can be defined to include all events within some other time period, such as the last six hours.

In order to define an anomaly, we need to establish the concept of something being normal. In WSARE version 2.0, baseline behavior is assumed to be captured by raw historical data from the same day of week, in order to avoid environmental effects such as weekend versus weekday differences in the number of ED cases. The baseline period must be chosen from a time period similar to the current day: close enough to the current day to capture any seasonal or recent trends, but also sufficiently distant from it. This distance is required in case an outbreak has already begun but remains undetected; if the baseline period is too close to the current day, the baseline will quickly incorporate the outbreak cases as time progresses. In the description of WSARE 2.0 below, we assume that baseline behavior is captured by records that are 35, 42, 49, and 56 days prior to the day under consideration. We emphasize that this baseline period is only an example; it can easily be modified to another time period without major changes to the algorithm. As an illustration, suppose the current day is Tuesday, December 30, 2003. The baseline used by WSARE 2.0 will then be November 4, 11, 18, and 25 of 2003, as seen in Figure 15.5. These dates are all Tuesdays, in order to avoid day-of-week variation.

[Figure 15.5: The baseline for WSARE 2.0 when the current day is December 30, 2003.]
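A minimal sketch of this baseline construction, assuming case records sit in a pandas DataFrame with a 'date' column; the column name and helper function are our own:

```python
import pandas as pd

def wsare2_baseline(records, current_day, weeks_back=(5, 6, 7, 8)):
    """Split records into recent and same-weekday historical baseline sets.

    records     : DataFrame with a 'date' column (one row per ED case).
    current_day : the day under evaluation.
    weeks_back  : 5-8 weeks corresponds to 35, 42, 49, and 56 days prior.
    """
    current_day = pd.Timestamp(current_day)
    baseline_days = [current_day - pd.Timedelta(weeks=w) for w in weeks_back]
    recent = records[records["date"] == current_day]
    baseline = records[records["date"].isin(baseline_days)]
    return recent, baseline

# The baseline for Tuesday, December 30, 2003 is the four Tuesdays of
# November 4, 11, 18, and 25 (cf. Figure 15.5):
print([str((pd.Timestamp("2003-12-30") - pd.Timedelta(weeks=w)).date())
       for w in (5, 6, 7, 8)])

records = pd.DataFrame({
    "date": pd.to_datetime(["2003-12-30", "2003-11-25", "2003-10-01"]),
    "syndrome": ["GI", "GI", "Resp"],
})
recent, baseline = wsare2_baseline(records, "2003-12-30")
print(len(recent), len(baseline))   # 1 recent record, 1 baseline record
```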
In a later section, we will illustrate how version 3.0 of WSARE automatically generates the baseline using a Bayesian network. We will refer to the number of events that fit a certain rule on the current day as C_recent; similarly, the number of cases matching the same rule from the baseline period will be called C_baseline.

To illustrate the algorithm, suppose we have a large database of one million ED records collected over a two-year span; this database contains roughly 1370 records a day. Suppose we treat all records within the last 24 hours as "recent" events. In addition, we build a baseline data set out of all cases from exactly 35, 42, 49, and 56 days prior to the current day. We then combine the recent and baseline data to form a record subset called DB, which will have approximately 5000 records.

The algorithm proceeds as follows. We first consider all possible one-component rules. For every possible feature-value combination, we obtain the counts C_recent and C_baseline from the data set DB. These counts form a two-by-two contingency table for each rule, with one row for records matching the rule and one for the rest, and one column each for the recent and baseline periods. Each contingency table is scored according to whether there is a significant increase in the fraction of records matching the rule in the recent data compared with the historical data. There are many possible scoring functions; these are discussed in Wong et al. (2002). By default, WSARE uses the Fisher exact test, described in the same paper.
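As a sketch of this default scoring step, the following uses scipy's Fisher exact test on the 2x2 table just described; the rule and all counts are hypothetical:

```python
from scipy.stats import fisher_exact

def score_rule(c_recent, n_recent, c_baseline, n_baseline):
    """Score one rule's 2x2 contingency table with the Fisher exact test.

    c_recent / c_baseline : counts of records matching the rule.
    n_recent / n_baseline : total record counts in each period.
    Returns the one-sided p-value for an *increase* in the recent fraction.
    """
    table = [[c_recent, c_baseline],
             [n_recent - c_recent, n_baseline - c_baseline]]
    _, p_value = fisher_exact(table, alternative="greater")
    return p_value

# A hypothetical one-component rule, e.g., AgeGroup = Over60: 48 of 1370
# recent records match it, versus 80 of 5480 baseline records.
print(score_rule(48, 1370, 80, 5480))
```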
WSARE searches over all possible rules that can be made out of attributes of the database records. In a typical ED data set, the rules would thus include both general and specific examples, for instance: HomeLocation=NW, Gender=Male, AgeGroup=Over60, Syndrome=GI, InternalBleeding=True, HomeZip=12345. There are many anomalous scenarios that would not be detected by single-component rules. For example, there might be a slight increase in pediatric cases throughout the city and a slight increase in cases from zip code 54321, but a dramatic increase in pediatric cases from 54321. For this reason, WSARE also searches for rules made up of multiple components.

When reporting the significance of the highest scoring rule, it is essential that we take into account the intensity of WSARE's search for anomalies. Even if surveillance data were generated randomly, entirely from a null distribution, we can expect that the best rule would be alarmingly high-scoring if we had searched over 1000 possible rules. To illustrate this point, suppose we follow the standard practice of rejecting the null hypothesis when the p-value is less than α, where α = 0.05. In the case of a single hypothesis test, the probability of a false positive under the null hypothesis would be α, which equals 0.05. On the other hand, if we perform 1000 hypothesis tests, one for each possible rule under consideration, then the probability of a false positive could be as bad as $1 - (1 - 0.05)^{1000} \approx 1$, which is much greater than 0.05 (Miller et al., 2005). Thus, if the algorithm returns a large score, we cannot accept it at face value without an adjustment for the multiple hypothesis tests we performed. This problem could be addressed using a Bonferroni correction (Bonferroni, 1936), but that approach would be unnecessarily conservative. Instead, we turn to a randomization test in which the date and the other ED case features are assumed to be independent. In this test, the case features in the data set DB remain the same for each record, but the date field is shuffled between records from the current day and records from five to eight weeks ago. The full method for the randomization test is, in outline:

Let UCS = the uncompensated score, i.e., the score of the best-scoring rule BR, as defined above.
For j = 1 to 1000:
    Let DB_j = DB with the date field shuffled randomly between records.
    Let UCS_j = the score of the best-scoring rule found on DB_j.
CPV = (number of iterations j for which UCS_j is at least as significant as UCS) / 1000.

CPV is an estimate of the chance that we would have seen an uncompensated score as large as UCS if in fact there were no relationship between date and case features. Note that we do not use the uncompensated score UCS after the randomization test; instead, the compensated p-value CPV is used to determine the level of anomaly.
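A sketch of this randomization test in Python. The rule representation (dicts of AND-ed feature-value components), the is_recent labeling of the shuffled date field, and all names are our own simplifications of the procedure described above:

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

def best_rule_pvalue(db, rules):
    """Return the smallest Fisher-exact p-value over a set of candidate rules.

    db    : DataFrame with an 'is_recent' boolean column plus case features.
    rules : list of {feature: value} dicts; each dict is one AND-ed rule.
    """
    best = 1.0
    recent, base = db[db["is_recent"]], db[~db["is_recent"]]
    for rule in rules:
        m_recent, m_base = recent, base
        for feat, val in rule.items():          # apply each AND component
            m_recent = m_recent[m_recent[feat] == val]
            m_base = m_base[m_base[feat] == val]
        table = [[len(m_recent), len(m_base)],
                 [len(recent) - len(m_recent), len(base) - len(m_base)]]
        _, p = fisher_exact(table, alternative="greater")
        best = min(best, p)
    return best

def compensated_pvalue(db, rules, n_shuffles=1000, seed=0):
    """Estimate CPV: how often a shuffled DB scores at least as well as UCS."""
    rng = np.random.default_rng(seed)
    ucs = best_rule_pvalue(db, rules)
    count = 0
    for _ in range(n_shuffles):
        shuffled = db.copy()
        # Shuffle the recent/baseline labels, keeping case features fixed.
        shuffled["is_recent"] = rng.permutation(db["is_recent"].values)
        if best_rule_pvalue(shuffled, rules) <= ucs:
            count += 1
    return count / n_shuffles

# Example: two candidate one-component rules on a toy DB.
db = pd.DataFrame({
    "is_recent": [True] * 20 + [False] * 80,
    "AgeGroup": ["Over60"] * 12 + ["Child"] * 8
              + ["Over60"] * 20 + ["Child"] * 60,
})
rules = [{"AgeGroup": "Over60"}, {"AgeGroup": "Child"}]
print(compensated_pvalue(db, rules, n_shuffles=200))
```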
Many detection algorithms (Goldenberg et al., 2002; Zhang et al., 2003; Fawcett and Provost, 1997) assume that the observed data consist of cases from background activity, which we will refer to as the baseline, plus any cases from irregular behavior. Under this assumption, detection algorithms operate by subtracting the baseline from recent data and raising an alarm if the deviations from the baseline are significant. The challenge facing all such systems is to estimate the baseline distribution from historical data. In general, determining this distribution is extremely difficult because of the different trends present in surveillance data. Seasonal variations in weather and temperature can dramatically alter the distribution of surveillance data. For example, influenza season typically occurs during mid-winter, resulting in an increase in ED cases involving respiratory problems. Disease outbreak detectors intended to detect epidemics such as severe acute respiratory syndrome (SARS), West Nile virus, and anthrax are not interested in detecting the onset of flu season and would be confused by influenza. Day-of-week variations make up another periodic trend.

In WSARE 2.0, we took the baseline distribution to be raw data obtained from selected historical days. WSARE 3.0 instead learns the baseline distribution from historical data, in order to better account for the observed environmental factors of the current day (e.g., public holiday, day of week, weather). WSARE 3.0 takes all records prior to the past 24 hours and builds a Bayesian network from this subset. Bayesian networks are overviewed in Chapters 13 and 18, and an introductory tutorial is available at www.cs.cmu.edu/~awm/tutorials/bayesnet.html. During structure learning, we differentiate between environmental attributes, which are features such as the season and the day of week that cause trends in the data, and response attributes, which are the remaining features. The environmental attributes are specified by the user based on the user's knowledge of the problem domain.

Our evaluation examines algorithm performance using an AMOC curve (Fawcett and Provost, 1997); AMOC curves are described in more detail in Chapter 20. These results were generated from 100 synthetic data sets described in the WSARE papers and available at www.cs.cmu.edu/~awm/wsare-data. The AMOC curve in Figure 15.6 plots detection time versus false positives over alarm thresholds ranging from 0 to 0.2 in 0.001 increments. Lower alarm thresholds yield fewer false positives and longer detection times, while the converse is true for higher alarm thresholds. The best possible detection time is one day, as shown by the dotted line at the bottom of the graph. We add a one-day delay to all detection times to simulate reality, where current data are only available after a 24-hour delay. Any alert occurring before the start of the simulated anthrax attack is treated as a false positive. Detection time is calculated as the time of the first alert raised after the release date; if no alerts are raised after the release, the detection time is set to 14 days.
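A minimal sketch of computing one point of such an AMOC curve from a series of daily alarm scores, following the conventions just described (one-day reporting delay, 14-day cap when no alert fires after the release); the scores and outbreak timing are synthetic:

```python
import numpy as np

def amoc_point(daily_pvalues, outbreak_start, threshold, cap=14, delay=1):
    """Compute (false positives, detection time) for one alarm threshold.

    daily_pvalues : one alarm score (e.g., CPV) per day.
    outbreak_start: index of the first day of the simulated outbreak.
    threshold     : alarm if the day's score falls below this value.
    """
    alarms = np.flatnonzero(np.asarray(daily_pvalues) < threshold)
    false_positives = np.sum(alarms < outbreak_start)
    hits = alarms[alarms >= outbreak_start]
    detection = (hits[0] - outbreak_start + delay) if len(hits) else cap
    return false_positives, detection

# Sweep thresholds from 0 to 0.2 in 0.001 increments to trace the curve.
rng = np.random.default_rng(2)
pvals = rng.uniform(size=60)
pvals[40:45] = [0.3, 0.04, 0.002, 0.001, 0.0005]   # outbreak begins day 40
curve = [amoc_point(pvals, 40, th) for th in np.arange(0.0, 0.2, 0.001)]
print(curve[:5])
```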
The Israel Center for Disease Control evaluated WSARE 3.0 retrospectively, using an unusual outbreak of influenza type B that occurred in an elementary school in central Israel (Kaufman et al., 2004). WSARE 3.0 was applied to patient visits to community clinics from May 24, 2004 to June 11, 2004. The attributes in this data set included the visit date, area code, ICD-9 code, age category, and day of week; the day of week was used as the only environmental attribute. WSARE 3.0 reported two rules with p-values of 0.002 and five other rules with p-values below 0.0001. Two of the five anomalous patterns with p-values below 0.0001 corresponded to the influenza outbreak in the data. The rules that characterized these two patterns consisted of the same three attributes (ICD-9 code, area code, and age category), indicating that an anomalous pattern had been found involving children aged 6 to 14 with viral symptoms within a specific geographic area. WSARE 3.0 successfully detected the outbreak on the second day after its onset.

The WSARE algorithms approach the problem of early outbreak detection on multivariate surveillance data using two key components. In WSARE 2.0, the main component is an association rule search, which is used to find anomalous patterns between a recent data set and a baseline data set. The contribution of this rule search is best seen by considering the alternative approach of monitoring a univariate signal. If an attribute or combination of attributes is known to be an effective signal for the presence of a certain disease, then a univariate detector that monitors this signal will be an effective early warning system for that specific disease. However, if such a signal is not known a priori, the association rule search will determine which attributes are of interest. We intend WSARE to be a general-purpose safety net to be used in combination with a suite of specific disease detectors. The key to this safety net is to perform nonspecific disease detection and notice any unexpected patterns.

A number of other multivariate methods for syndromic surveillance have been proposed. Burkom et al. (2004, 2005) discuss many such methods, including multivariate versions of the exponentially weighted moving average and cumulative sum methods. These methods are compared with methods that combine the output of multiple univariate detectors, for example using a Bayesian network.

WSARE and additional biosurveillance software are available from www.autonlab.org. Example synthetic data sets are available from www.cs.cmu.edu/~awm/wsare-data.

References

Bonferroni, C. (1936). Pubblicazioni del R Istituto Superiore di ...
Box, G., et al. (1994). Time Series Analysis: Forecasting and Control.
Burkom, H., et al. (2005). Public health monitoring tools for multiple data streams.
Burkom, H., et al. (2004). The role of data aggregation in biosurveillance detection strategies with applications from ESSENCE.
Cooper, B., Lipsitch, M. (2004). The analysis of hospital infection data using hidden Markov models.
Cooper, G., et al. Bayesian biosurveillance of disease outbreaks.
Fawcett, T., Provost, F. (1997). Adaptive fraud detection.
Goldenberg, A., et al. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales.
Kaufman, Z., et al. (2004). Evaluation of syndromic surveillance for early detection of bioterrorism using a localized, summer outbreak of influenza.
Le Strat, Y., Carrat, F. (1999). Monitoring epidemiologic surveillance data using hidden Markov models.
Madigan, D. (2005). Bayesian data mining for health surveillance. In Spatial and Syndromic Surveillance for Public Health.
Miller, C., et al. (2005). Controlling the false discovery rate in astrophysical data analysis.
Moore, A., Wong, W.-K. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning.
Rath, T., et al. (2003). Automated detection of influenza epidemics with hidden Markov models.
Wong, W.-K. Data Mining for Early Disease Outbreak Detection.
Wong, W.-K., Moore, A. (2002). Efficient algorithms for non-parametric clustering with clutter.
Wong, W.-K., et al. (2002). Rule-based anomaly pattern detection for detecting disease outbreaks.
Wong, W.-K., et al. (2003). Bayesian network anomaly pattern detection for disease outbreaks.
Zhang, J., et al. (2003). Detection of outbreaks from time series data using wavelet transform.