title: Decision-Making Algorithms for Learning and Adaptation with Application to COVID-19 Data
authors: Marano, Stefano; Sayed, Ali H.
date: 2021-12-07
journal: Signal Processing
DOI: 10.1016/j.sigpro.2021.108426

This work focuses on the development of a new family of decision-making algorithms for adaptation and learning, which are specifically tailored to decision problems and are constructed by building up on first principles from decision theory. A key observation is that estimation and decision problems are structurally different and, therefore, algorithms that have proven successful for the former need not perform well when adjusted for the latter. Exploiting classical tools from quickest detection, we propose a tailored version of Page's test, referred to as the BLLR (barrier log-likelihood ratio) algorithm, and demonstrate its applicability to real data from the COVID-19 pandemic in Italy. The results illustrate the ability of the design tool to track the different phases of the outbreak.

With the increasing availability of streaming data, modern applications of statistical inference have been faced with new challenges, resulting in novel operative modalities. Traditionally, statistical inference has focused on estimation and decision problems based on observations coming from a homogeneous/stationary source; the task of the system is to learn from that environment. The challenge of modern systems is to infuse the learning task with new online adaptation capabilities, which are critical in the presence of changes in statistical conditions, drifting environmental characteristics, and time-varying data acquisition/storage/transmission modalities. As learning and adaptation are essentially contrasting requirements, it is clear that the data deluge has imposed an important change of perspective. In learning and adaptation contexts, estimation problems were considered first, which led to the adoption of the LMS (least-mean-square) algorithm for the adaptation step, because of its well-known adaptation properties. When dealing with decision problems, it appears natural to retain the LMS recursion as the basic engine, owing to its simplicity and its ability to track drifts in the state of nature. This trend is evident in distributed decision systems based on consensus strategies [1][2][3][4][5][6][7] or diffusion strategies [8][9][10][11][12][13][14][15][16][17][18][19][20]. Among the latter class, the most successful strategy is the ATC (adapt-then-combine) fusion rule. This strategy is conceived in the form of LMS iterates followed by a convex combination of states [15][16][17][18][19]. Thus, the state-of-the-art of modern decision systems is built on an estimation-based engine, which may represent a challenge for decision systems operating under the online learning and adaptation paradigm. The fundamental question we pose is whether the performance of LMS-based decision systems can be improved by replacing their LMS component with algorithms specifically designed for solving decision tasks. Such a modification has the potential of prompting new system architectures and design guidelines. In this paper we design alternative decision-making algorithms that are specifically tailored to decision problems, by building up on first principles from decision theory rather than relying directly on the LMS update.
Borrowing ideas from quickest detection, we introduce performance indexes that quantify the tradeoff between learning and adaptation, and derive closed-form analytical expressions for the system performance. Using these performance figures, we show that the proposed updating rule outperforms the one based on the LMS iteration. In this contribution, no network aspects are considered and the focus is on the operation of a single agent. In addition, we limit our study to the case of two possible states of nature, say $H_0$ and $H_1$, which are known to the decision maker. This means that at any time epoch, the observations collected by the agent are independently drawn from one of two distributions, but we do not know which, and this underlying distribution is allowed to change at any time according to arbitrary patterns of the kind $H_0 \to H_1 \to H_0 \to H_1 \to \cdots$.

The final part of this article is devoted to a real-world application of the proposed decision procedure. The COVID-19 pandemic due to the new SARS-CoV-2 virus is one of the most serious global crises of the last decades. One key aspect of counteracting this threat is to identify growing/shrinking pandemic phases in a timely manner. Our approach is that such a problem, after identifying the growing/shrinking phases with the statistical hypotheses $H_1$ and $H_0$, respectively, naturally fits into the framework of binary decisions under the learning and adaptation paradigm. We illustrate that the technique developed in this paper is useful in tracking different phases of the COVID-19 pandemic. As an example, the proposed tool is demonstrated on pandemic data from Italy, recorded from the beginning of the pandemic in late February 2020 to the end of July 2021.

Boldface symbols denote random scalars/vectors and normal font their realizations and deterministic quantities. For scalars, the time index (or algorithm iteration number) is enclosed in parentheses: thus, for instance, x(n) denotes the random scalar x at time n. Conversely, in the case of vectors, the time dependence is indicated by a subscript: for example, $u_i$ denotes the random vector u evaluated at time i. The superscript T denotes vector transposition. Statistical expectation, variance, and probability operators are denoted by E, V, and P, respectively. They are always computed under the hypothesis in force, $H_h$, and the pertinent subscript h = 0, 1 is usually appended to the operator symbol.

The remaining part of this article is organized as follows. Section 2 discusses the genesis of LMS in decision contexts. The proposed alternative to LMS is presented in Sec. 3 and its performance is investigated in Sec. 5 in terms of the criteria discussed in Sec. 4. Examples using synthetic data are given in Sec. 6, while an application to COVID-19 pandemic time-series is discussed in Sec. 7. Section 8 contains conclusive remarks.

Let us start by considering an estimation problem. Let $d \in \mathbb{R}$ be a zero-mean scalar random variable with variance $E d^2 > 0$, and $u \in \mathbb{R}^M$ a zero-mean random vector with positive-definite covariance matrix $E u u^T > 0$. The quantity d is unknown while u is observed. The goal is to solve the optimization problem $\min_w J(w)$, where $w \in \mathbb{R}^M$ is a weight vector and $J(w): \mathbb{R}^M \to \mathbb{R}$ represents a cost function that quantifies the penalty incurred when the unknown d is replaced by the linear transformation $u^T w$ of the observation.
One common choice is the quadratic cost function $J(w) = E(d - u^T w)^2$, in which case the solution $w^o$ is given by $w^o = (E u u^T)^{-1} E du$, and the linear least-mean-square estimator of d given u is $\hat d = u^T w^o$ [21, Th. 8.1, p. 142]. A recursive solution to the optimization problem $\min_w J(w)$ with quadratic cost function is provided by the steepest-descent algorithm: set $w_0$ equal to some initialization vector, and iterate as follows:

$w_i = w_{i-1} + \mu \big( E du - E u u^T w_{i-1} \big), \qquad i \ge 1,$  (1)

where the step-size $\mu > 0$ is sufficiently small (less than 2 divided by the largest eigenvalue of the matrix $E u u^T$), see [21, Th. 8.2, p. 147]. It can be shown that $E du - E u u^T w_{i-1} = -\nabla J(w_{i-1})$, which makes it possible to rewrite (1) in terms of the gradient vector $\nabla J(w_{i-1})$. The resulting expression is useful when alternative cost functions are used. What is especially relevant in the adaptive framework is the consideration that the quantities $E u u^T$ and $E du$ may not be known beforehand and are expected to vary over time. In these situations, assuming that we have access to streaming data in the form of a sequence of realizations $\{d(i), u_i\}_{i \ge 1}$ of d and u, a viable alternative to (1) is obtained if we drop the expectation signs and replace the random variables by their current realizations, yielding the following algorithm: set $w_0$ equal to some initial guess, and iterate

$w_i = w_{i-1} + \mu\, u_i \big[ d(i) - u_i^T w_{i-1} \big], \qquad i \ge 1,$  (2)

with a sufficiently small µ. This stochastic gradient approximation (so called because the true gradient is replaced by a noisy version thereof) is known as the LMS algorithm, see [21, Th. 10.1, p. 166]. The LMS algorithm learns the data statistics and at the same time is able to track statistical drifts, which are essential characteristics for the design of cognitive intelligent inference systems with learning and adaptation properties.

We now move from an estimation to a decision context. Suppose M = 1, namely $w_i = w(i)$ and $u_i = u(i)$ are scalars, and suppose also $u(i) = 1$ for all i. Assuming independent and identically distributed (IID) data $\{d(i)\}_{i \ge 1}$, formal substitution in (2) gives: $w(0) = 0$ and

$w(i) = w(i-1) + \mu \big[ d(i) - w(i-1) \big], \qquad i \ge 1.$  (3)

Note that the right-hand side of (3) is a convex combination: $w(i) = (1-\mu)\, w(i-1) + \mu\, d(i)$. Iterating (3), we get the output of the LMS algorithm in the form:

$w(i) = \mu \sum_{k=1}^{i} (1-\mu)^{i-k}\, d(k),$  (4)

and we have

$E w(i) = \big[ 1 - (1-\mu)^i \big]\, E d.$  (5)

From (5), we see that the output of the algorithm approximates Ed when the number i of iterations is sufficiently large and the step-size satisfies $\mu \ll 1$ (small step-sizes also reduce the fluctuations of w(i) around its mean). This property, along with the inherent adaptation ability, motivated the use of (3) in decision problems. Indeed, the algorithm formalized in (3) represents the basic building block for the development of adaptation and learning diffusion algorithms over networks faced with decision problems, which has been addressed in a series of papers [15][16][17][18][19]. In these works, the term "learning" alludes to a decision about the current state of nature, among an ensemble of possible states characterized by known (not learned online) data distributions. The same meaning of "learning" is adopted in the present paper. It is useful to mention that LMS has also been advocated in the context of continuous inspection schemes and related control charts: as seen in (4), LMS employs exponentially-scaled weights, which is exactly the idea behind geometric moving average control charts.
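To make the baseline concrete, the following minimal sketch (ours, not the authors' code; names are illustrative) implements the scalar LMS decision recursion (3)-(4) that the rest of the paper uses as the reference engine.

```python
import numpy as np

def lms_decision_statistic(d, mu):
    """Scalar LMS recursion (3): w(i) = (1 - mu) * w(i-1) + mu * d(i), w(0) = 0.

    d  : sequence of log-likelihood ratios d(i) computed from the data x(i)
    mu : step-size, 0 < mu < 1 (smaller mu -> slower adaptation, longer memory)
    Returns the trajectory w(1), ..., w(n).
    """
    w = 0.0
    traj = np.empty(len(d))
    for i, di in enumerate(d):
        w = (1.0 - mu) * w + mu * di   # convex combination of past state and new sample
        traj[i] = w
    return traj

# Decision at time n: compare w(n) to a threshold gamma, e.g. the
# mid-point gamma = (D10 - D01) / 2, and decide H1 if w(n) > gamma.
```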
As discussed in the previous section, in the literature of adaptation and learning, decision problems have been approached by exploiting schemes and protocols initially conceived for estimation problems. Since decision and estimation problems are structurally different in many respects, it makes sense to start anew, with the goal of exploring possible alternatives to the LMS component with better performance for decision tasks. We argue that classical quickest detection tools, if properly adapted, are better suited to solving decision problems than the LMS algorithm shown in (3). In the following, we advocate the use of repeated Page's tests to detect multiple changes in the state of nature. The approach is close in spirit to that adopted in the detection of transients, see, e.g., [24][25][26][27] and references therein.

Let us start by considering a standard binary decision problem in which IID data $\{x(i)\}_{i \ge 1}$ are observed and the following binary hypothesis test must be solved:

$H_0: x(i) \sim f_0(x), \qquad H_1: x(i) \sim f_1(x),$  (6)

where $f_0(x)$ and $f_1(x)$ are the probability density functions (PDFs) of the data under the hypotheses $H_0$ and $H_1$, respectively. These PDFs are assumed to exist and are known to the decision maker. We adopt a non-Bayesian framework, in which the hypotheses represent two different probability spaces. In the complementary setting of the Bayesian formulation, we are given a single probability space with a-priori probabilities assigned to the hypotheses, and the goal is to update these priors online by exploiting the data stream. See [28] and the references therein for a recent perspective, and [29] for the Bayesian counterpart of the quickest detection procedures exploited in the following. It is well-known that, under the most popular optimality criteria, the optimal decision maker for (6) exploits the log-likelihood of the data [30], which is

$d(i) = \log \frac{f_1(x(i))}{f_0(x(i))}.$  (7)

In view of the IID property of the observations, the optimal decision based on the vector $[x(1), \dots, x(n)]^T$ is

$z_O(n) = \sum_{i=1}^{n} d(i)\ \underset{H_0}{\overset{H_1}{\gtrless}}\ \gamma,$  (8)

where γ is a suitable threshold, chosen according to the desired optimality criterion [30]. Expression (8) can be regarded as a random walk:

$z_O(i) = z_O(i-1) + d(i), \qquad z_O(0) = 0,$  (9)

with step d(i). Note that the observation model in (6) is very general, and the linearity of (9) is a consequence of (7) and the IID assumption. The following relationships are well known [the argument "(i)" is suppressed when non-essential]:

$E_1 d = D_{10} > 0, \qquad E_0 d = -D_{01} < 0,$  (10)

where

$D_{10} = \int f_1(x) \log \frac{f_1(x)}{f_0(x)}\, dx, \qquad D_{01} = \int f_0(x) \log \frac{f_0(x)}{f_1(x)}\, dx$

denote the two Kullback-Leibler (KL) distances (or divergences) between the PDFs $f_1(x)$ and $f_0(x)$ [31]. In (10) we have assumed that these KL distances exist and are strictly positive, which implies that $f_1(x)$ and $f_0(x)$ are distinct over a set of nonzero probability.

Assumptions. The following assumptions are used throughout the paper. Under $H_h$, h = 0, 1, the random variables $\{d(i)\}_{i \ge 1}$ are continuous, with finite first- and second-order moments, and their probability distribution admits a density with respect to the usual Lebesgue measure. In addition, the KL divergences $D_{10}$ and $D_{01}$ in (10) are finite and strictly positive.

In (8), we see that the optimal decision maker compares to a threshold the value of a random walk with positive drift under $H_1$ and negative drift under $H_0$. This optimal learning scheme is not adaptive, as is easily revealed by the following informal argument. Suppose that the state of nature is $H_1$ for time steps $1 \le i \le n$, and then switches to $H_0$. Observations are IID under each hypothesis, but their distribution is different under the two hypotheses. Assuming $n \gg 1$, with high probability the decision statistic $z_O(n)$ takes on very large values, because the random walk is drifting to $+\infty$ for $1 \le i \le n$. For $i > n$ the drift is negative, but the random walk "starts" at $z_O(n)$, implying that the time required to approach the negative values that are typical of hypothesis $H_0$ is very large.
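The inertia argument is easy to visualize numerically. The short simulation below is a sketch of ours under assumed Gaussian hypotheses N(0,1) versus N(1,1) (not an example from the paper): it runs the unbounded statistic (9) across a change H1 → H0 and measures how long the walk takes to return below zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1.0, 500                        # H1: N(m, 1) vs H0: N(0, 1); change at epoch n

# Data: H1 in force for the first n samples, then H0 forever after.
x = np.concatenate([rng.normal(m, 1.0, n), rng.normal(0.0, 1.0, 4 * n)])

d = m * (x - m / 2.0)                  # log-likelihood ratio steps, Gaussian case (7)
z = np.cumsum(d)                       # unbounded random walk z_O(i) of (9)

# z_O(n) ~ n * D10 = n * m^2 / 2, so after the change the negative drift -D01
# needs on the order of n further samples to pull the statistic below zero.
recovery = np.argmax(z[n:] < 0)        # first post-change index with z < 0
print(f"z at change: {z[n-1]:.1f}, samples to return below 0: {recovery}")
```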
A straightforward way to prevent $\{z_O(i)\}_{i \ge 0}$ from reaching extreme values is to introduce two barriers $-a < 0 < b$, as follows: $z(0) = 0$ and, for $i \ge 1$:

$z(i) = \begin{cases} -a, & z(i-1) + d(i) \le -a, \\ z(i-1) + d(i), & -a < z(i-1) + d(i) < b, \\ b, & z(i-1) + d(i) \ge b. \end{cases}$  (11)

A more compact expression for the iteration in (11) is: $z(0) = 0$ and

$z(i) = \max\big\{ -a,\ \min\{ b,\ z(i-1) + d(i) \} \big\}, \qquad i \ge 1.$  (12)

The lower and upper barriers limit the range of values that $|z(i)|$ takes on, and hence we can trade off adaptation and learning by a careful choice of a and b. Large values favor the learning (decision) ability of the system, while small values favor its adaptation ability. In the following, the decision procedure based on comparing (11) to a threshold $\gamma \in (-a, b)$ will be referred to as the barrier log-likelihood ratio (BLLR) test: the BLLR decision at any arbitrary time epoch $n \ge 1$ is

$z(n)\ \underset{H_0}{\overset{H_1}{\gtrless}}\ \gamma.$  (13)

The BLLR decision procedure (13), with z(n) shown in (11) or (12), represents the proposed alternative to the one based on the LMS engine. For easy reference, the LMS and BLLR procedures are summarized in Algorithms 1 and 2, respectively. Note that the computational complexities of the two procedures are comparable: at any iteration, only a function evaluation [to obtain d(i) from the measurement x(i)] and a weighted sum (for LMS), or a sum followed by two threshold comparisons (for BLLR), are required. Typical choices for the threshold appearing in (13) are: γ = 0, which in the unbounded case of a, b → ∞ corresponds to the maximum likelihood (ML) decision criterion (also corresponding to the Bayesian criterion when the two hypotheses are equally likely); the mid-point threshold $\gamma = \frac{b-a}{2}$; or the value of γ for which the probability of deciding $H_1$ under state of nature $H_0$ takes on a prescribed value (false alarm criterion), as in the Neyman-Pearson formulation [30]. Likewise, when considering the LMS iterate (3), a test similar to (13) is used and the threshold is chosen with the same criteria. For LMS, the mid-point threshold is $\gamma = (D_{10} - D_{01})/2$.

Suppose we know that $H_0$ is in force for $1 \le i \le n$ and $H_1$ is in force for all $i > n$, with IID data under each hypothesis, and that the change epoch n is unknown. The celebrated Page's test for quickest detection of a change in the state of nature is obtained from (11) by setting a = 0 and letting b be the decision threshold: once the decision statistic hits the value b, the change in the state of nature $H_0 \to H_1$ is declared and the test stops [22]. This reveals that the BLLR test (11) is a generalization of Page's test. Indeed, the BLLR test in (11) can be seen as an infinite sequence of Page's tests for successively detecting the changes $H_0 \to H_1 \to H_0 \to H_1 \cdots$. To see this, let us assume that $H_0$ is true and set $z(0) = -a$. The BLLR test is equivalent to a Page's test with threshold a + b for detecting the change $H_0 \to H_1$, followed by a sign-reversed Page's test initialized at b, driven by negative drifts, with threshold a + b, to detect the successive change $H_1 \to H_0$, and so forth indefinitely. (We are making the simplifying assumption that the hits at the thresholds are really due to changes in the state of nature, not to error events. It is also assumed that the duration of each stage is sufficiently long: the analysis in the presence of repeated rapid switches between the hypotheses must be carried out by different tools.) In turn, Page's test can be regarded as a sequence of Wald's SPRTs (sequential probability ratio tests) [32, 33], which reveals that the decision algorithm shown in (11) and (13) is a modified version of a sequence of SPRTs.
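A minimal sketch of the BLLR recursion (12) and test (13) follows; it mirrors Algorithm 2 as described in the text (our rendering, with illustrative variable names).

```python
import numpy as np

def bllr_decision_statistic(d, a, b):
    """BLLR recursion (12): z(i) = max(-a, min(b, z(i-1) + d(i))), with z(0) = 0.

    d    : sequence of log-likelihood ratios d(i) computed from the data x(i)
    a, b : positive barriers confining the statistic to the strip [-a, b]
    Returns the trajectory z(1), ..., z(n).
    """
    z = 0.0
    traj = np.empty(len(d))
    for i, di in enumerate(d):
        z = max(-a, min(b, z + di))   # random-walk step, clipped at the barriers
        traj[i] = z
    return traj

# BLLR test (13): at any epoch n, decide H1 if z(n) > gamma and H0 otherwise,
# with gamma in (-a, b), e.g. gamma = 0 or the mid-point (b - a) / 2.
```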
Not surprisingly, the performance analysis of BLLR relies on standard results of sequential analysis, some of which are collected in Appendix A-Appendix C for self-consistency. In view of the analogy with sequential analysis, our approach is close in spirit to the SPRT approach pursued in [34] for cooperative sensing. As done in [34], in Sec. 7 we resort to the GLRT (generalized likelihood ratio test) approach to deal with the presence of unknown parameters. However, the nature of these parameters and the corresponding estimates are structurally different from those in [34], resulting in substantially different decision procedures. Introducing barriers to avoid extreme values of the likelihood also appears in the development of robust approaches to signal detection, as pioneered by Huber in 1965 [35], see, e.g., [30, Sect. III.E.2]. Indeed, the limiting barriers in (11) can be regarded as a soft-limiter nonlinearity, and the structure of robust detectors often involves suitably-designed, possibly time-varying, nonlinearities [36]. However, robust detection is concerned with uncertainty in the statistical model of the data, which is a different problem. Note that in our case the nonlinearity is applied to the random walk iterations, not to the data, see (11). Besides, time-varying barriers have also been proposed to implement nonparametric/truncated SPRTs [24, 36]; again, these formulations address problems different from that under consideration.

The performance of quickest detectors is usually expressed in terms of mean time between false alarms and mean decision delay, see, e.g., [22]. In the following, we introduce modified versions of these quality indexes that are better suited to the learning and adaptation scenario under consideration. The analysis exploits standard tools from quickest detection and sequential analysis [22, 23, 32, 33, 37], and results for random walks evolving in a strip [38, 39]. We introduce two performance indexes: the error rate r, related to the learning capability, and the expected delay ∆, which quantifies the adaptation capability.

Let us consider the learning aspects first. Perhaps the most natural performance figures would be the probability that $\lim_{i\to\infty} z(i)$ exceeds γ under $H_0$, and the probability that $\lim_{i\to\infty} z(i)$ goes below γ under $H_1$. In general, these steady-state probabilities are guaranteed to exist [38, 39]; however, they are not easy to compute and do not lead to simple closed-form expressions from which insights can be easily gained. We instead introduce performance figures whose computation is tractable. For h = 0, 1, consider the following quantity, defined with the state of nature $H_h$ held fixed:

$T_h(z_0; z_1) = E_h\big[ \min\{ n \ge 1 : z(n) \ge z_1 \}\ \big|\ z(0) = z_0 \big],$  (14)

wherein the sign ≥ applies if $z_0 < z_1$, and ≤ applies if $z_0 > z_1$. The quantity in (14) represents the expected time to reach the value $z_1$ starting from $z_0$, under hypothesis $H_h$. Using (14), the learning ability of the system is quantified by the two indexes

$T_0(-a; \gamma) \quad \text{and} \quad T_1(b; \gamma).$  (15)

The interpretation is as follows. The quantity $T_0(-a; \gamma)$ represents the expected time to cross the threshold γ, yielding a decision in favor of $H_1$, when the system operates in $H_0$ steady-state and the BLLR iteration is initialized to $z(0) = -a$. The value −a is referred to as the "typical" value taken by the statistic under $H_0$. Likewise, $T_1(b; \gamma)$ represents the expected time to cross the threshold, yielding the $H_0$ decision, when the system operates in $H_1$ steady-state and $z(0) = b$, where b is the "typical" value under $H_1$.
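The hitting times (14)-(15) are straightforward to estimate by simulation. The helper below is a minimal Monte Carlo sketch of ours (not part of the paper's toolchain); the function name, defaults, and sampler interface are illustrative.

```python
import numpy as np

def expected_hitting_time(z0, z1, draw_d, a, b, runs=1000, rng=None):
    """Monte Carlo estimate of T_h(z0; z1) in (14): mean number of BLLR steps
    (12) needed to reach level z1 from z(0) = z0, with the state of nature H_h
    held fixed (H_h is encoded by the sampler draw_d).

    draw_d : callable(rng) returning one log-likelihood step d(i) under H_h
    """
    rng = rng or np.random.default_rng()
    upward = z0 < z1                 # ">= z1" applies if z0 < z1, "<= z1" otherwise
    total = 0
    for _ in range(runs):
        z, n = z0, 0
        while True:
            n += 1
            z = max(-a, min(b, z + draw_d(rng)))
            if (upward and z >= z1) or (not upward and z <= z1):
                break
        total += n
    return total / runs
```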
Note that these quantities are related to - but different from - the expected time between false alarms and miss detections, respectively. The expected error time is defined in terms of the quantities in (15):

$T_{\rm err} = \frac{T_0(-a; \gamma) + T_1(b; \gamma)}{2},$  (16)

and the error index quantifying the learning ability is its inverse, which we call the rate: $r = 1/T_{\rm err}$.

The second performance index ∆ quantifies the adaptation ability and is again defined in terms of $T_h(z_0; z_1)$ shown in (14). Specifically, we consider:

$T_1(-a; b) \quad \text{and} \quad T_0(b; -a),$  (17)

and

$\Delta = \frac{T_1(-a; b) + T_0(b; -a)}{2}.$  (18)

For the decision statistic z(n) initialized at z(0) = −a, $T_1(-a; b)$ represents the expected time needed to hit the barrier b for the first time, under $H_1$ steady-state. Likewise, $T_0(b; -a)$ is the expected time for the decision statistic z(n), taking value b at epoch 0, to hit the barrier −a for the first time, under $H_0$ steady-state. The expected delay ∆ is defined as the arithmetic mean of these two quantities.

In the previous discussion, the "typical" value of the statistic under $H_0$ is −a, and the "typical" value under $H_1$ is b. These choices are natural because −a and b are barriers. With these choices, as we shall see soon, we obtain simple closed-form expressions for the operational characteristic (r, ∆) of the decision maker. However, when comparing the performance of BLLR with that of the LMS test, sensible performance indexes for the BLLR decision maker are obtained by replacing in (15)-(18) the "typical" values of the decision statistic under the two hypotheses by the corresponding expected values:

$-a\ \to\ E_0\, z(\infty), \qquad b\ \to\ E_1\, z(\infty),$  (19)

wherein we define $z(\infty) = \lim_{i\to\infty} z(i)$. The distribution of $z(\infty)$ is investigated, e.g., in [38].

The performance indexes of the LMS test are defined in a way similar to those of BLLR, with the notable difference that, in the absence of barriers, one cannot define the typical values of the statistic under the two hypotheses as done before, and we instead rely upon expected values. To elaborate, assuming a steady-state hypothesis $H_h$, let us introduce the quantity:

$T^{\rm lms}_h(w_0; w_1) = E_h\big[ \min\{ n \ge 1 : w(n) \ge w_1 \}\ \big|\ w(0) = w_0 \big],$  (20)

wherein the sign ≥ applies if $w_0 < w_1$, and ≤ applies if $w_0 > w_1$. Recall that $\{w(n)\}_{n \ge 0}$ is defined in (3), and note the superscript distinguishing quantities related to the LMS test from the corresponding quantities referring to BLLR. As error performance indexes for LMS we consider the quantities $T^{\rm lms}_0(-D_{01}; \gamma)$ and $T^{\rm lms}_1(D_{10}; \gamma)$. The rationale is obvious: recalling (5), for h = 0, 1, when the state of nature is $H_h$ and assuming that the iteration starts from $E_h[w(\infty)]$, we compute the expected time required to cross the threshold and therefore decide for the opposite hypothesis $H_{1-h}$. Using $T^{\rm lms}_0(-D_{01}; \gamma)$ and $T^{\rm lms}_1(D_{10}; \gamma)$, we define the expected error time $T^{\rm lms}_{\rm err}$ as the arithmetic mean of these two quantities, and the error rate as its inverse:

$r^{\rm lms} = \frac{1}{T^{\rm lms}_{\rm err}}, \qquad T^{\rm lms}_{\rm err} = \frac{T^{\rm lms}_0(-D_{01}; \gamma) + T^{\rm lms}_1(D_{10}; \gamma)}{2}.$  (21)

Likewise, introducing $T^{\rm lms}_1(-D_{01}; D_{10})$ and $T^{\rm lms}_0(D_{10}; -D_{01})$, we define the expected delay as

$\Delta^{\rm lms} = \frac{T^{\rm lms}_1(-D_{01}; D_{10}) + T^{\rm lms}_0(D_{10}; -D_{01})}{2}.$  (22)

The performance of the BLLR test can be computed by borrowing results from the analysis of Page's test. To show this, it is convenient to introduce a version $\{z_P(i)\}_{i \ge 0}$ of the iteration (12) with arbitrary starting point c and a single lower barrier at 0. This is exactly the celebrated Page's test for quickest detection [40]: $z_P(0) = c \ge 0$, and

$z_P(i) = \max\{ 0,\ z_P(i-1) + d(i) \}, \qquad i \ge 1.$  (23)

For $\gamma_P > 0$, let us define the average run length (ARL):

$L_h(c; \gamma_P) = E_h\big[ \min\{ n \ge 1 : z_P(n) \ge \gamma_P \}\ \big|\ z_P(0) = c \big],$  (24)

computed under steady-state hypothesis $H_h$, h = 0, 1. In Appendix A, an exact expression for $L_h(c; \gamma_P)$ is derived, involving integral equations. Since not much physical insight is gained from these integral representations, we opt for relying on approximate, but simpler and closed-form, performance formulas for $L_h(0; \gamma_P)$.
These formulas are derived in Appendix B, exploiting standard Wald's approximations [32, 33]. The final result is [22, Eq. 5.2.44]:

$L_0(0; \gamma_P) \approx \frac{e^{\gamma_P} - \gamma_P - 1}{D_{01}}, \qquad L_1(0; \gamma_P) \approx \frac{e^{-\gamma_P} + \gamma_P - 1}{D_{10}}.$  (25)

For large $\gamma_P$, expressions (25) simplify to:

$L_0(0; \gamma_P) \approx \frac{e^{\gamma_P}}{D_{01}}, \qquad L_1(0; \gamma_P) \approx \frac{\gamma_P}{D_{10}}.$  (26)

For Page's test, $L_0(0; \gamma_P)$ represents the mean time between false alarms and $L_1(0; \gamma_P)$ the worst mean delay for detection [22, Eqs. 5.2.18, 5.2.19]. Note the role played by the KL divergences $D_{10}$ and $D_{01}$. The larger $D_{10}$ and $D_{01}$ are, the smaller $L_1(0; \gamma_P)$ and $L_0(0; \gamma_P)$ become, respectively. The former has a positive impact on the performance, while the latter has a negative impact. Differently from the classical hypothesis testing problem, where an increase of either or both $D_{10}$ and $D_{01}$ yields enhanced performance, in the quickest detection problem enhanced performance is obtained by increasing $D_{10}$ and/or $1/D_{01}$, as seen in (25).

It is easy to express the performance indexes introduced in Sec. 4.1.1 in terms of the ARL $L_h(0; \gamma_P)$ shown in (25). Consider first the case in which the random walk $\{z(i)\}_{i \ge 0}$ starts at $z(0) = -a$. For the quantities on the left of (15) and (17) we have the obvious equalities:

$T_0(-a; \gamma) = L_0(0; a + \gamma),$  (27)
$T_1(-a; b) = L_1(0; a + b).$  (28)

When the random walk $\{z(i)\}_{i \ge 0}$ starts at $z(0) = b$, we consider the reversed process $\{z^-(i)\}_{i \ge 0}$, obtained by replacing in (11) the sequence of log-likelihoods $\{d(i)\}_{i \ge 1}$ with its sign-reversed counterpart $\{-d(i)\}_{i \ge 1}$. Then, for the quantity $T_1(b; \gamma)$ appearing on the right of (15), we have

$T_1(b; \gamma) = T^-_1(-b; -\gamma)$  (29a)
$\phantom{T_1(b; \gamma)} = L^-_1(0; b - \gamma)$  (29b)
$\phantom{T_1(b; \gamma)} \approx \frac{e^{\,b-\gamma} - (b - \gamma) - 1}{D_{10}},$  (29c)

where (29a) follows by symmetry; the minus sign "−" appended to the ARL in (29b) refers to a "reversed" Page's test in which the sequence $\{d(i)\}_{i \ge 1}$ appearing in (23) is replaced by $\{-d(i)\}_{i \ge 1}$; and (29c) follows by noting that the ARL $L^-_1(0; \gamma_P)$ is the same as the ARL of a standard Page's test evolving under $H_0$, with steps whose expectation is $-D_{10}$. Similar arguments lead to

$T_0(b; -a) \approx \frac{e^{-(a+b)} + (a + b) - 1}{D_{01}}.$  (30)

In the case that the threshold is at the midpoint between the barriers, $\gamma = \frac{b-a}{2}$, the performance figures in (27)-(30) can be expressed in terms of the range $R \triangleq a + b$ of the confined random walk. Assuming $\gamma = \frac{b-a}{2}$ and recalling the definitions of expected error time and expected delay in (16) and (18), we get

$T_{\rm err} \approx \frac{e^{R/2} - R/2 - 1}{D_{\rm eff}},$  (31)

$\Delta \approx \frac{R + e^{-R} - 1}{D_{\rm eff}},$  (32)

where we have defined the effective divergence between the hypotheses as

$D_{\rm eff} \triangleq \frac{2\, D_{10}\, D_{01}}{D_{10} + D_{01}}.$  (33)

The inverse of $T_{\rm err}$ in (31) is the error rate:

$r \approx \frac{D_{\rm eff}}{e^{R/2} - R/2 - 1}.$  (34)

For $R \gg 1$, we obtain

$r \approx D_{\rm eff}\, e^{-R/2}, \qquad \Delta \approx \frac{R}{D_{\rm eff}},$  (35)

yielding a simple expression for the operational characteristic (r, ∆) of the BLLR test:

$r \approx D_{\rm eff}\, e^{-\Delta D_{\rm eff}/2}.$  (36)

The previous equations are insightful. First, the system performance depends on the statistical distributions of the hypotheses only through the effective divergence $D_{\rm eff}$ in (33), which is a harmonic-type combination of the two KL distances $D_{10}$ and $D_{01}$. For $D_{01} = D_{10} = D$, we get $D_{\rm eff} = D$. If one of the two KL distances is much larger than the other, $D_{\rm eff}$ reduces to twice the smallest. The effective divergence $D_{\rm eff}$ appears in the operational characteristic r(∆) in (36), which quantifies the fundamental trade-off of the decision procedure. Note that r(∆) is independent of the barriers a and b, as long as $(a + b) \gg 1$. As it must be, the function r(∆) is strictly decreasing and convex in ∆. For a fixed ∆, r grows with $D_{\rm eff}$ as long as $D_{\rm eff} < 2/\Delta$, while it is a decreasing function of $D_{\rm eff}$ for $D_{\rm eff} > 2/\Delta$. This behavior is to be interpreted in light of the comments reported at the end of Sec. 4.2.
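Since (31)-(36) are closed-form, they are trivial to evaluate. The helper below is a sketch of ours implementing the expressions as reconstructed above (equation numbers refer to this text); it can be used to draw theoretical (∆, r) curves.

```python
import numpy as np

def bllr_operating_point(D10, D01, R):
    """Evaluate (31)-(34) for the mid-point threshold gamma = (b - a) / 2,
    with R = a + b the range of the confined random walk."""
    D_eff = 2.0 * D10 * D01 / (D10 + D01)              # effective divergence (33)
    T_err = (np.exp(R / 2.0) - R / 2.0 - 1.0) / D_eff  # expected error time (31)
    delta = (R + np.exp(-R) - 1.0) / D_eff             # expected delay (32)
    return 1.0 / T_err, delta                          # error rate (34) and delay

def bllr_rate_large_R(D_eff, delta):
    """Large-R operational characteristic (36): r ~ D_eff * exp(-delta * D_eff / 2)."""
    return D_eff * np.exp(-delta * D_eff / 2.0)
```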
Simple closed-form approximations for the test performance, similar to those shown in (27)-(30), are not available in the case of the LMS iteration (3). The technical difficulty is that $\{w(i)\}_{i \ge 0}$ is not a random walk, and the stopped martingale approach illustrated in Appendix B does not apply. However, the performance of LMS can be expressed by Fredholm integral equations similar to those in (A.3). To show this, we follow the approach of [41], with reference to the error figure $T^{\rm lms}_0(-D_{01}; \gamma)$ defined in Sec. 4.1.2, see (20). The hypothesis in force is $H_0$, the iteration $\{w(i)\}_{i \ge 0}$ is initialized to $w(0) = -D_{01}$, and the event of crossing the threshold γ is considered. At the first step of the iteration, two mutually exclusive events can occur. Either d(1) causes a threshold crossing, i.e., $w(1) = \mu d(1) + (1-\mu) w(0) > \gamma$, an event whose probability we denote by p, or $w(1) = \mu d(1) + (1-\mu) w(0) \le \gamma$. In the latter case the iteration restarts from w(1), and the additional expected run length is given by $T^{\rm lms}_0(w(1); \gamma)$. Conditioned on the absence of a crossing at the first step, the PDF of d(1) is $f_{d|{\rm cnd}}(\xi) = f_d(\xi)/(1-p)$ for $\xi \le \frac{\gamma - (1-\mu) w(0)}{\mu}$, and $f_{d|{\rm cnd}}(\xi) = 0$ otherwise, where $f_d(\xi)$ is the PDF of d under $H_0$. We obtain

$T^{\rm lms}_0(w(0); \gamma) = 1 + \int_{-\infty}^{\frac{\gamma - (1-\mu) w(0)}{\mu}} T^{\rm lms}_0\big( \mu\xi + (1-\mu) w(0); \gamma \big)\, f_d(\xi)\, d\xi.$  (37)

The average run length $T^{\rm lms}_0(w(0); \gamma)$ needed for the iteration $\{w(i)\}_{i \ge 0}$ with initial value w(0) to exceed the threshold γ can be computed by solving (37) numerically. The numerical solution to (37) used in the examples discussed in Sec. 6 is motivated by the arguments provided in Appendix C. The quantity $T^{\rm lms}_1(-D_{01}; D_{10})$ can be computed similarly to $T^{\rm lms}_0(-D_{01}; \gamma)$, while $T^{\rm lms}_1(D_{10}; \gamma)$ and $T^{\rm lms}_0(D_{10}; -D_{01})$ require considering the reversed process whose steps are $\{-d(i)\}_{i \ge 1}$. The details are omitted.

Consider the following hypotheses with IID observations $\{x(i)\}_{i \ge 1}$: for i = 1, 2, ...,

$H_0: x(i) \sim N(0, \sigma^2), \qquad H_1: x(i) \sim N(m, \sigma^2),$

where $N(m, \sigma^2)$ denotes a Gaussian distribution with mean m and variance $\sigma^2$. It is easily seen that the log-likelihood is

$d(i) = \frac{m}{\sigma^2}\Big( x(i) - \frac{m}{2} \Big),$

yielding $D_{10} = D_{01} = \frac{m^2}{2\sigma^2}$. In the present experiment we assume $\sigma^2 = 1$, m = 1/2 and γ = 0, which is also the midpoint threshold because $D_{10} = D_{01}$. For the LMS test, different values of the step-size in the range from $8.5 \times 10^{-3}$ to 0.3 are considered. In the case of the BLLR test, with little loss of generality, we set the barriers as follows:

$a = \mu^{-1} D_{01}, \qquad b = \mu^{-1} D_{10}.$

In this way, we use a single parameter µ for both the LMS and BLLR decision algorithms, with the meaning that smaller values of µ imply slower adaptation. Special attention is devoted to the slow-adaptation regime $\mu \ll 1$.

Figure 1 shows the results of computer simulations for both decision schemes. [Fig. 1 caption: "LMS simulation" refers to computer experiments with $10^3$ Monte Carlo runs; the remaining curves are described in the text.] For BLLR we also show the theoretical performance given in (32) and (34) ("theory"), as well as the large-R expression in (36) ["theory (large R)"]. For the LMS decision scheme we also show the performance obtained by solving (37) numerically, as detailed in Appendix C ("LMS numerical"). The figure confirms the accuracy of the theoretical formulas for performance prediction. The superiority of the BLLR decision algorithm is evident, at least in the regime of slow adaptation (large values of ∆). However, recall from the discussion in Sec. 4.1.1 that a fair comparison between the two decision schemes requires modifying the performance indexes of BLLR as indicated in (19). The expectations shown in (19) have been computed numerically, and the resulting operational characteristic is shown by the gray curve in Fig. 1, labelled "theory (correction)". The substantial superiority of BLLR in the low-adaptation regime is confirmed: for large ∆, we see that the rate r scales exponentially with the delay ∆ for both decision schemes, but the exponent is substantially larger for the BLLR decision algorithm.
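A Figure-1-style point can be reproduced by simulation. The sketch below is ours: it assumes the expected_hitting_time helper from the earlier sketch, the barrier convention $a = \mu^{-1} D_{01}$, $b = \mu^{-1} D_{10}$, and definitions (16) and (18) as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, mu = 0.5, 0.05
D = m ** 2 / 2.0              # D10 = D01 = m^2 / (2 sigma^2), with sigma = 1
a = b = D / mu                # barrier choice of Sec. 6

d_H1 = lambda r: m * (r.normal(m, 1.0) - m / 2.0)    # LLR step under H1
d_H0 = lambda r: m * (r.normal(0.0, 1.0) - m / 2.0)  # LLR step under H0

# Expected delay (18) and expected error time (16), mid-point threshold gamma = 0.
delta = 0.5 * (expected_hitting_time(-a, b, d_H1, a, b, rng=rng)
               + expected_hitting_time(b, -a, d_H0, a, b, rng=rng))
T_err = 0.5 * (expected_hitting_time(-a, 0.0, d_H0, a, b, rng=rng)
               + expected_hitting_time(b, 0.0, d_H1, a, b, rng=rng))
print(f"delta ~ {delta:.0f} samples, error rate r ~ {1.0 / T_err:.2e}")
```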
Recall the definition of the Gamma function: $\Gamma(\alpha) = \int_0^\infty \xi^{\alpha-1} e^{-\xi}\, d\xi$, with α > 0. Let κ, θ > 0. With a slight abuse of notation, we use the symbol $x \sim \Gamma(\kappa, \theta)$ to signify that x is a Gamma-distributed random variable whose PDF is

$f_\Gamma(x) = \frac{x^{\kappa-1}\, e^{-x/\theta}}{\Gamma(\kappa)\, \theta^\kappa}, \qquad x > 0.$

For $x \sim \Gamma(\kappa, \theta)$, it follows that $y = \log x \sim L\Gamma(\kappa, \theta)$, which is called the log-Gamma distribution, having the following PDF:

$f_{L\Gamma}(y) = \frac{e^{\kappa y}\, e^{-e^y/\theta}}{\Gamma(\kappa)\, \theta^\kappa}, \qquad y \in \mathbb{R}.$

For $y \sim L\Gamma(\kappa, \theta)$, we have $E y = \log\theta + \psi(\kappa)$, where $\psi(\kappa) = \frac{d}{d\kappa} \log\Gamma(\kappa)$ is known as the digamma function. We now consider the following hypotheses with IID observations $\{x(i)\}_{i \ge 1}$ and ρ > 0:

$H_0: x(i) \sim \Gamma(\kappa, \theta), \qquad H_1: x(i) \sim \Gamma(\kappa + \rho, \theta).$

Simple algebra shows that the corresponding log-likelihood is

$d(i) = \rho \log x(i) - \rho \log\theta - \log\frac{\Gamma(\kappa+\rho)}{\Gamma(\kappa)},$

and therefore, with obvious notation, d is an affine transformation of a log-Gamma variable under both hypotheses, yielding

$D_{10} = \rho\, \psi(\kappa+\rho) - \log\frac{\Gamma(\kappa+\rho)}{\Gamma(\kappa)}, \qquad D_{01} = \log\frac{\Gamma(\kappa+\rho)}{\Gamma(\kappa)} - \rho\, \psi(\kappa).$

In this experiment the barriers for the BLLR decision scheme are $a = \mu^{-1} D_{01}$ and $b = \mu^{-1} D_{10}$, where µ is the step-size of the LMS algorithm, and the threshold is $\gamma = \frac{b-a}{2}$. We assume θ = 1, κ = 10 and ρ = 1. The results of computer experiments for the BLLR and the LMS decision algorithms are shown in Fig. 2. The comments are similar to those of the Gaussian example. In a nutshell: the formulas for performance prediction are very accurate, and the BLLR algorithm outperforms LMS, at least in the small-adaptation regime of large delays.

Our last example involves exponentially distributed observations: $x \sim E(\eta)$ with PDF $f_E(x) = \frac{1}{\eta}\, e^{-x/\eta}$, for x > 0 and η > 0. The two hypotheses are

$H_0: x(i) \sim E(\eta_0), \qquad H_1: x(i) \sim E(\eta_1),$

with $\eta_1 > \eta_0 > 0$. Defining $\eta_e = \eta_1/\eta_0 > 1$, the log-likelihood

$d(i) = -\log\eta_e + \frac{x(i)}{\eta_0}\Big( 1 - \frac{1}{\eta_e} \Big)$

is a shifted and scaled exponential variable under either hypothesis, and we have: $D_{10} = \eta_e - 1 - \log\eta_e$ and $D_{01} = \eta_e^{-1} - 1 + \log\eta_e$. In this experiment we assume $\eta_0 = 1$ and $\eta_1 = 1.5$. As for the Gamma example, the barriers for the BLLR decision scheme are $a = \mu^{-1} D_{01}$ and $b = \mu^{-1} D_{10}$, where µ is the step-size of the LMS algorithm, and we use the mid-point threshold $\gamma = \frac{b-a}{2}$. The results of computer simulations compared to the theoretical formulas are shown in Fig. 3. The comments are similar to the previous cases, but in the exponential case the theoretical formulas are less accurate. The slope of the operational curve r(∆) seems correctly predicted by the analytical formulas, but a multiplicative correction appears to be necessary. This is a manifestation of the poor accuracy of Wald's approximations, which neglect the excess over the boundaries and which have been exploited to derive the theoretical formulas. Improvements in this regard are possible, e.g., via nonlinear renewal theory [23, Sec. 2.6], [37], but are not pursued here. In addition, in the exponential case, the theoretical operational characteristic of the LMS decision scheme is not reported, because of instability of the numerical procedure detailed in Appendix C to solve (37).
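For reference, the divergences of the last two examples can be evaluated numerically. The snippet below is a sketch of ours; note that the Gamma-example formulas rest on the shape-shift reading of the hypotheses reconstructed above, which should be treated as an assumption.

```python
import numpy as np
from scipy.special import digamma, gammaln

# Gamma example (theta = 1, kappa = 10, rho = 1), shape-shift hypotheses.
kappa, rho = 10.0, 1.0
lg = gammaln(kappa + rho) - gammaln(kappa)   # log of Gamma(kappa+rho)/Gamma(kappa)
D10_gam = rho * digamma(kappa + rho) - lg
D01_gam = lg - rho * digamma(kappa)

# Exponential example: eta_e = eta1 / eta0 = 1.5.
eta_e = 1.5
D10_exp = eta_e - 1.0 - np.log(eta_e)
D01_exp = 1.0 / eta_e - 1.0 + np.log(eta_e)

def d_eff(d10, d01):
    """Effective divergence (33)."""
    return 2.0 * d10 * d01 / (d10 + d01)

print(f"Gamma: D10={D10_gam:.4f}, D01={D01_gam:.4f}, D_eff={d_eff(D10_gam, D01_gam):.4f}")
print(f"Expon: D10={D10_exp:.4f}, D01={D01_exp:.4f}, D_eff={d_eff(D10_exp, D01_exp):.4f}")
```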
The COVID-19 pandemic, caused by the appearance in late 2019 of a new coronavirus known as SARS-CoV-2, is one of the most serious global crises of the last decades. The multifaceted nature of the threat demands an extraordinary effort from the scientific community to provide support to informed and rational decision making. In such a scenario, signal processing may play an important role, and several contributions from this community have recently appeared in the literature, including [42][43][44][45], co-authored by one of the present authors; see the references therein for useful entry points to the topical literature. In particular, in [44, 45], a quickest detection procedure - called MAST - is developed. MAST is aimed at detecting the onset of a pandemic phase characterized by an exponential growth of the contagion. The BLLR algorithm developed in the previous sections can be regarded as a repeated application of MAST to successively detect the growing/shrinking pandemic phases. One difference is that MAST is based on a surrogate of the log-likelihood (7), to cope with unknown parameters characterizing the statistics of real-world COVID-19 data. In the following, the analogous "GLRT" version of BLLR is developed, and its learning and adaptation properties are demonstrated on COVID-19 time-series and synthetic data.

Let us start by considering the classical SIR model of pandemic evolution introduced by Kermack and McKendrick in 1927 [46]. Let s(t), i(t) and r(t) denote the fractions of susceptible, infected, and recovered (or dead) individuals, respectively. Let β be the infection rate (infected individuals per unit time), and γ the recovery rate. The celebrated SIR equations are [47]:

$\frac{ds}{dt} = -\beta\, s\, i, \qquad \frac{di}{dt} = \beta\, s\, i - \gamma\, i, \qquad \frac{dr}{dt} = \gamma\, i,$  (49)

with the initial conditions r(0) = 0, s(0) = 1 − i(0). We assume $0 < i(0) \ll 1$, where i(0) represents the small fraction of the total population from which the infection originates. To motivate our model, let us focus on the situation in which the fraction of susceptible individuals can be considered approximately constant, $s(t) \approx s^*$, implying that the second equation in (49) reduces to

$\frac{di}{dt} = (\beta s^* - \gamma)\, i.$  (50)

Since data about the infections are typically collected on a daily basis, consider a discrete-time version of (50) with unit-step discretization (we loosely use the same notation i(·) for the time-discrete version):

$i(k) = (1 + \beta s^* - \gamma)\, i(k-1), \qquad k \ge 1,$  (51)

for some i(0) > 0. From (51) we see that the ratio i(k)/i(k−1) is constant. It is evident that the real-world phenomenon is governed by "random" versions of the previous deterministic equations. Accordingly, we model the ratio

$x(k) \triangleq \frac{i(k)}{i(k-1)}$  (52)

as a random variable. Precisely, we assume:

$i(k) = x(k)\, i(k-1), \qquad k \ge 1,$  (53)

where $\{x(k)\}_{k \ge 1}$ is a sequence of independent random variables. In light of (53), x(k) is referred to as the growth rate. By exploiting the publicly available data of the COVID-19 illness spread in Italy (freely downloadable at https://github.com/pcm-dpc/COVID-19/), we obtain the sequence of growth rates $\{x(k)\}_{k \ge 1}$ and verify that the x(k)'s are well-represented by Gaussian random variables; details can be found in [44, 45]. The expected value E x(k) is time-varying and unknown, and characterizes the specific phase of the pandemic: when the pandemic is under control - a situation here referred to as hypothesis $H_0$ - we have $m_0(k) \triangleq E_0\, x(k) \le 1$. Conversely, under the alternative hypothesis $H_1$, $m_1(k) \triangleq E_1\, x(k) > 1$ and the contagion grows exponentially fast. Upon observing the growth rate sequence $\{x(k)\}_{k=1}^n$, the BLLR machinery can be exploited to track online the successive pandemic phases, by deciding between $H_0$ and $H_1$ at any time instant n. Lacking knowledge of the sequences of mean values $\{m_0(k)\}_{k \ge 1}$ and $\{m_1(k)\}_{k \ge 1}$, one cannot compute the log-likelihood in (7) and the related BLLR statistic in (11). We then resort to a GLRT approach. As is well-known, the GLRT approach consists of replacing the unknown parameters appearing in the data distributions with their ML estimates, computed from the data themselves, separately for the two hypotheses $H_0$ and $H_1$; details can be found in [30, 48, 49].
The approach is similar to that pursued in [34] in the context of SPRT problems, but the estimates are structurally different. In our case the number of unknown parameters grows linearly in time, and the estimates are constrained to $m_0(k) \le 1$ and $m_1(k) > 1$, for all k. It is simple to see that

$\hat m_0(k) = \min\{ x(k),\, 1 \},$  (54)
$\hat m_1(k) = \max\{ x(k),\, 1 \}.$  (55)

Computing the log-likelihood yields:

$d(k) = \frac{[x(k) - m_0(k)]^2 - [x(k) - m_1(k)]^2}{2\sigma^2},$  (56)

where $\sigma^2$ denotes the common variance of the Gaussian growth rates. Substituting the ML estimates (54) and (55) in place of $m_0(k)$ and $m_1(k)$ in (56), one obtains

$d(k) = \operatorname{sign}\big( x(k) - 1 \big)\, \frac{[x(k) - 1]^2}{2\sigma^2}.$  (57)

Using d(k) shown in (57) in place of (7), we get the following "GLRT" version of BLLR: z(0) = 0 and

$z(k) = \max\big\{ -a,\ \min\{ b,\ z(k-1) + d(k) \} \big\}, \qquad k \ge 1.$  (58)

Clearly, the definition of d(k) in (57) can also be used in the LMS iteration (3) to obtain a "GLRT" version of LMS. These "GLRT" versions are exactly as described in Algorithms 2 and 1, but take a different sequence $\{d(k)\}$ as input.

We now demonstrate the GLRT versions of BLLR and LMS on COVID-19 time-series recorded in Italy. The top panel of Fig. 4 shows the growth rate sequence $\{x(n)\}_{n \ge 1}$, where n denotes the index of the day corresponding to the date on the abscissa, from February 25, 2020, to July 18, 2021. The bottom panel shows the BLLR and LMS decision statistics computed from $\{x(k)\}_{k=1}^n$. The standard deviation appearing in (57) is set to σ = 0.025. For BLLR, we use a = b = 10; for LMS, µ = 0.05. We assume a zero threshold for both decision statistics. In the case of BLLR, the first passage $H_1 \to H_0$ is declared on April 13, 2020, followed by the passage $H_0 \to H_1$ on July 19, a further $H_1 \to H_0$ on November 27, and so forth. By inspection of Fig. 4, the effectiveness in tracking pandemic phases seems confirmed. For LMS, the first occurrence of $H_1 \to H_0$ is declared on May 4, followed by the passage $H_0 \to H_1$ on July 25, followed by $H_1 \to H_0$ on December 20, and so forth. We see that BLLR tends to detect the onset of the pandemic phase earlier, namely it reacts more promptly to changes in the state of nature. Thus, BLLR appears to be a more sensible decision statistic than LMS, even in terms of GLRT versions for real-world pandemic data. These conclusions are in line with the results of Sec. 6, and are confirmed by the following performance analysis.

To investigate the performance of BLLR and LMS on COVID-19 pandemic data, we compute ∆ and r by computer experiments, assuming that the mean values of the data are IID random variables uniform in (1−ε, 1) under $H_0$, and uniform in (1, 1+ε) under $H_1$, for some 0 < ε < 1. Given the mean value, Gaussian data are generated with standard deviation σ = 0.025. In the case of BLLR, we explore various values of a = b. In the case of LMS, we compute $E_1 d = -E_0 d$ numerically, and explore various values of µ. In both cases, $10^3$ Monte Carlo runs are used and the mid-point threshold 0 is selected. The resulting operational characteristics are depicted in Fig. 5. For both decision schemes, the performance worsens for smaller values of ε, because the growth rate stays closer to unity in both pandemic phases, which makes it more difficult to distinguish between them. We also see that BLLR outperforms LMS, consistently with the results of Sec. 6, and its superiority is enhanced in the small-adaptation regime of large delays.
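The "GLRT" statistic (57)-(58) is simple to implement. The following sketch is ours (the function name is illustrative, and the synthetic series is an assumption, not the Italian dataset): it applies the GLRT-BLLR update to a toy growth-rate sequence with a controlled phase followed by a growing phase, using the parameter values quoted in this section.

```python
import numpy as np

def glrt_bllr(x, sigma, a, b):
    """'GLRT' BLLR, cf. (57)-(58): d(k) = sign(x(k)-1) * (x(k)-1)^2 / (2 sigma^2),
    followed by the clipped update z(k) = max(-a, min(b, z(k-1) + d(k))).

    x : daily growth rates x(k) = i(k) / i(k-1) computed from new-positives counts
    """
    z, traj = 0.0, np.empty(len(x))
    for k, xk in enumerate(x):
        d = np.sign(xk - 1.0) * (xk - 1.0) ** 2 / (2.0 * sigma ** 2)
        z = max(-a, min(b, z + d))
        traj[k] = z
    return traj

# Synthetic check: 100 days of controlled phase (mean rate 0.99) followed by
# 100 days of growth (mean rate 1.02); Gaussian rates with sigma = 0.025.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.99, 0.025, 100), rng.normal(1.02, 0.025, 100)])
z = glrt_bllr(x, sigma=0.025, a=10.0, b=10.0)
print("first day with z > 0 after the change:", 100 + np.argmax(z[100:] > 0))
```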
We propose a novel algorithm, called BLLR, for learning and adaptation in decision problems. The BLLR decision statistic consists of the data log-likelihood constrained to lie between pre-assigned thresholds, and can be interpreted as an infinite sequence of Page's tests. Performance analysis reveals that BLLR can outperform state-of-the-art LMS-based algorithms in the slow-adaptation regime. Application to COVID-19 pandemic data is discussed, using a "GLRT" version of BLLR. The proposed approach effectively tracks changes of pandemic phases, providing a rigorous tool to quickly detect the passage from a controlled regime, in which the number of new positives tends to decrease or be stable, to a critical regime of pandemic explosion, and vice versa. With the "GLRT" relaxation of BLLR, the statistical distribution of the data may contain unknown parameters. We believe that further relaxations of the formulation adopted in this paper, aimed at addressing non-parametric decision problems, are an interesting direction for future research. Also relevant to big-data applications is a data-centric approach, with the decision statistic learned from previously-recorded datasets. Formulations in terms of particle filtering can be useful to address these generalizations. The extension of the proposed method to multi-hypothesis testing also deserves further analysis. Finally, our study is limited to a single decision maker and paves the way for further investigations aimed at designing the diffusion step for a network of interconnected decision makers, much in the same way as the ATC diffusion rule has been advocated for use in combination with LMS.

Appendix A.

Page's test can be regarded as a set of parallel open-ended Wald's SPRTs, see [32, 33]. In view of (24), $L_h(0; \gamma_P)$ can be written as the sum of two contributions. The first is the average sample number (ASN) of the SPRT, given that the SPRT hits back the level 0 before crossing $\gamma_P$, multiplied by the expected number of such "back to 0" cycles (a random quantity which is easily seen to follow a geometric distribution with range {0, 1, 2, ...}, because any time the process hits 0 it probabilistically restarts). The second contribution is the ASN of the SPRT given that it crosses the threshold $\gamma_P$ before hitting back the 0 level. Denoting by $p_0$ the probability of hitting zero before crossing $\gamma_P$ (starting from 0), we see that the expected number of back-to-0 cycles is $p_0/(1-p_0)$. Then, denoting by $E_h n_0$ the ASN of the SPRT, we have:

$L_h(0; \gamma_P) = \frac{p_0}{1-p_0}\, E_h[\, n_0 \,|\, \text{back to } 0\,] + E_h[\, n_0 \,|\, \text{cross } \gamma_P\,].$  (A.1)

Using $L_h(0; \gamma_P)$, a similar equation for the case of c > 0 can also be derived. For the SPRT with lower threshold at 0, let $p_c$ be the probability of hitting zero before crossing $\gamma_P$ when the random walk starts from $z_P(0) = c$, and let $E_h n_c$ be the corresponding ASN. Starting from c, two mutually exclusive situations may occur: either $z_P(i)$ hits zero before crossing $\gamma_P$, which happens with probability $p_c$, or its complement occurs and $z_P(i)$ crosses $\gamma_P$ without hitting zero. In the former case the average run length equals the time needed to hit 0 plus $L_h(0; \gamma_P)$:

$L_h(c; \gamma_P) = p_c\, \big( E_h[\, n_c \,|\, \text{back to } 0\,] + L_h(0; \gamma_P) \big) + (1-p_c)\, E_h[\, n_c \,|\, \text{cross } \gamma_P\,].$  (A.2)

It can be shown that $p_c$ and $E_h n_c$ are solutions to the following Fredholm integral equations of the second kind [22, p. 168]: for $0 \le c \le \gamma_P$:

$p_c = \int_{-\infty}^{-c} f_d(\xi)\, d\xi + \int_0^{\gamma_P} p_y\, f_d(y - c)\, dy, \qquad E_h n_c = 1 + \int_0^{\gamma_P} E_h n_y\, f_d(y - c)\, dy,$  (A.3)

where $f_d(\xi)$ is the PDF of the log-likelihood d(i) shown in (7) under hypothesis $H_h$ (dependence not made explicit for notational simplicity). Among others, iterative methods have been proposed for solving these equations, see, e.g., [22, Eq. 5.2.28]. After computing $p_c$ and $E_h n_c$, we obtain $L_h(c; \gamma_P)$ from (A.1) and (A.2), and the ARL is found.

Appendix B.

Suppose the state of nature is $H_0$. Consider the iteration in (9), where d(i) is the log-likelihood of the i-th observation defined in (7), and append an "S" to indicate that we now consider an SPRT. Recall that the observations $\{x(i)\}_{i \ge 1}$ are IID. Clearly, $z_S(n) = \sum_{i=1}^{n} d(i)$.
The random process $e^{z_S(n)}$ is a martingale with respect to the sequence $\{d(i)\}_{i \ge 1}$ because:

$E_0\big[\, e^{z_S(n)} \,\big|\, d(1), \dots, d(n-1)\, \big] = e^{z_S(n-1)}\, E_0\, e^{d(n)} = e^{z_S(n-1)},$  (B.1)

where we used

$E_0\, e^{d} = \int f_0(x)\, \frac{f_1(x)}{f_0(x)}\, dx = 1,$  (B.2)

and because the regularity condition $E_0 |e^{z_S(n)}| < \infty$, ∀n, is obviously met. For some pair of thresholds $-\gamma_\ell < 0 < \gamma_u$, let

$n = \inf_{i \ge 1}\, \{\, i : z_S(i) \ge \gamma_u \ \text{or}\ z_S(i) \le -\gamma_\ell\, \}$  (B.3)

be a random time for the process $e^{z_S(n)}$. It can be shown that $P_0(n < \infty) = 1$, e.g., by considering that the sequence of descending ladder heights of $z_S(n)$ is non-terminating, see [50].

Appendix C.

In many cases of interest, an approximate numerical solution to (37) can be obtained by a simple iterative procedure, based on heuristic arguments. Let us start by considering a modified version of the problem, in which the iteration $\{w(i)\}_{i \ge 0}$ with initial point $w(0) = -D_{01}$ evolves up to crossing one of two thresholds $-\gamma_\ell < -D_{01}$ and $\gamma > -D_{01}$. The same arguments used to derive (37) lead to the Fredholm integral equation

$g(x) = 1 + \int_{\frac{-\gamma_\ell - (1-\mu)x}{\mu}}^{\frac{\gamma - (1-\mu)x}{\mu}} g\big( \mu\xi + (1-\mu)x \big)\, f_d(\xi)\, d\xi, \qquad x \in [-\gamma_\ell, \gamma].$  (C.1)

Defining the operator

$\mathcal{P}(g)(x) \triangleq 1 + \int_{\frac{-\gamma_\ell - (1-\mu)x}{\mu}}^{\frac{\gamma - (1-\mu)x}{\mu}} g\big( \mu\xi + (1-\mu)x \big)\, f_d(\xi)\, d\xi,$  (C.2)

the solution of (C.1) is a fixed point of the operator $\mathcal{P}$, namely $\mathcal{P}(T_0(x; \gamma)) = T_0(x; \gamma)$. Thus, the problem of solving (C.1) reduces to the problem of finding such a fixed point, provided that it exists and is unique. The Picard-Banach fixed point principle states that a mapping $\mathcal{P}$ of a complete metric space into itself has a unique fixed point $g^*$ provided that the mapping is a contraction [52, Sec. 9.7]. The operator $\mathcal{P}$ is called a contraction if

$\|\, \mathcal{P}(g_1) - \mathcal{P}(g_2)\, \| \le \delta\, \|\, g_1 - g_2\, \|$  (C.3)

for some 0 < δ < 1. In addition, for a contraction, the iteration

$g_n = \mathcal{P}(g_{n-1}), \qquad n \ge 1,$  (C.4)

converges to the fixed point $g^*$ with a rate of convergence

$\|\, g_n - g^*\, \| \le \frac{\delta^n}{1-\delta}\, \|\, g_1 - g_0\, \|,$  (C.5)

where $g_0 \in C[-\gamma_\ell, \gamma]$ is some initial guess [52, Eq. (9.21)]. Thus, (C.4) provides a simple numerical recipe to solve (C.1). To check that $\mathcal{P}$ is a contraction, using the supremum norm, note that

$\|\, \mathcal{P}(g_1) - \mathcal{P}(g_2)\, \| = \sup_x\, \big|\, [g_1 - g_2]\big( \mu x^* + (1-\mu)x \big)\, \big| \int_{\frac{-\gamma_\ell - (1-\mu)x}{\mu}}^{\frac{\gamma - (1-\mu)x}{\mu}} f_d(\xi)\, d\xi$  (C.6a)
$\le \|\, g_1 - g_2\, \|\, \sup_x \int_{\frac{-\gamma_\ell - (1-\mu)x}{\mu}}^{\frac{\gamma - (1-\mu)x}{\mu}} f_d(\xi)\, d\xi$  (C.6b)
$= \delta\, \|\, g_1 - g_2\, \|, \qquad \delta < 1,$  (C.6c)

where in (C.6a) $x^* \in [-\gamma_\ell, \gamma]$ and the equality follows by the mean value theorem for integrals [53, Th. 5, p. 352], while the first inequality in (C.6b) follows because the integrand in (C.6a) represents a valid PDF. The rigorous approach provided by the theory of fixed points does not apply directly to our case because in (37) we have $\gamma_\ell \to \infty$, and the derivation leading to (C.6c) fails for unbounded intervals, for which the mean value theorem for integrals does not apply. Numerical approaches to the solution of Fredholm integral equations defined over unbounded intervals have been proposed, see [54] and the references therein. However, for the sake of simplicity, here we limit ourselves to applying (C.4), checking convergence numerically. Summarizing, we use the following heuristic approach: start with some initial guess $g_0(x)$, $x \in (-\infty, \gamma]$, and iterate for $n \ge 1$:

$g_n(x) = 1 + \int_{-\infty}^{\frac{\gamma - (1-\mu)x}{\mu}} g_{n-1}\big( \mu\xi + (1-\mu)x \big)\, f_d(\xi)\, d\xi,$  (C.7)

up to convergence, if any. In the computer experiments presented in this work, the initial guess $g_0(x)$ is taken as a constant function, whose value equals the value of $T^{\rm lms}_0(-D_{01}; \gamma)$ obtained by simulations. The numerical evaluations of the performance of the LMS decision algorithm presented in this paper are based on (C.7) and similar iterates, using numerical integration for the integrals.

References

- Gossip algorithms: design, analysis and applications
- Enforcing consensus while monitoring the environment in wireless sensor networks
- Asymptotic optimality of running consensus in testing binary hypotheses
- Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs
- Distributed detection via Gaussian running consensus: Large deviations asymptotic analysis
- Distributed detection over noisy networks: Large deviations analysis
- Large deviations performance of consensus+innovations distributed detection with non-Gaussian observations
- Diffusion strategies for adaptation and learning over networks
- Adaptation, learning, and optimization over networks
- Distributed detection over adaptive networks using diffusion adaptation
- Adaptive networks
- On the learning behavior of adaptive networks-Part I: Transient analysis
- On the learning behavior of adaptive networks-Part II: Performance analysis
- Excess-risk of distributed stochastic learners
- Diffusion-based adaptive distributed detection: Steady-state performance in the slow adaptation regime
- Distributed detection over adaptive networks: Refined asymptotics and the role of connectivity
- Detection under one-bit messaging over adaptive networks
- Adaptation and learning in multi-task decision systems
- Decision learning and adaptation over multi-task networks
- Multitask diffusion adaptation over networks with common latent representations
- Adaptive Filters
- Detection of Abrupt Changes: Theory and Application
- Sequential Analysis: Hypothesis Testing and Changepoint Detection
- Quickest detection procedures and transient signal detection
- A detection optimal min-max test for transient signals
- A variable threshold Page procedure for detection of transient signals
- Some methods to evaluate the performance of Page's test as used to detect transient signals
- Consistent online Gaussian process regression without the sample complexity bottleneck
- Quickest Detection
- An Introduction to Signal Detection and Estimation
- Elements of Information Theory
- Optimum Character of the Sequential Probability Ratio Test
- Cooperative sensing via sequential detection
- A Robust Version of the Probability Ratio Test
- Robust techniques for signal processing: A survey
- Sequential Analysis: Tests and Confidence Intervals
- Certain asymptotic results for random walks in a strip (Siberian Branch, Academy of Sciences of the USSR)
- Continuous inspection schemes
- A simple method for studying run-length distributions of exponentially weighted moving average charts
- Quickest detection and forecast of pandemic outbreaks: Analysis of COVID-19 waves
- Adaptive Bayesian learning and forecasting of epidemic evolution-Data analysis of the COVID-19 outbreak
- Quickest detection of COVID-19 pandemic onset
- Decision support for the quickest detection of critical COVID-19 phases
- A contribution to the mathematical theory of epidemics
- A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis
- Testing Statistical Hypotheses
- Fundamentals of Statistical Signal Processing
- Stochastic Processes
- Mathematical Analysis I
- Integral mean value method for solving Fredholm integral equations of the second kind on the half line