key: cord-0198917-3koaptwz authors: Vovk, Vladimir title: Testing for concept shift online date: 2020-12-28 journal: nan DOI: nan sha: 4efb5a9a0769b1fe4849014f641bac8e55828f81 doc_id: 198917 cord_uid: 3koaptwz This note continues the study of exchangeability martingales, i.e., processes that are martingales under any exchangeable distribution for the observations. Such processes can be used for detecting violations of the IID assumption, which is commonly made in machine learning. Violations of the IID assumption are sometimes referred to as dataset shift, and dataset shift is sometimes subdivided into concept shift, covariate shift, etc. Our primary interest is in concept shift, but we will also discuss exchangeability martingales that decompose perfectly into two components, one of which detects concept shift and the other of which detects what we call label shift. Our methods will be based on techniques of conformal prediction. The most standard way of testing statistical hypotheses is batch testing: we try to reject a given null hypothesis based on a batch of data. The alternative approach of online testing (see, e.g., [10] or [9]) consists in constructing a nonnegative process that is a martingale under the null hypothesis. The ratio of the current value of such a process to its initial value can be interpreted as the amount of evidence found against the null hypothesis. The standard assumption in machine learning is the (general) IID assumption, sometimes referred to (especially in the older literature) as the assumption of randomness: the observations are assumed to be independent and identically distributed, but nothing is assumed about the probability measure generating a single observation. Interestingly, there exist processes, exchangeability martingales, that are martingales under the IID assumption; they can be constructed (see, e.g., [14, Section 7.1] or [13]) using the method of conformal prediction [14, Chapter 2].
Deviations from the IID assumption have become a popular topic of research in machine learning under the name of dataset shift [6, 7]; in my terminology I will mostly follow [6]. Analysing general dataset shift is usually regarded as too challenging a problem, and researchers concentrate on restricted versions, with restrictions imposed on marginal or conditional probabilities associated with the probability measure generating a single observation. Different restrictions are appropriate for different kinds of learning problems. In this note we consider problems of classification, in which random observations (X, Y) consist of objects X and labels Y, the latter taking a finite number of possible values. We will be interested in Y → X domains, in the terminology of [3], in which the objects are causally dependent on the labels. Under the IID assumption, the consecutive pairs (X, Y) have the same probability distribution P. There is a dataset shift if P in fact changes between observations. Let us say that there is a label shift if the marginal distribution P_Y of Y under P changes. Finally, there is a concept shift if the conditional distribution P_{X|Y} of X given Y changes. Later in this note we will adopt a wider understanding of label shift. As an example, suppose we are interested in the differential diagnosis between cold, flu, and Covid-19 given a set of symptoms. Under a pure label shift, the properties of the three diseases do not change (there is no concept shift), and only their prevalence changes, perhaps due to epidemics and pandemics. Under a concept shift, one or more of the diseases change, leading to different symptoms. Examples are new variants of Covid-19 and new strains of flu that appear every year. In general, exchangeability martingales may detect both label shift and concept shift. In some cases we might not be interested in label shift and only be interested in concept shift (or, perhaps less commonly, vice versa).
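To make the distinction concrete, here is a minimal NumPy sketch simulating an IID regime, a pure label shift, and a concept shift. The two Gaussian classes, their means, and the label probabilities are illustrative choices of mine, not taken from this note:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, label_probs, class_means):
    """Draw n observations (X, Y): Y ~ label_probs, X | Y = k ~ N(class_means[k], 1)."""
    y = rng.choice(len(label_probs), size=n, p=label_probs)
    x = rng.normal(loc=np.asarray(class_means)[y], scale=1.0)
    return x, y

# IID regime: the distribution P of (X, Y) is fixed.
x0, y0 = sample(1000, [0.5, 0.5], [0.0, 3.0])

# Label shift: the marginal P_Y changes; P_{X|Y} stays the same.
x1, y1 = sample(1000, [0.9, 0.1], [0.0, 3.0])

# Concept shift: P_{X|Y} changes (class 1 "moves"); P_Y stays the same.
x2, y2 = sample(1000, [0.5, 0.5], [0.0, 5.0])
```

In the medical example above, the label-shift regime corresponds to changing disease prevalence, while the concept-shift regime corresponds to a disease changing its symptoms.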
The goal of this note is to develop and start investigating exchangeability martingales targeting only concept shift. It would be ideal to decompose the amount of evidence found by an exchangeability martingale for dataset shift into two components, one reflecting the amount of evidence found for concept shift and the other reflecting the amount of evidence found for label shift. Such decomposable martingales are our secondary object of study. New exchangeability martingales and their simple theoretical properties will be the topic of Section 2, and in Section 3 they will be applied to the well-known USPS dataset. The preliminary results reported in the latter section suggest that the exchangeability martingales constructed for this dataset in [14, Section 7.1] are dominated (and greatly improved upon) by an exchangeability martingale decomposable into a product of an exchangeability martingale for detecting concept shift and an exchangeability martingale for detecting label shift. The most obvious application of exchangeability martingales is to help in deciding when to retrain predictors, as discussed in [13]. We should be particularly worried about the changes that invalidate ROC analysis, which is the case for concept shift in a Y → X domain [3, 15]. Our exchangeability martingales for concept shift are designed to detect such dangerous changes. In the context of conformal prediction, concept shift in Y → X domains requires retraining label-conditional predictors [14, Section 4.5]. For the connection between label-conditional predictors and ROC analysis, see [1, Section 2.7]. For a detailed review of conformal prediction see, e.g., [14], but in this section I will mainly follow [1, Chapters 1 and 2] (for the generation of conformal p-values) and [13] (for gambling against those p-values). As mentioned earlier, we consider observations z = (x, y) that consist of two components, the object x and the label y.
Let X be the measurable space of all possible objects, and Y be the set of all possible labels. Set Z := X × Y; this is our observation space. We are interested in classification and so always assume |Y| < ∞; Y is always equipped with the discrete σ-algebra. A conformity measure A is a function that maps any finite sequence (z_1, . . . , z_n) ∈ Z^n of observations of any length n ∈ {1, 2, . . . } to a sequence (α_1, . . . , α_n) ∈ R^n of real numbers of the same length that is equivariant in the following sense: for any n ∈ {1, 2, . . . }, any permutation π : {1, . . . , n} → {1, . . . , n}, and any sequences (z_1, . . . , z_n) ∈ Z^n and (α_1, . . . , α_n) ∈ R^n,

(α_1, . . . , α_n) = A(z_1, . . . , z_n) =⇒ (α_π(1), . . . , α_π(n)) = A(z_π(1), . . . , z_π(n)). (1)

In our experiments in Section 3 we will only use conformity measures, but in theory we are also interested in the following generalization. A label-conditional conformity measure A is a function that maps any finite sequence (z_1, . . . , z_n) ∈ Z^n of observations of any length n ∈ {1, 2, . . . } to a sequence (α_1, . . . , α_n) ∈ R^n of real numbers of the same length that is label-conditionally equivariant: for any n ∈ {1, 2, . . . }, any permutation π : {1, . . . , n} → {1, . . . , n}, and any sequences (z_1, . . . , z_n) = ((x_1, y_1), . . . , (x_n, y_n)) ∈ Z^n and (α_1, . . . , α_n) ∈ R^n,

y_1 = y_π(1), . . . , y_n = y_π(n) and (α_1, . . . , α_n) = A(z_1, . . . , z_n) =⇒ (α_π(1), . . . , α_π(n)) = A(z_π(1), . . . , z_π(n)).

In other words, we only require (1) to hold for the permutations that leave the labels intact. The label-conditional conformal transducer associated with a label-conditional conformity measure A is the function p defined by

p(z_1, . . . , z_n, τ) := (|{i : y_i = y_n ∧ α_i < α_n}| + τ |{i : y_i = y_n ∧ α_i = α_n}|) / |{i : y_i = y_n}|, (2)

where i ranges over 1, . . . , n, z_i = (x_i, y_i) for all i ∈ {1, . . . , n}, the conformity scores are defined by

(α_1, . . . , α_n) := A(z_1, . . . , z_n), (3)

and τ ∈ [0, 1]. The values (2) will be referred to as p-values.
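The label-conditional p-value (2) ranks the last observation's conformity score only among observations carrying the same label. A small sketch; the function name and inputs are mine, not from the note:

```python
import numpy as np

def label_conditional_p_value(alphas, labels, tau):
    """p-value (2) for the last observation: compare its conformity score
    only with the scores of observations carrying the same label."""
    alphas = np.asarray(alphas, dtype=float)
    labels = np.asarray(labels)
    same = labels == labels[-1]                      # i with y_i = y_n (includes i = n)
    lt = int(np.sum(same & (alphas < alphas[-1])))   # strictly smaller same-label scores
    eq = int(np.sum(same & (alphas == alphas[-1])))  # ties, broken by tau
    return (lt + tau * eq) / int(np.sum(same))
```

For scores (1, 3, 2, 4) with labels (0, 1, 0, 0) and τ = 0.5, the same-label scores are 1, 2, 4, giving (2 + 0.5 · 1)/3.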
If the label-conditional conformity measure A is in fact a conformity measure, we will say that the label-conditional conformal transducer p associated with it is simple. Let Z_1, Z_2, . . . be a sequence of random observations, i.e., random elements whose domain is a fixed probability space with probability measure P and which take values in the observation space Z. Each random observation Z_n is a pair Z_n = (X_n, Y_n), where X_n is a random object and Y_n is a random label. Let us say that the random sequence of observations Z_1, Z_2, . . . is label-conditionally exchangeable if, for any n ∈ {1, 2, . . . }, any sequence (y_1, . . . , y_n) ∈ Y^n, any sequence of measurable sets E_1, . . . , E_n in X, and any permutation π : {1, . . . , n} → {1, . . . , n} such that y_π(i) = y_i for all i,

P(Y_1 = y_1, . . . , Y_n = y_n, X_1 ∈ E_1, . . . , X_n ∈ E_n) = P(Y_1 = y_1, . . . , Y_n = y_n, X_π(1) ∈ E_1, . . . , X_π(n) ∈ E_n).

This is an instance of de Finetti's [2] notion of partial exchangeability. The sequence Z_1, Z_2, . . . is exchangeable if, for any n ∈ {1, 2, . . . }, any sequence of measurable sets E_1, . . . , E_n in Z, and any permutation π : {1, . . . , n} → {1, . . . , n},

P(Z_1 ∈ E_1, . . . , Z_n ∈ E_n) = P(Z_π(1) ∈ E_1, . . . , Z_π(n) ∈ E_n).

Of course, exchangeability is a stronger property than label-conditional exchangeability.

Proposition 1. If the sequence of random observations Z_1, Z_2, . . . is label-conditionally exchangeable, (τ_1, τ_2, . . . ) is an independent (of the observations) sequence of independent random variables each distributed uniformly in [0, 1], and p is a label-conditional conformal transducer, the sequence of random p-values

P_n := p(Z_1, . . . , Z_n, τ_n), n = 1, 2, . . . , (4)

is distributed uniformly in [0, 1]^∞.

For a proof of Proposition 1, see [14, Section 8.7] (Proposition 1 is a special case of Theorem 8.1 in [14]). If Z is a measurable space, Z* stands for the set of all finite sequences of elements of Z (equipped with the natural σ-algebra). It includes the empty sequence □. A betting martingale is a measurable function F : [0, 1]* → [0, ∞] such that F(□) = 1 and, for any n ∈ {1, 2, . . . } and any (u_1, . . . , u_{n−1}) ∈ [0, 1]^{n−1},

F(u_1, . . . , u_{n−1}) = ∫_0^1 F(u_1, . . . , u_{n−1}, u) du. (5)

(The three unusual features of this definition are that betting martingales are required to be nonnegative, start from 1, and are allowed to take value ∞.) The test martingale associated with the betting martingale F and a sequence (P_1, P_2, . . .
) uniformly distributed in [0, 1]^∞ (the input p-values) is the sequence of random variables

S_n = F(P_1, . . . , P_n), n = 0, 1, . . . . (6)

The sequence (S_n)_{n=0,1,...} is a nonnegative martingale, in the usual sense of probability theory [12, Definition 7.1.1], in its own filtration F_n := σ(S_1, . . . , S_n) or in the filtration F_n := σ(P_1, . . . , P_n) generated by the input p-values. Intuitively, this martingale describes the evolution of the capital of a player who gambles against the hypothesis that the input p-values are distributed uniformly and independently. In this note we will be interested in three classes of martingales. The label-conditional conformal martingales are defined as the test martingales associated with any betting martingale F and a sequence (P_1, P_2, . . . ) defined by (4) (under the conditions of Proposition 1) as the input p-values. Label-conditional conformal martingales are the main topic of this note. They detect concept shift. As already mentioned, it was shown in [14, Section 7.1] that the USPS dataset is non-exchangeable, and in Section 3 we will explore the sources of this lack of exchangeability. Remark 1. It is important that our exchangeability martingales for detecting concept shift can be used in situations where the labels are so far from being IID that it would be unusual to talk about label shift. Discussions of label shift usually presuppose at least approximate independence of the labels. Suppose a sequence of hand-written characters x_1, x_2, . . . comes from a user writing a letter. The objects x_n are matrices of pixels, and the corresponding labels y_n take values in the set {a, b, . . . }. Different instances of the same character, say "a", may well be exchangeable among themselves (even conditionally on knowing the full text of the letter), whereas the text itself will be far from IID; for example, "q" will almost invariably be followed by "u" if the letter is in English.
For discussions of such partial exchangeability, see, e.g., [2], [8], and [14, Section 8.4]. In the rest of this section we will look for possible explanations of the difference between the amount of evidence found against concept shift and against exchangeability. We will see that in some situations the amount of evidence found against exchangeability decomposes into two components: • the amount of evidence found for concept shift; • the amount of evidence found for label shift. In these situations the second component can be said to explain the difference. A label conformity measure A is a conformity measure that satisfies, additionally, the following property: for any finite sequence (z_1, . . . , z_n) ∈ Z^n of observations of any length n ∈ {1, 2, . . . }, the sequence (α_1, . . . , α_n) := A(z_1, . . . , z_n) of conformity scores, and any i, j ∈ {1, . . . , n},

y_i = y_j =⇒ α_i = α_j, (7)

where y_i and y_j are the labels in z_i and z_j, respectively. In other words, it assigns conformity scores only to the labels rather than to the full observations. (Notice that the requirement of equivariance only ensures (7) with "z_i = z_j" in place of "y_i = y_j".) The conformal transducer associated with a conformity measure A outputs the p-values

p(z_1, . . . , z_n, τ) := (|{i : α_i < α_n}| + τ |{i : α_i = α_n}|) / n, (8)

where i ranges over {1, . . . , n}, α_1, . . . , α_n are defined by (3), and τ ∈ [0, 1]. We will say that p is a label conformal transducer if A is a label conformity measure. Our method of decomposing exchangeability martingales will be based on the following result (a version of Theorem 8.1 in [14]). Its proof is given in Appendix A.

Theorem 2. If the sequence of random observations Z_1, Z_2, . . . is exchangeable, (τ_1, τ_2, . . . ) and (τ′_1, τ′_2, . . . ) are independent (between themselves and of the observations) sequences distributed uniformly in [0, 1]^∞, p is a simple label-conditional conformal transducer, and p′ is a label conformal transducer, the interleaved sequence of random p-values P′_1, P_1, P′_2, P_2, . . . , where P_n := p(Z_1, . .
. . , Z_n, τ_n) and P′_n := p′(Z_1, . . . , Z_n, τ′_n), is distributed uniformly in [0, 1]^∞.

A conformal martingale is defined to be the test martingale associated (via (6), where F is a betting martingale) with a conformal transducer. If the underlying conformity measure is a label conformity measure, the conformal martingale will be called a label conformal martingale. We will say that a label-conditional conformal martingale is simple if its underlying label-conditional conformal transducer is simple. Having the stream of random p-values P′_1, P_1, P′_2, P_2, . . . produced as in Theorem 2, we can define two derivative exchangeability martingales: a label-conditional conformal martingale associated with P_1, P_2, . . . and a label conformal martingale associated with P′_1, P′_2, . . . . (There are no restrictions on the underlying betting martingales.)

Corollary 3. The product of a simple label-conditional conformal martingale and a label conformal martingale with independent randomizations (i.e., their sequences of random numbers τ) is an exchangeability martingale.

Such product exchangeability martingales decompose perfectly into components for detecting concept shift and label shift. For a short proof of this corollary, see Appendix B. The dataset used in our experiment is the well-known USPS dataset of handwritten digits [14, Appendix B.1], which is known to be non-exchangeable. The first results are shown in Figure 1. It plots n ∈ {0, . . . , 9298} vs the value S_n of a conformal martingale with initial value 1 after processing the first n observations. The values of S_n are given on the log (base 10) scale. The final value S_9298 exceeds 10^18. The martingale is randomized, but its trajectory does not depend much on the seed used in the random number generator (and this will be true for all other conformal martingales discussed in this note).
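Corollary 3 in code: gamble separately against the two p-value streams of Theorem 2 and multiply the capitals. The betting function f(p) = 2(1 − p), which integrates to 1 over [0, 1], is an illustrative choice of mine, as are the function names:

```python
import numpy as np

def capital(p_values):
    """Final capital after betting f(p) = 2 * (1 - p) against each p-value;
    f is a betting function because it integrates to 1 over [0, 1]."""
    return float(np.prod([2.0 * (1.0 - p) for p in p_values]))

def product_martingale(p_concept, p_label):
    """Product of a martingale gambling against the label-conditional
    p-values (concept shift) and one gambling against the label p-values
    (label shift); by Corollary 3 this is an exchangeability martingale."""
    return capital(p_concept) * capital(p_label)
```

The two factors can be reported separately, which is exactly the perfect decomposition discussed above: the first factor measures the evidence for concept shift and the second the evidence for label shift.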
The conformity measure used in [14, Figure 7.6] is of the nearest-neighbour type: namely, the conformity score α_i of the ith observation (x_i, y_i) in a sequence (x_1, y_1), . . . , (x_n, y_n) is defined as

α_i := min_{j : y_j ≠ y_i} ||x_j − x_i|| / min_{j ≠ i : y_j = y_i} ||x_j − x_i||, (9)

where ||·|| is the Euclidean norm. (Using the tangent distance in place of the Euclidean distance ||x − x′|| leads to similar results, for all experiments reported in this note, unlike the batch experiments in [13, Section 2].) The conformity score (9) is the ratio of the (nearest) distance to another class to the distance to the same class (excluding the current observation). This conformity measure will be referred to as the ratio conformity measure. Later we will also be interested in modifications of this conformity measure. The betting martingale used in all our experiments is the Sleepy Jumper, as described in [14, Section 7.1]. I will not repeat the definition here, and will only mention that it involves two parameters, R = 0.01 and J = 0.001. (Inevitably, there is some element of data snooping here, since these values were chosen because of their reasonable performance on the USPS dataset, but it is limited by the use of round figures.) Only these values of the parameters will be used in this note. Each of the four exchangeability martingales in Figure 1, apart from the product (the blue martingale), is determined by three components: • the underlying conformity measure, which is either (9) or one of its modifications; • the transducer, which is either the label-conditional conformal transducer (2) or the conformal transducer (8); feeding the conformity measure of the previous item into this transducer, we obtain a sequence of p-values; • the betting martingale F, which in this note is always the Sleepy Jumper; we feed the p-values resulting from the previous item into F, as per (6). The black martingale in the left panel of Figure 1 uses the conformity measure (9), the conformal transducer (8), and the Sleepy Jumper.
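The ratio conformity measure (9) can be sketched as follows; this is a naive O(n²) implementation, and the function name is mine:

```python
import numpy as np

def ratio_conformity_scores(X, y):
    """Score (9): nearest distance to an observation of another class
    divided by the nearest distance to another observation of the same class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)        # exclude each observation itself
    scores = np.empty(len(y))
    for i in range(len(y)):
        other = d[i, y != y[i]].min()  # nearest neighbour of another class
        same = d[i, y == y[i]].min()   # nearest neighbour of the same class
        scores[i] = other / same       # a one-member class gives same = inf, score 0
    return scores
```

An observation that is far from the other classes and close to its own class gets a large score, i.e., looks conforming.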
The black martingale may detect any deviations from exchangeability, but in this note we are particularly interested in concept shift. In our current context, concept shift means that, for some reason, the same digit (such as "0") starts looking different; perhaps people start writing digits differently, or the digits are scanned with different equipment. To detect concept shift, we use the same conformity measure (9), but feed it into the label-conditional conformal transducer (2); the resulting sequence of p-values is fed into the Sleepy Jumper, as usual. The resulting test martingale is shown in red in the left panel of Figure 1. Its final value, of the order of magnitude 10^10, is much less impressive than the final value of the black martingale, and the red martingale starts its climb towards its final value over the original test set (the last 2007 observations). There is, of course, another reason why exchangeability may be violated: we may have label shift. To detect it, we use the label conformity measure that assigns the conformity score

α_i := (1 / |{j : y_j = y_i}|) Σ_{j : y_j = y_i} β_j, (10)

where β_j is the ratio conformity score (9) of the jth observation, to the ith observation (x_i, y_i) in a sequence (x_1, y_1), . . . , (x_n, y_n). In other words, we average the conformity scores (9) over each class to ensure the requirement of invariance (7). The label conformal martingale obtained by applying the Sleepy Jumper to the p-values produced by the label conformal transducer (8) applied to the conformity scores (10) is shown as the green line in the left panel of Figure 1. It is interesting that, despite the invariance restriction, the final value of the green martingale is even greater than the final value of the black martingale. It looks as if the dataset shift can be explained by label shift alone. According to Corollary 3, the product of a label-conditional conformal martingale and a label conformal martingale is still an exchangeability martingale. The product is shown as the blue line in the left panel of Figure 1.
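The averaging step behind (10), replacing each observation's score by its class average so that (7) holds, can be sketched as follows (function name mine):

```python
import numpy as np

def label_conformity_scores(base_scores, y):
    """Label conformity measure sketch: replace each base conformity score
    (e.g., a ratio score (9)) by the average over its class, so that
    y_i = y_j implies alpha_i = alpha_j, as required by (7)."""
    base = np.asarray(base_scores, dtype=float)
    y = np.asarray(y)
    return np.array([base[y == label].mean() for label in y])
```

After averaging, the scores, and hence the p-values (8) computed from them, depend on the objects only through the class averages, so the resulting martingale is mainly sensitive to changes in the distribution of the labels.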
By construction, the blue martingale is perfectly decomposable. Its final value greatly exceeds the previous record for the USPS dataset (the final value achieved by the black martingale). Remark 2. Corollary 3 has an important condition, "with independent randomizations". It is ignored in this version of the note, where the seed of the random number generator is always set to 1. Corollary 3 remains applicable to a high degree of approximation since the dependence on the seed of the random number generator is weak. This somewhat cavalier approach is likely to change when the Python code for this note is rewritten to comply with the recent changes in NumPy random number generation [5] . The blue exchangeability martingale, on the one hand, almost dominates the black martingale over the USPS dataset (namely, it dominates after approximately 3000 observations) and, on the other hand, decomposes into a product of exchangeability martingales for detecting concept shift and for detecting label shift. Therefore, the red and green pair in the left panel of Figure 1 appears to be a significant improvement over the black martingale. For other conformity measures we will often obtain results that are qualitatively different. For example, squaring the denominator of (9) will lead to the right panel of Figure 1 . The performance of the exchangeability martingale for detection of concept shift greatly improves over the original test set, but the price to pay is deterioration in the performance of the exchangeability martingale for detection of label shift. A more radical modification of (9) is obtained by replacing the numerator of (9) by 1; let us call it the same-class conformity measure. The resulting exchangeability martingales (still using the Sleepy Jumper as the betting martingale) are shown in the left panel of Figure 2 . Detection of concept shift becomes even more successful, and detection of label shift suffers further. 
In the definition of the same-class conformity measure, it is tempting to replace the distance to the nearest neighbour of the same class (the denominator of (9)) by the distance to the nearest neighbour overall; after all, the nearest neighbour can be expected to be of the same class. This, however, leads to a slight deterioration in the final value of the red martingale (which is our primary interest); see the right panel of Figure 2, where this modification is referred to as the nearest-object conformity measure. Of course, we do not have to use the same conformity measure when combining the red and green martingales in Figures 1 and 2: Theorem 2 does not impose any conditions on the conformity measures giving rise to p and p′. This allows us to obtain much larger final values for a valid exchangeability martingale. For example, combining the green martingale in the left panel of Figure 1 and the red martingale in the left panel of Figure 2, we obtain an exchangeability martingale that turns 1 into what looks like about 10^45 (in fact, 5.16 × 10^43 in this experiment). We have seen that the existing methods of constructing exchangeability martingales can be adapted to detecting concept shift. Perfectly decomposable exchangeability martingales turned out to be surprisingly successful on the USPS dataset of handwritten digits. This note concentrated on concept shift in Y → X classification domains. It is clear, however, that the same methods are applicable, verbatim, when the observations z_i take values in any measurable space and the y_i are no longer labels but are defined as f(z_i) for a function f taking finitely many values. For example, y_i can be an important feature of the object in z_i that we do not wish to model, but we wish our analysis to be conditional on it (e.g., y_i ∈ {male, female} can be such a feature).

Appendix A: Proof of Theorem 2. It suffices to prove that, for any N, the sequence (P′_1, P_1, . . . , P′_N, P_N) is distributed uniformly in [0, 1]^{2N} (even conditionally on the multiset {Z_1, . . . , Z_N}). This can be demonstrated using the following backward argument.
Ignoring borderline effects, P′_N is uniformly distributed in [0, 1] (at least approximately). When Y_N is disclosed, P′_N will be settled. Given what we already know, the distribution of P_N will be uniform. When X_N is disclosed, P_N will be settled. Now the distribution of P′_{N−1} given what we already know is uniform, etc. For the formal proof, we will need the following σ-algebras. Let G_n, n = 0, . . . , N, be the σ-algebra

G_n := σ({Z_1, . . . , Z_n}, Z_{n+1}, τ_{n+1}, τ′_{n+1}, . . . , Z_N, τ_N, τ′_N)

generated by the multiset {Z_1, . . . , Z_n} and the other random elements listed in the parentheses. Let G′_n, n = 1, . . . , N, be the σ-algebra σ(G_n, Y_n, τ′_n) generated by G_n, the label Y_n of the nth observation, and the random number τ′_n. The following two lemmas (analogues of [14, Lemma 8.8]) say that

P′_N, P_N, P′_{N−1}, P_{N−1}, . . . , P′_1, P_1
G′_N, G_{N−1}, G′_{N−1}, G_{N−2}, . . . , G′_1, G_0

is a stochastic sequence essentially in the usual sense of probability theory [12, Section 7.1.2]: in the second row we have a finite filtration, and the random variables in the first row are measurable w.r. to the σ-algebras directly below them.

Lemma 4. For any trial n = 1, . . . , N, P′_n is G′_n-measurable.

Proof. The random multiset of conformity scores of Z_1, . . . , Z_n is G′_n-measurable, and so, according to the definition (8) and the invariance requirement (7), P′_n is G′_n-measurable.

Lemma 5. For any trial n = 1, . . . , N, P_n is G_{n−1}-measurable.

Proof. This follows from the definition (2) and our requirement that the label-conditional conformal transducer p should be simple.

We will also need the following analogues of [14, Lemma 8.7]. As in [14], E_F stands for the conditional expectation w.r. to a σ-algebra F.

Lemma 6. For any trial n = 1, . . . , N and any ε ∈ [0, 1], P_{G′_n}{P_n ≤ ε} = ε.

Lemma 7. For any trial n = 1, . . . , N and any ε ∈ [0, 1], P_{G_n}{P′_n ≤ ε} = ε.

Proof. Follow the proof of [14, Lemma 8.7].

Let us now prove the following double sequence of equalities:

P_{G′_n}{P_n ≤ ε_n, P′_{n−1} ≤ ε′_{n−1}, P_{n−1} ≤ ε_{n−1}, . . . , P′_1 ≤ ε′_1, P_1 ≤ ε_1} = ε_n ε′_{n−1} ε_{n−1} · · · ε′_1 ε_1 (11)

and

P_{G_n}{P′_n ≤ ε′_n, P_n ≤ ε_n, . . . , P′_1 ≤ ε′_1, P_1 ≤ ε_1} = ε′_n ε_n · · · ε′_1 ε_1. (12)
We will use induction, arranging these equalities into a single sequence: the equality for P_{G′_1}, the equality for P_{G_1}, the equality for P_{G′_2}, the equality for P_{G_2}, etc. The first of these equalities is a special case of Lemma 6. When proving any other of these equalities, we will assume that all the previous equalities are true. The equality for P_{G_n}, n ∈ {1, . . . , N}, follows from

P_{G_n}{P′_n ≤ ε′_n, P_n ≤ ε_n, . . . , P′_1 ≤ ε′_1, P_1 ≤ ε_1}
= E_{G_n}(E_{G′_n}(1_{P′_n ≤ ε′_n} 1_{P_n ≤ ε_n} · · · 1_{P′_1 ≤ ε′_1} 1_{P_1 ≤ ε_1}))
= E_{G_n}(1_{P′_n ≤ ε′_n} E_{G′_n}(1_{P_n ≤ ε_n} · · · 1_{P′_1 ≤ ε′_1} 1_{P_1 ≤ ε_1}))
= E_{G_n}(1_{P′_n ≤ ε′_n} ε_n ε′_{n−1} ε_{n−1} · · · ε′_1 ε_1)
= ε′_n ε_n · · · ε′_1 ε_1.

The first equality is just the tower property of conditional expectations. The second equality follows from Lemma 4. The third equality follows from the inductive assumption, namely (11). The last equality follows from Lemma 7. The equality for P_{G′_n}, n ∈ {2, . . . , N}, follows from

P_{G′_n}{P_n ≤ ε_n, P′_{n−1} ≤ ε′_{n−1}, P_{n−1} ≤ ε_{n−1}, . . . , P′_1 ≤ ε′_1, P_1 ≤ ε_1}
= E_{G′_n}(E_{G_{n−1}}(1_{P_n ≤ ε_n} 1_{P′_{n−1} ≤ ε′_{n−1}} 1_{P_{n−1} ≤ ε_{n−1}} · · · 1_{P′_1 ≤ ε′_1} 1_{P_1 ≤ ε_1}))
= E_{G′_n}(1_{P_n ≤ ε_n} E_{G_{n−1}}(1_{P′_{n−1} ≤ ε′_{n−1}} 1_{P_{n−1} ≤ ε_{n−1}} · · · 1_{P′_1 ≤ ε′_1} 1_{P_1 ≤ ε_1}))
= E_{G′_n}(1_{P_n ≤ ε_n} ε′_{n−1} ε_{n−1} · · · ε′_1 ε_1)
= ε_n ε′_{n−1} ε_{n−1} · · · ε′_1 ε_1.

Now the second equality follows from Lemma 5. The third equality follows from the inductive assumption, namely (12) with n − 1 in place of n. The last equality follows from Lemma 6. Plugging n := N into (12), we obtain

P{P′_1 ≤ ε′_1, P_1 ≤ ε_1, . . . , P′_N ≤ ε′_N, P_N ≤ ε_N} = ε′_1 ε_1 · · · ε′_N ε_N.

This implies the uniform distribution of (P′_1, P_1, . . . , P′_N, P_N) in [0, 1]^{2N} (see, e.g., [11, Lemma 2.2.3]).

Appendix B: Proof of Corollary 3. Let the simple label-conditional conformal martingale be S_n = F(P_1, . . . , P_n), n = 0, 1, . . .
, and the label conformal martingale be S′_n = F′(P′_1, . . . , P′_n), n = 0, 1, . . . , where F and F′ are betting martingales and P′_1, P_1, P′_2, P_2, . . . is a stream of p-values as in Theorem 2. Then, for n = 1, 2, . . . ,

E_{P′_1, P_1, . . . , P′_{n−1}, P_{n−1}}(S_n S′_n)
= E_{P′_1, P_1, . . . , P′_{n−1}, P_{n−1}}(E_{P′_1, P_1, . . . , P_{n−1}, P′_n}(S_n S′_n))
= E_{P′_1, P_1, . . . , P′_{n−1}, P_{n−1}}(S′_n E_{P′_1, P_1, . . . , P_{n−1}, P′_n}(S_n))
= E_{P′_1, P_1, . . . , P′_{n−1}, P_{n−1}}(S′_n S_{n−1})
= S_{n−1} E_{P′_1, P_1, . . . , P′_{n−1}, P_{n−1}}(S′_n)
= S_{n−1} S′_{n−1},

where each lower index for E signifies the conditioning σ-algebra (namely, the conditioning σ-algebra is generated by the listed random variables). The first equality is the tower property, and the second and fourth use the measurability of S′_n and S_{n−1} w.r. to the relevant conditioning σ-algebras. The third and last equalities follow from Theorem 2 combined with the defining property (5) of betting martingales: each p-value in the interleaved sequence is distributed uniformly in [0, 1] independently of the preceding ones.

Acknowledgements. This work has been supported by Amazon (project "Conformal martingales for change-point detection") and Stena Line.

References
[1] V. Balasubramanian, S.-S. Ho, and V. Vovk, editors. Conformal Prediction for Reliable Machine Learning: Theory, Adaptations, and Applications. Morgan Kaufmann, 2014.
[2] B. de Finetti. Sur la condition d'équivalence partielle. 1938.
[3] T. Fawcett and P. Flach. A response to Webb and Ting's On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58:33–38, 2005.
[4] R. C. Jeffrey, editor. Studies in Inductive Logic and Probability, volume 2. University of California Press, 1980.
[5] R. Kern. NEP 19: Random number generator policy. NumPy Enhancement Proposals, 2018.
[6] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera. A unifying view on dataset shift in classification. Pattern Recognition, 45:521–530, 2012.
[7] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, editors. Dataset Shift in Machine Learning. MIT Press, Cambridge, MA, 2009.
[8] D. Ryabko. Pattern recognition for conditionally independent data. Journal of Machine Learning Research, 7:645–664, 2006.
[9] G. Shafer. The language of betting as a strategy for statistical and scientific communication. Journal of the Royal Statistical Society A, 2021.
[10] G. Shafer and V. Vovk. Game-Theoretic Foundations for Probability and Finance. Wiley, Hoboken, NJ, 2019.
[12] A. N. Shiryaev. Probability-2. Springer, New York, 2019.
[13] V. Vovk. Testing randomness online. Statistical Science, 2021.
[14] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
[15] G. I. Webb and K. M. Ting. On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58:25–32, 2005.