Meta Two-Sample Testing: Learning Kernels for Testing with Limited Data
Feng Liu, Wenkai Xu, Jie Lu, Danica J. Sutherland
2021-06-14

Abstract. Modern kernel-based two-sample tests have shown great success in distinguishing complex, high-dimensional distributions with appropriate learned kernels. Previous work has demonstrated that this kernel learning procedure succeeds, assuming a considerable number of observed samples from each distribution. In realistic scenarios with very limited numbers of data samples, however, it can be challenging to identify a kernel powerful enough to distinguish complex distributions. We address this issue by introducing the problem of meta two-sample testing (M2ST), which aims to exploit (abundant) auxiliary data on related tasks to find an algorithm that can quickly identify a powerful test on new target tasks. We propose two specific algorithms for this task: a generic scheme which improves over baselines, and a more tailored approach which performs even better. We provide both theoretical justification and empirical evidence that our proposed meta-testing schemes outperform learning kernel-based tests directly from scarce observations, and identify when such schemes will be successful.

Two-sample tests ask, "given samples from each, are these two populations the same?" For instance, one might wish to know whether a treatment and control group differ. With very low-dimensional data and/or strong parametric assumptions, methods such as t-tests or Kolmogorov-Smirnov tests are widespread. Recent work in statistics and machine learning has sought tests that cover situations not well-handled by these classic methods [1-18], providing tools useful in machine learning for domain adaptation, causal discovery, generative modeling, fairness, adversarial learning, and more [19-34]. Perhaps the most powerful known widely-applicable scheme is based on a kernel method known as the maximum mean discrepancy (MMD) [1] (or, equivalently [35], the energy distance [3]), when one learns an appropriate kernel for the task at hand [10, 16]. Here, one divides the observed data into "training" and "testing" splits, identifies a kernel on the training data by maximizing a power criterion $\hat J$, then runs an MMD test on the testing data (as illustrated in Figure 1a). This method generally works very well when enough data is available for both training and testing.

In real-world scenarios, however, two-sample testing tasks can be challenging if we do not have very many data observations. For example, in medical imaging, we might face two small datasets of lung computed tomography (CT) scans of patients with coronavirus diseases, and wish to know if these patients are affected in different ways. If they are from different distributions, the virus causing the disease may have mutated. Here, previous tests are likely to be relatively ineffective; we cannot learn a powerful kernel to distinguish such complex distributions with only a few observations.

Figure 1: Comparison among (a) traditional kernel learning [10, 16], (b) meta kernel learning, and (c) meta multi-kernel learning for kernel two-sample testing, where $\bar k$ or $k_i$ are the learned kernels.
In this paper, we address this issue by considering a problem setting where related testing tasks are available. We use those related tasks to identify a kernel selection algorithm. Specifically, instead of using a fixed algorithm $A$ to learn a kernel ("maximize $\hat J$ among this class of deep kernels"), we want to learn an algorithm $A_\theta$ from auxiliary data (Figure 1b):

    $\max_\theta \; \mathbb{E}_{(P, Q) \sim \tau} \, \mathbb{E}_{S_P, S_Q} \, \hat J\big(S_P^{te}, S_Q^{te};\, A_\theta(S_P^{tr}, S_Q^{tr})\big).$    (1)

Here $P, Q$ are distributions sampled from a meta-distribution of related tasks $\tau$, which asks us to distinguish $P$ from $Q$. The corresponding observed sample sets $S_P, S_Q$ are split into training (tr) and testing (te) components. $A_\theta$ is a kernel selection algorithm which, given the two training sets (also called "support sets" in meta-learning parlance), returns a kernel function; $\hat J$ finally estimates the power of that kernel using the test sets ("query sets"). In analogy with meta-learning [36-42], we call this learning procedure meta kernel learning (Meta-KL). We can then apply $A_\theta$ to select a kernel on our actual testing task, then finally run an MMD test as before (Figure 1b).

The adaptation performed by $A_\theta$, however, might still be very difficult to achieve with few training observations; even the best $A_\theta$ found by a generic adaptation scheme might overfit to $S_P^{tr}, S_Q^{tr}$. For more stable procedures and, in our experiments, more powerful final tests, we propose meta multi-kernel learning (Meta-MKL). This algorithm independently finds the most powerful kernel for each related task; at adaptation time, we select a convex combination of those kernels for testing (Figure 1c), as in standard multiple kernel learning [43] and similarly to ensemble methods in few-shot classification [44, 45]. Because we are only learning a small number of weights rather than all of the parameters of a deep network, this adaptation can quickly find a high-quality kernel. We provide both theoretical and empirical evidence that Meta-MKL is better than generic Meta-KL, and that both outperform approaches that do not use related-task data, in low-data regimes. We find that learned algorithms can output kernels with high test power using only a few samples, where "plain" kernel learning techniques entirely fail.

We will now review the setting of learning a kernel for a two-sample test, following [16]. Let $\mathcal{X} \subseteq \mathbb{R}^d$, and let $P, Q$ be (unknown) Borel probability measures on $\mathcal{X}$, with $S_P = \{x_i\}_{i=1}^m \sim P^m$ and $S_Q = \{y_j\}_{j=1}^m \sim Q^m$ observed i.i.d. samples from these distributions. We operate in a classical hypothesis testing setup, with the null hypothesis that $P = Q$.

Maximum mean discrepancy (MMD). The basic tool we use is a kernel-based distance metric between distributions called the MMD, defined as follows. (The energy distance [3] is a special case of the MMD for a particular choice of $k$ [35].)

Definition 1 (MMD [1]). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be the bounded kernel of an RKHS $\mathcal{H}_k$ (i.e., $\sup_{x,y \in \mathcal{X}} |k(x, y)| < \infty$). Letting $X, X' \sim P$ and $Y, Y' \sim Q$ be independent random variables,

    $\mathrm{MMD}^2(P, Q; k) = \mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\,\mathbb{E}[k(X, Y)].$    (2)

If $k$ is characteristic, we have that $\mathrm{MMD}(P, Q; k) = 0$ if and only if $P = Q$. We can estimate the MMD using the following $U$-statistic estimator, which is unbiased for $\mathrm{MMD}^2$ (denoted $\widehat{\mathrm{MMD}}{}^2_u$) and has nearly minimal variance among unbiased estimators [1]:

    $\widehat{\mathrm{MMD}}{}^2_u(S_P, S_Q; k) = \frac{1}{m(m-1)} \sum_{i \ne j} H_{ij}, \qquad H_{ij} := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i).$    (3)

Testing. Under the null hypothesis $H_0$, $m\,\widehat{\mathrm{MMD}}{}^2_u$ converges in distribution as $m \to \infty$ to some distribution depending on $P$ and $k$ [1, Theorem 12]. We can thus build a test with p-value equal to the quantile of our test statistic $m\,\widehat{\mathrm{MMD}}{}^2_u$ under this distribution.
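To make the estimator concrete, here is a minimal NumPy sketch of the $U$-statistic in (3). The Gaussian kernel and its bandwidth are illustrative stand-ins for whatever kernel is being evaluated; the function names are ours.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2_u(X, Y, kernel=gaussian_kernel):
    """Unbiased U-statistic estimator of MMD^2, as in (3); assumes |X| == |Y| == m."""
    m = X.shape[0]
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    # H_ij = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i); drop i = j terms
    H = Kxx + Kyy - Kxy - Kxy.T
    np.fill_diagonal(H, 0.0)
    return H.sum() / (m * (m - 1))
```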
Although there are several methods to estimate this null distribution, it is usually considered best [10] to use a permutation test [46, 47]: under $H_0$, samples from $P$ and $Q$ are interchangeable, and repeatedly re-computing the statistic with samples randomly shuffled between $S_P$ and $S_Q$ estimates its null distribution.

Test power. We generally want to find tests likely to reject $H_0$ when in fact $P \ne Q$; the probability of doing so (for a particular $P$, $Q$, $k$, and $m$) is called the power. For reasonably large $m$, Sutherland et al. [10] and Liu et al. [16] argue that the power is an almost-monotonic function of

    $J(P, Q; k) := \frac{\mathrm{MMD}^2(P, Q; k)}{\sigma_{H_1}(P, Q; k)}.$    (4)

Here, $\sigma^2_{H_1}$ is the asymptotic variance of $\sqrt{m}\,\widehat{\mathrm{MMD}}{}^2_u$ under $H_1$; it is defined in terms of an expectation of the terms $H_{ij}$ of (3) with respect to the data samples $S_P, S_Q$, for $i, j, \ell$ distinct. The criterion (4) depends on the unknown distributions; we can estimate it from samples with the regularized estimator [16]

    $\hat J_\lambda(S_P, S_Q; k) := \frac{\widehat{\mathrm{MMD}}{}^2_u(S_P, S_Q; k)}{\sqrt{\hat\sigma^2_{H_1}(S_P, S_Q; k) + \lambda}}.$    (5)

Kernel choice. Given two samples $S_P$ and $S_Q$, the best kernel is (essentially) the one that maximizes $J$ in (4). If we pick a kernel to maximize our estimate $\hat J$ using the same data that we use for testing, though, we will "overfit," and reject $H_0$ far too often. Instead, we use data splitting [2, 5, 10]: we partition the samples into two disjoint sets, $S_P = S_P^{tr} \cup S_P^{te}$ (and likewise for $S_Q$), obtain $k^{tr} = A(S_P^{tr}, S_Q^{tr}) \approx \arg\max_k \hat J_\lambda(S_P^{tr}, S_Q^{tr}; k)$, then conduct a permutation test based on $\widehat{\mathrm{MMD}}(S_P^{te}, S_Q^{te}; k^{tr})$. This process is summarized in Algorithm 2 and illustrated in Figure 1a. This procedure has been successfully used not only to, e.g., pick the best bandwidth for a simple Gaussian kernel, but even to learn all the parameters of a kernel like (8), which incorporates a deep network architecture [10, 16]. As argued by Liu et al. [16], classifier two-sample tests [8, 48] (which test based on the accuracy of a classifier distinguishing $P$ from $Q$) are essentially a special case of this framework, and more-general deep kernel MMD tests tend to work better. Although presented here specifically for $\widehat{\mathrm{MMD}}{}_u$, an analogous procedure has been used for many other problems, including other estimates of the MMD and closely-related quantities [2, 5, 49].

When data splitting, the training split must be big enough to identify a good kernel; with too few training samples $m^{tr}$, $\hat J_\lambda$ will be a poor estimator, and the kernel will overfit. The testing split, however, must also be big enough: for a given $P$, $Q$, and $k$, it becomes much easier to be confident that $\mathrm{MMD}(P, Q; k) > 0$ as $m^{te}$ grows and the variance of $\widehat{\mathrm{MMD}}{}_u(P, Q; k)$ accordingly decreases. When the number of available samples is small, both steps suffer. This work seeks methods where, by using related testing tasks, we can identify a good kernel with an extremely small $m^{tr}$; thus we can reserve most of the available samples for testing, and overall achieve a more powerful test.

Another class of techniques for kernel selection, which avoids the need for data splitting, is based on selective inference [15]. At least as studied by Kübler et al. [15], however, it is currently available only for restricted classes of kernels and with far-less-accurate "streaming" estimates of the MMD, which for fixed kernels can yield far less powerful tests than $\widehat{\mathrm{MMD}}{}_u$ [50]. In Section 5.4, we will demonstrate that in our settings, the data-splitting approach is empirically much more powerful.
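Both the permutation test and the criterion (5) can be sketched compactly, reusing `mmd2_u` and `gaussian_kernel` from the sketch above. The permutation count, $\lambda$ value, and variance estimator below follow Liu et al. [16] in spirit, but these exact settings and names are our own illustrative choices.

```python
import numpy as np

def permutation_test(X, Y, statistic, n_perms=200, alpha=0.05, rng=None):
    """Estimate the null distribution of `statistic` by re-assigning the pooled
    samples to two groups at random: under H0 the samples are exchangeable."""
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    observed = statistic(X, Y)
    pooled = np.concatenate([X, Y], axis=0)
    null_stats = []
    for _ in range(n_perms):
        perm = rng.permutation(len(pooled))
        null_stats.append(statistic(pooled[perm[:m]], pooled[perm[m:]]))
    p_value = (1 + np.sum(np.array(null_stats) >= observed)) / (1 + n_perms)
    return p_value, p_value <= alpha  # (p-value, reject H0?)

def j_lambda(X, Y, kernel, lam=1e-8):
    """Regularized power criterion (5): MMD^2_u / sqrt(variance + lam).
    The plug-in variance estimator follows Liu et al. [16]."""
    m = X.shape[0]
    H = kernel(X, X) + kernel(Y, Y) - kernel(X, Y) - kernel(X, Y).T
    np.fill_diagonal(H, 0.0)
    mmd2 = H.sum() / (m * (m - 1))
    var = 4 / m**3 * np.sum(H.sum(axis=1) ** 2) - 4 / m**4 * H.sum() ** 2
    return mmd2 / np.sqrt(var + lam)
```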
To handle cases with low numbers of available data samples, we consider a problem setting where related testing tasks are available, in a framework inspired by meta-learning [e.g. 36]: we use those related tasks to identify a kernel selection algorithm, as in (1). Specifically, we define a task as a pair $T = (P, Q)$ of distributions over $\mathcal{X}$ that we would like to distinguish, and assume a meta-distribution $\tau$ over the space of tasks $\mathcal{T}$.

Definition 2 (M2ST). Assume we are assigned (unobserved) training tasks $\{(P_i, Q_i)\}_{i=1}^N$ drawn from a task distribution $\tau$, and observe meta-samples $S = \{(S_{P_i}, S_{Q_i})\}_{i=1}^N$ with $S_{P_i} \sim P_i^{n_i}$ and $S_{Q_i} \sim Q_i^{n_i}$. Our goal is to use these meta-samples to find a kernel learning algorithm $A_\theta$ such that, for a target task $(P', Q') \sim \tau$ with samples $S_{P'} \sim (P')^n$ and $S_{Q'} \sim (Q')^n$, the learning algorithm returns a kernel $A_\theta(S_{P'}, S_{Q'})$ which will achieve high test power on $(P', Q')$. We measure the performance of $A_\theta$ based on the expected test power criterion for a target task:

    $\mathbb{E}_{(P', Q') \sim \tau}\, \mathbb{E}_{S_{P'}, S_{Q'}}\; J\big(P', Q';\, A_\theta(S_{P'}, S_{Q'})\big).$    (7)

If $\tau$ were in some sense "uniform over all conceivable tasks," then a no-free-lunch property would cause M2ST to be hopeless. Instead, our assumption is that tasks from $\tau$ are "related" enough that we can make progress at improving (7). Assuming the existence of a meta-distribution $\tau$ over tasks makes it possible to learn a general rule across different tasks, and then to quickly adapt that rule into a solution for a specific task [36]; this fast adaptation is the central promise of meta-learning. Specifically, in meta two-sample testing, we hope to learn "what differences between distributions generally look like" from the meta-tasks, and then, at test time, use our very limited data to search for differences in that more constrained set of options. Because the space of candidate rules is more limited, we can (hopefully) find a good rule with a much smaller number of data points. For example, if we have meta-tasks which (like the target task) are determined only by a difference in means, then we want to learn a general rule that distinguishes between two samples by their means. For a new task (i.e., a new pair of samples), we then hope to identify dimensions where the two samples have different means, using only a few data points.

We will propose two approaches to finding an $A_\theta$. Neither is specific to any particular kernel parameterization, but for the sake of concreteness, we follow Liu et al. [16] in choosing the form

    $k_\omega(x, y) = \big[(1 - \epsilon)\,\kappa(\phi(x), \phi(y)) + \epsilon\big]\, q(x, y),$    (8)

where $\phi$ is a deep neural network which extracts features from the samples, $\kappa$ is a simple kernel (e.g., a Gaussian) on those features, and $q$ is a simple characteristic kernel (e.g., Gaussian) on the input space; $\epsilon \in (0, 1]$ ensures that every kernel of the form $k_\omega$ is characteristic. Here, $\omega$ represents all parameters of the kernel: most come from the deep network $\phi$, but $\kappa$ and $q$ may have a few more parameters (e.g., lengthscales), and we can also learn $\epsilon$.
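A minimal PyTorch sketch of a deep kernel of the form (8) follows. The network depth, hidden width, and the sigmoid parameterization of $\epsilon$ (which constrains it to $(0, 1)$ rather than $(0, 1]$) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DeepKernel(nn.Module):
    """Deep kernel (8): k_w(x, y) = [(1 - eps) * kappa(phi(x), phi(y)) + eps] * q(x, y),
    with Gaussian kappa and q; layer sizes here are illustrative."""
    def __init__(self, d, hidden=32):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(d, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        # log-parameterize lengthscales so they stay positive; eps via sigmoid
        self.log_ls_kappa = nn.Parameter(torch.zeros(1))
        self.log_ls_q = nn.Parameter(torch.zeros(1))
        self.eps_raw = nn.Parameter(torch.zeros(1))

    @staticmethod
    def _gauss(A, B, lengthscale):
        sq = torch.cdist(A, B) ** 2
        return torch.exp(-sq / (2 * lengthscale ** 2))

    def forward(self, X, Y):
        eps = torch.sigmoid(self.eps_raw)  # learnable mixing weight in (0, 1)
        kappa = self._gauss(self.phi(X), self.phi(Y), self.log_ls_kappa.exp())
        q = self._gauss(X, Y, self.log_ls_q.exp())
        return ((1 - eps) * kappa + eps) * q
```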
Meta-KL. We first propose Algorithm 1 as a standard approach to optimizing (7), à la MAML [36]: $A_\theta$ takes a small, fixed number of gradient ascent steps on $\hat J_\lambda$ over the parameters of $k_\omega$, starting from a learned initialization point $\omega_{start} \in \theta$ (lines 1-2). We differentiate through $A_\theta$, and perform stochastic gradient ascent to find a good value of $\theta$ based on the meta-training sets (lines 3-6, also illustrated in Figure 2).

Algorithm 1: Meta Kernel Learning (Meta-KL)
Input: meta-samples $S = \{(S_{P_i}, S_{Q_i})\}_{i=1}^N$; kernel architecture (8) and parameters $\omega_0$; regularization $\lambda$
1: Initialize algorithm parameters: $\theta := [\omega_{start}] \leftarrow [\omega_0]$
2: Define a parameterized learning algorithm $A_\theta(S_P, S_Q)$ as: $\omega \leftarrow \omega_{start}$; for $t = 1, \dots, n_{steps}$ do $\omega \leftarrow \omega + \eta \nabla_\omega \hat J_\lambda(S_P, S_Q; k_\omega)$; end for; return $k_\omega$
3: for $T = 1, 2, \dots, T_{max}$ do
4:    Sample $I$, a set of indices in $\{1, 2, \dots, N\}$ of size $n_{batch}$
5:    for $i \in I$ do: split the data as $S_{P_i} = S_{P_i}^{tr} \cup S_{P_i}^{te}$ and $S_{Q_i} = S_{Q_i}^{tr} \cup S_{Q_i}^{te}$; end for
6:    Meta-update: take a stochastic gradient ascent step on $\theta$ for $\sum_{i \in I} \hat J_\lambda\big(S_{P_i}^{te}, S_{Q_i}^{te};\, A_\theta(S_{P_i}^{tr}, S_{Q_i}^{tr})\big)$
7: end for
Output: the learned algorithm $A_\theta$

Algorithm 2: Testing with a Kernel Learner
Input: two samples $S_P, S_Q$; a kernel learning algorithm $A$; significance level $\alpha$
1: Split the data as $S_P = S_P^{tr} \cup S_P^{te}$ and $S_Q = S_Q^{tr} \cup S_Q^{te}$
2: Learn a kernel $k \leftarrow A(S_P^{tr}, S_Q^{tr})$
3: Run a permutation test at level $\alpha$ with the statistic $\widehat{\mathrm{MMD}}{}^2_u(S_P^{te}, S_Q^{te}; k)$

Once we have learned a kernel selection procedure, we can apply it to a testing task with Algorithm 2. As we will see in the experiments, this approach does indeed use the meta-tasks to improve performance on target tasks. Differently from usual meta-learning settings, as in e.g. classification [36], however, here it is conceivable that there is a single good kernel that works for all tasks from $\tau$; improving on this single baseline kernel, rather than simply overfitting to the very few target points, may be quite difficult. Thus, in practice, the amount of adaptation that $A_\theta$ actually performs in its gradient ascent can be somewhat limited.
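Here is a condensed sketch of the Meta-KL loop of Algorithm 1. For brevity it uses a first-order approximation to the meta-gradient (the paper differentiates through $A_\theta$), and it assumes helpers that are not in the paper's code: `make_kernel` builds a kernel module such as the one sketched after (8), `sample_tasks` yields split meta-task batches, and `j_lambda(X, Y, kernel)` is a differentiable PyTorch analogue of (5). The learning rates and batch size mirror Appendix B.1.

```python
import copy
import torch

def meta_kl(make_kernel, sample_tasks, j_lambda, n_outer=1000, n_inner=5,
            inner_lr=0.8, meta_lr=0.01, n_batch=10):
    """First-order sketch of Meta-KL (Algorithm 1): learn an initialization
    omega_start such that a few gradient-ascent steps on J_lambda (the inner
    loop A_theta) yield a powerful kernel on the query split."""
    kernel = make_kernel()                      # parameters play the role of omega_start
    meta_opt = torch.optim.Adam(kernel.parameters(), lr=meta_lr)
    for _ in range(n_outer):
        meta_opt.zero_grad()
        for X_tr, Y_tr, X_te, Y_te in sample_tasks(n_batch):
            adapted = copy.deepcopy(kernel)     # start the inner loop from omega_start
            opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
            for _ in range(n_inner):            # inner loop: gradient ascent on J_lambda
                opt.zero_grad()
                (-j_lambda(X_tr, Y_tr, adapted)).backward()
                opt.step()
            # outer objective: criterion of the adapted kernel on the query split
            loss = -j_lambda(X_te, Y_te, adapted)
            grads = torch.autograd.grad(loss, adapted.parameters())
            for p, g in zip(kernel.parameters(), grads):
                p.grad = g if p.grad is None else p.grad + g  # first-order approximation
        meta_opt.step()
    return kernel
```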
Meta-MKL. As an alternative, we also consider a different strategy for $A_\theta$ which may be able to adapt with many fewer data samples, albeit within a possibly weaker class of candidate kernels. Here, to select an $A_\theta$, we simply find the best kernel independently for each of the meta-training tasks. Then $A_\theta$ chooses the best convex combination of these kernels, as in classical multiple kernel learning [43] and similarly to ensemble methods in few-shot classification [45]. At adaptation time, we only attempt to learn $N$ weights, rather than adapting all of the parameters of a deep network; but, if the meta-training tasks contain some tasks similar to the target task, then we should be able to find a powerful test. This procedure is detailed in Algorithm 3 and illustrated in Figure 1c.

Algorithm 3: Meta Multi-Kernel Learning (Meta-MKL)
Input: meta-samples $S = \{(S_{P_i}, S_{Q_i})\}_{i=1}^N$; kernel architecture as in (8); regularization $\lambda$
1: for $i = 1, 2, \dots, N$ do: optimize $\omega_i \leftarrow \arg\max_{\omega} \hat J_\lambda(S_{P_i}, S_{Q_i}; k_\omega)$ with some approximate optimization algorithm; end for
2: Let $K$ be the set of convex combinations of the kernels $\{k_{\omega_1}, \dots, k_{\omega_N}\}$
3: return the algorithm $A(S_P, S_Q) = \arg\max_{k \in K} \hat J_\lambda(S_P, S_Q; k)$

We now analyze and compare the theoretical performance of directly optimizing the regularized test power from a small sample size with that of our proposed meta-training procedures. To study our learning objective of approximate test power, we first state the following technical assumptions from [16].

(A) The kernels $k_\omega$ are uniformly bounded: $\sup_{\omega \in \Omega} \sup_{x, y \in \mathcal{X}} k_\omega(x, y) \le \nu$. For the kernels we use in practice, $\nu = 1$.
(B) The possible kernel parameters $\omega$ lie in a Banach space of dimension $D$, and the set of possible kernel parameters $\Omega$ is bounded: $\|\omega\| \le R_\Omega$.
(C) The kernel parameterization is Lipschitz: for all $x, y \in \mathcal{X}$ and $\omega, \omega' \in \Omega$, $|k_\omega(x, y) - k_{\omega'}(x, y)| \le L_k \|\omega - \omega'\|$.

See Proposition 9 of Liu et al. [16] for bounds on these constants when using e.g. kernels of the form (8), in terms of the network architecture. For brevity, we will write $\sigma^2_\omega$ for $\sigma^2_{H_1}(P, Q; k_\omega)$ of (4), and analogously $\hat\sigma^2_\omega$.

Proposition 3 (Direct training with approximate test power; Theorem 6 of [16]). Under Assumptions (A) to (C), suppose $\nu \ge 1$ is constant, and let $\bar\Omega_s \subseteq \Omega$ be the set of $\omega$ for which $\sigma^2_\omega \ge s^2$. Take the regularized estimate $\hat\sigma^2_{\omega,\lambda} = \hat\sigma^2_\omega + \lambda$ with $\lambda = m^{-1/3}$. Then, with probability at least $1 - \delta$, $\sup_{\omega \in \bar\Omega_s} |\hat J_\lambda(S_P, S_Q; k_\omega) - J(P, Q; k_\omega)| \le \xi_m$, where $\xi_m = O(m^{-1/3})$ with constants depending on $\nu$, $s$, $D$, $R_\Omega$, $L_k$, and $\log(1/\delta)$.

Since $m$ is small in our settings, and $s$ may also be small for deep kernel classes, as noted by Liu et al. [16], this bound may not give satisfying results. The key mechanism that makes meta-testing work, intuitively, is training kernels on related tasks. How do we quantify the relatedness between different testing tasks?

Definition 4 (γ-relatedness). Let $(P, Q)$ and $(P', Q')$ be the underlying distributions for two different two-sample testing tasks. We say the two tasks are γ-related w.r.t. the learning objective $J$ if $\sup_{k \in K} |J(P, Q; k) - J(P', Q'; k)| \le \gamma$.

This relatedness measure is a (strong) assumption that two tasks are similar, because all kernels perform similarly on the two tasks in terms of the approximate test power objective. It also implies the two problems are of similar difficulty, since for small γ, the abilities of our MMD test statistics to distinguish the distributions (with optimal kernels) are similar.

Definition 5 (Adaptation with Meta-MKL). Given a set of base kernels $\{\hat k_1, \dots, \hat k_N\}$ learned on the meta-tasks, Meta-MKL adaptation returns $\hat k_{\hat\beta} = \sum_i \hat\beta_i \hat k_i$, where $\hat\beta \in \arg\max_{\beta \in B} \hat J_{\lambda_{ne}}(S_P, S_Q; k_\beta)$ and $k_\beta$ is as in (10) below. This adaptation step uses the same learning objective, $\hat J_\lambda$, as directly training a deep kernel in Proposition 3 (though with a potentially different regularization parameter, $\lambda_{ne}$).

To analyze the Meta-MKL scheme, we will make the following assumption, which Proposition 26 of Liu et al. [16] shows implies Assumptions (A) to (C).

(D) Let $\{k_1, \dots, k_N\}$ be a set of base kernels, each satisfying $\sup_{x \in \mathcal{X}} k_i(x, x) \le K$ for some finite constant $K$. Define the parameterized kernel $k_\beta$ as

    $k_\beta(x, y) = \sum_{i=1}^N \beta_i\, k_i(x, y),$    (10)

where $\beta \in \mathbb{R}^N$, and let $B$ be the set of parameters $\beta$ such that $k_\beta$ is positive semi-definite (guaranteed if each $\beta_i \ge 0$) and $\|\beta\| \le R_B$ for some $R_B < \infty$.

Theorem 6. Suppose we have $N$ meta-training tasks $\{(P_i, Q_i)\}_{i=1}^N$, with corresponding optimal kernels $k^*_i \in \arg\max_{k \in K} J(P_i, Q_i; k)$, and use $n$ samples to learn kernels $\hat k_i \in \arg\max_{k \in K} \hat J_\lambda(S_{P_i}, S_{Q_i}; k)$ in the setting of Proposition 3. Let $(P, Q)$ be a test task, with optimal kernel $k^* \in \arg\max_{k \in K} J(P, Q; k)$, from which we observe $m$ samples $S_P, S_Q$. Call the Meta-MKL adapted kernel $\hat k_{\hat\beta} = \sum_i \hat\beta_i \hat k_i$, as in (10), with $\hat\beta$ found subject to Assumption (D). Let $(P_j, Q_j)$ be a meta-task which is γ-related to $(P, Q)$. Then, with probability at least $1 - 2\delta$, the suboptimality $J(P, Q; k^*) - J(P, Q; \hat k_{\hat\beta})$ is bounded by a sum of terms of order $\gamma$, $\xi^j_n$, and $\xi_m$, where $\xi^j_n$ is the bound of Proposition 3 for learning a kernel on $(P_j, Q_j)$, and $\xi_m$ is the equivalent bound for multiple kernel learning on $(P, Q)$, which has $D = N$.

The $\xi^j_n$ term depends on the meta-training sample size $n \gg m$. With enough (relevant) meta-training tasks (as $N$ grows), γ is expected to go to 0. So the overall uniform convergence bound is likely to be dominated by the term $\xi_m$, giving an overall $m^{-1/3}$ rate: the same as Proposition 3 obtains for directly training a deep kernel only on $(P, Q)$. This is roughly to be expected; similar optimization objectives are applied for both learning and adaptation, and both are limited by the sample size $m$. However, the other components of $\xi_m$ are likely much smaller than the corresponding parts of $\xi^j_n$, where the kernels are defined by a deep network: there, the variance must be lower-bounded over a much larger set of kernels, $D$ is the number of parameters in the network rather than the number of meta-tasks, and the bound on $L_k$ from Proposition 23 of Liu et al. [16] is exponential in the depth of the network. Altogether, we expect MKL adaptation to be much more efficient than direct training.

We also expect that Theorem 6 is actually quite loose. The proof (in Appendix A) decomposes the loss relative to $\hat k_j$, picking just a single related kernel; it does not attempt to analyze how combining multiple kernels can improve the test power, because doing so in general seems difficult. Given this limitation, however, we also prove in Appendix A a bound for the adaptation scheme which explicitly picks only the single best kernel from the meta-tasks (Theorem 10); it has a similar form to Theorem 6, but with the $\xi_m$ term replaced by an even better one.
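To make the Meta-MKL adaptation step of Definition 5 concrete, here is a sketch that learns mixture weights $\beta$ over frozen base kernels by maximizing $\hat J_\lambda$ on the small target training split, as in (10). The softmax parameterization (which enforces a convex combination, one valid choice within the feasible set $B$) and the optimizer settings are illustrative; `j_lambda` is again an assumed differentiable analogue of (5).

```python
import torch

def meta_mkl_adapt(base_kernels, X_tr, Y_tr, j_lambda, n_steps=200, lr=0.01):
    """Learn convex-combination weights over frozen base kernels (eq. (10))."""
    for k in base_kernels:
        k.requires_grad_(False)                 # base kernels stay fixed
    raw = torch.zeros(len(base_kernels), requires_grad=True)
    opt = torch.optim.Adam([raw], lr=lr)

    def mixed_kernel(X, Y):
        beta = torch.softmax(raw, dim=0)        # beta_i >= 0, sums to 1
        return sum(b * k(X, Y) for b, k in zip(beta, base_kernels))

    for _ in range(n_steps):                    # maximize J_lambda on the tiny split
        opt.zero_grad()
        (-j_lambda(X_tr, Y_tr, mixed_kernel)).backward()
        opt.step()
    return mixed_kernel
```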
Following Liu et al. [16], we compare the following baseline tests with our methods: 1) MMD-D: MMD with a deep kernel whose parameters are optimized; 2) MMD-O: MMD with a Gaussian kernel whose lengthscale is optimized; 3) the mean embedding (ME) test [5, 51]; 4) the smooth characteristic functions (SCF) test [5, 51]; and 5) classifier two-sample tests, including C2ST-S [8] and C2ST-L as described in Liu et al. [16]. None of these methods use related tasks at all, so we additionally consider an aggregated kernel learning (AGT-KL) method, which optimizes a deep kernel of the form (8) by maximizing the value of $\hat J_\lambda$ averaged over all the related tasks in the meta-training set $S$.

For synthetic datasets, we take a single sample set for $S_P^{tr}$ and $S_Q^{tr}$, and learn parameters once for each method on that training set. We then evaluate its rejection rate (power when $P \ne Q$, or Type-I error when $P = Q$) using 100 new sample sets $S_P^{te}, S_Q^{te}$. For real datasets, we train on a subset of the available data, then evaluate on 100 random subsets, disjoint from the training set, of the remaining data. We repeat this full process 20 times for synthetic datasets and 10 times for real datasets, and report the mean rejection rate of each test along with its standard error. Implementation details are in Appendix B.1; the code is available at github.com/fengliu90/MetaTesting.

We use a bimodal Gaussian mixture dataset proposed by [16], known as high-dimensional Gaussian mixtures (HDGM): $P$ and $Q$ subtly differ in the covariance of a single dimension pair. Here we consider only $d = 2$, since with very few samples the problem is already extremely difficult. Our target task is $T = (P, Q(0.7))$, and meta-samples are drawn from the $N = 100$ meta-tasks $\mathcal{T} = \{T_i := (P, Q(0.3 + 0.1 \times i/N))\}_{i=1}^N$; note that the target task is well outside the scope of the training tasks. To evaluate all tests given limited data, we set the number of training samples ($S_P^{tr}, S_Q^{tr}$) to 50, 100, or 150 per mode, and the number of testing samples ($S_P^{te}, S_Q^{te}$) from 50 to 250.

Figure 3 illustrates the test power of all tests. Meta-MKL and Meta-KL are the clear winners, with both tests much better once $m^{te}$ exceeds 100 per mode. Previous kernel-learning based tests clearly perform poorly due to the limited training samples. Comparing Meta-MKL with Meta-KL, we obtain much higher power when using multiple trained kernels. Although AGT-KL performs better than the baselines, it cannot adapt to the target task very well: it only cares about "in-task" samples, rather than learning to adapt to new distributions. In Appendix B.2, we report the test power of our tests when increasing the number of tasks $N$ from 20 to 150; the results show that increasing the number of meta-tasks helps improve the test power on the target task.
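The rejection-rate protocol used throughout these experiments amounts to the following sketch; the `test` and `draw_samples` interfaces are our own assumptions.

```python
def rejection_rate(test, draw_samples, n_reps=100, alpha=0.05):
    """Estimate power (or Type-I error, if P = Q) as the fraction of
    independently drawn test sets on which the test rejects at level alpha.
    `test(X, Y)` returns a p-value; `draw_samples()` returns a fresh (X, Y)."""
    rejections = 0
    for _ in range(n_reps):
        X, Y = draw_samples()
        rejections += (test(X, Y) <= alpha)
    return rejections / n_reps
```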
We distinguish the standard datasets of CIFAR-10 and CIFAR-100 [52] from the attempted replication CIFAR-10.1 [53], similar to Liu et al. [16]. Because only a relatively small number of CIFAR-10.1 samples are available, it is of interest to see whether, by meta-training only on CIFAR-10's training set (as described in Appendix B), we can find a good test to distinguish CIFAR-10.1, with $m^{tr} \in \{100, 200\}$. Testing samples (i.e., $S_P^{te}$ and $S_Q^{te}$) are drawn from the test sets of each dataset. We report the test power of all tests with 200, 500, and 900 testing samples in Table 1 (CIFAR-10 compared to CIFAR-10.1) and Table 2 (CIFAR-100 compared to CIFAR-10.1). Since Liu et al. [16] have shown that CIFAR-10 and CIFAR-10.1 come from different distributions, higher test power is better in both tables. The results demonstrate that our methods have much higher test power than the baselines, which is strong evidence that leveraging samples from related tasks can boost test power significantly. Interestingly, C2ST tests almost entirely fail in this setting (as also seen by Recht et al. [53]): it is hard to learn useful information with only a few data points. In Appendix B.3, we also report results when meta-samples are generated from the training set of the CIFAR-100 dataset.

This subsection studies how the closeness between the related tasks and the target task affects the test power of our tests. Given the target task $T$ in the synthetic datasets, we define tasks $\mathcal{T}$ with closeness $C$ as $\mathcal{T}(C) = \{T_i := (P, Q((0.6 - C) + 0.1 \times i/N))\}_{i=1}^N$. It is clear that $\mathcal{T}(0)$ will contain our target task $T$ (i.e., the closeness is zero). We also estimate the γ-relatedness between the target task and $\mathcal{T}(C)$ (details in the appendix). In Figure 4, we illustrate the test power of our tests when setting the closeness $C$ to 0.1, 0.2 and 0.3, respectively. Meta-MKL and Meta-KL outperform AGT-KL in all the figures, meaning that Meta-MKL and Meta-KL actually learn algorithms that can quickly adapt to new tasks. Another phenomenon is that the gap between the test powers of the meta-based methods and AGT-KL gets smaller as the closeness decreases, which is expected, since AGT-KL has then seen more closely related tasks.

In previous sections, we mainly compared with previous kernel-learning tests and showed that the test power can be improved significantly by our proposed tests. We now show that each component of our tests is effective in improving the test power. First, we show that MKL helps improve the test power, and that the data splitting used in Meta-MKL is much better than using the recent test of Kübler et al. [15]. The comparison is made on the synthetic datasets studied in Section 5.1, and the results can be found in Table 3. Meta-MKL-A is a test that takes all $\beta_i = 1/N$ in Algorithm 3, so that kernels are weighted equally. AGT-MKL uses multiple kernels in AGT-KL (learning weights like Meta-MKL), and AGT-MKL-A does not learn the weights but just assigns weight $1/N$ directly to all base kernels. Meta-MKL-SI is a kernel two-sample test using the selective inference technique of Kübler et al. [15] rather than data splitting in its $A_\theta$. From Table 3, we can see that introducing the multiple kernel learning (MKL) scheme substantially improves test power, as it combines useful features learned from base kernels covering different aspects of the problems. Moreover, learning with the approximate test power and data splitting in the meta-setting also outperforms the non-splitting testing procedure Meta-MKL-SI, since Meta-MKL-SI requires a linear-time estimator of the MMD. The results also indicate that leveraging related tasks is important for improving the test power, even though we only need a small set of training samples.

Next, we show that the labels used for constructing meta-samples are useful on the CIFAR datasets. We consider another test here: MMD-D with all of CIFAR-10 (MMD-D w/ AC), which runs the MMD-D test using the same samples from CIFAR-10 as the meta-learning used over all tasks together. Compared to Meta-MKL, Meta-KL and AGT-KL, MMD-D w/ AC does not use the label information contained in the CIFAR-10 dataset. The test power of MMD-D w/ AC is shown in Table 4. MMD-D w/ AC clearly outperforms MMD-D and MMD-O, since it sees more CIFAR-10 data in the training process.
It is also clear that our methods still perform much better than MMD-D w/ AC. This result shows that the improvement of our tests does not come solely from seeing more data from CIFAR-10; instead, the assigned labels for the meta-tasks indeed help.

This paper proposes kernel-based non-parametric testing procedures to tackle practical two-sample problems where the sample size is small. By meta-training on related tasks, our work opens a new paradigm of applying learning-to-learn schemes to testing problems, and shows the potential for very accurate tests in some small-data regimes using our proposed algorithms. It is worth noting, however, that statistical tests are perhaps particularly ripe for mis-application, e.g. by over-interpreting small marginal differences between sample populations of people to claim "inherent" differences between large groups. Future work focusing on reliable notions of interpretability in these types of tests is critical. Meta-testing procedures, although they yield much better tests in our domains, may also introduce issues of their own: any rejection of the null hypothesis will be statistically valid, but they favor identifying differences similar to those seen before, and so may worsen gaps in performance between "well-represented" differences and rarer ones.

As a "warm-up," and because it is of independent interest, we will first study an adaptation algorithm which picks the single best kernel from the meta-tasks.

Definition 7 (Adaptation by choosing the one best kernel). With the set of base kernels $\{\hat k_1, \dots, \hat k_N\}$, $\hat k = \arg\max_{\hat k_i} \hat J_{\lambda_{ne}}(S_P^{tr}, S_Q^{tr}; \hat k_i)$ is said to be the best-kernel adaptation.

Proposition 3 shows uniform convergence of $\hat J_\lambda$ for direct adaptation of a kernel class, whether a deep kernel or multiple kernel learning. For our analysis of choosing the best single kernel, however, we only need uniform convergence over a finite set, where we can obtain a slightly better rate.

Lemma 8 (Generalization gap for choosing-one-best kernel adaptation). Let $\{k_i\}_{i=1}^N$ be a set of base kernels, whose power criteria on the corresponding distributions are $J_i = J(P, Q; k_i)$, and let $s^2 = \min_{i \in [N]} \sigma^2_{H_1}(P, Q; k_i)$. Denote the regularized estimates of these values as $\hat J_i = \hat J_\lambda(S_P, S_Q; k_i)$, where $|S_P| = |S_Q| = m$ and $\lambda = m^{-1/3}$. Then, with probability at least $1 - \delta$, $\max_{i \in [N]} |\hat J_i - J_i| \le \zeta_m$, where $\zeta_m = O(m^{-1/3})$ with constants depending on $\nu$, $s$, $N$, and $\log(1/\delta)$.

Proof. To bound $\max_{i \in [N]} |\hat J_i - J_i|$, we consider high-probability bounds for the concentration of $\hat\eta_\omega$ and $\hat\sigma^2_\omega$ using McDiarmid's inequality and a union bound over the $N$ kernels, as developed within the proofs of Propositions 15 and 16 of Liu et al. [16]; with probability at least $1 - \delta$, both concentration bounds hold simultaneously for all $i \in [N]$. Then, taking $\sigma^2_{i,\lambda} = \sigma^2_i + \lambda$, we decompose the worst-case generalization error into a term from the concentration of the MMD estimates and a term from the concentration of the variance estimates. Taking the upper bound on the kernel to be constant (in our case, $\nu = 1$) and the regularizer $\lambda = m^{-1/3}$ to achieve the best overall rate gives the claimed bound.

Since the adaptation step is based on $m$ samples from the actual testing task, our generalization result is derived for the sample size $m$. As explained in the main text, even though the sample size is still small, the adaptation result benefits from a much better-trained base kernel set, giving rise to a large $s$ compared to the corresponding quantity when directly training the deep kernel parameters from $m$ samples.

Given this building block, we proceed to state and prove the result for choosing-one-best kernel adaptation, Theorem 10.
Lemma 9. Let $(P, Q)$ and $(P_i, Q_i)$ be two testing tasks which are γ-related (Definition 4), and let $k^* \in \arg\max_{k \in K} J(P, Q; k)$ and $k^*_i \in \arg\max_{k \in K} J(P_i, Q_i; k)$. Then $J(P_i, Q_i; k^*_i) \ge J(P, Q; k^*) - \gamma$.

Proof. We know that $J(P_i, Q_i; k^*) \le J(P_i, Q_i; k^*_i)$ by the definition of $k^*_i$, and that $J(P_i, Q_i; k^*) \ge J(P, Q; k^*) - \gamma$ by γ-relatedness. Putting these together gives $J(P_i, Q_i; k^*_i) \ge J(P, Q; k^*) - \gamma$.

Theorem 10 (Adaptation by choosing one best base kernel). Suppose we have $N$ meta-training tasks $\{(P_i, Q_i)\}_{i=1}^N$, each with corresponding optimal kernel $k^*_i \in \arg\max_{k \in K} J(P_i, Q_i; k)$, and learn kernels $\hat k_i \in \arg\max_{k \in K} \hat J_\lambda(S_{P_i}, S_{Q_i}; k)$ based on $n$ samples in the setting of Proposition 3. Let $(P, Q)$ be a test task from which we observe $m$ samples $S_P, S_Q$. Let $j$ be the index of a task $(P_j, Q_j)$ which is γ-related to $(P, Q)$. Then, with probability at least $1 - 2\delta$, the suboptimality $J(P, Q; k^*) - J(P, Q; \hat k)$ is bounded by a sum of terms of order $\gamma$, $\xi^j_n$, and $\zeta_m$, where $\xi^j_n$ is the bound of Proposition 3 for $(P_j, Q_j)$, while $\zeta_m$ is the bound of Lemma 8 for $(P, Q)$.

Proof (of Theorem 6). We will assume that $(S_P, S_Q)$ satisfies the uniform convergence condition of Lemma 8, and $(S_{P_j}, S_{Q_j})$ that of Proposition 3, which happens with probability at least $1 - 2\delta$. Writing $J_\lambda$ for the population analogue of (5), we use the decomposition

    $J_\lambda(P, Q; k^*) - J_\lambda(P, Q; \hat k_{\hat\beta}) = \underbrace{J_\lambda(P, Q; k^*) - \hat J_\lambda(S_P, S_Q; \hat k_j)}_{(i)} + \underbrace{\hat J_\lambda(S_P, S_Q; \hat k_j) - \hat J_\lambda(S_P, S_Q; \hat k_{\hat\beta})}_{(ii)} + \underbrace{\hat J_\lambda(S_P, S_Q; \hat k_{\hat\beta}) - J_\lambda(P, Q; \hat k_{\hat\beta})}_{(iii)}.$

Term (i) is identical to terms (a) through (c) in the proof of Theorem 10, and is upper-bounded by $2(\gamma + \xi^j_n)$ conditional only on the convergence event for $(S_{P_j}, S_{Q_j})$. Term (ii) is at most 0, since $\hat k_j$ corresponds to choosing the $j$-th standard unit vector for $\beta$, so $\hat\beta$ is at least as good as that choice of $\beta$. Finally, term (iii) is covered by Proposition 3, as in Proposition 8 of Liu et al. [16], giving an upper bound holding with probability $1 - \delta$ over $(S_P, S_Q)$.

Figure 5 shows samples from CIFAR-10 and CIFAR-10.1. CIFAR-10.1 is available from https://github.com/modestyachts/CIFAR-10.1/tree/master/datasets (we use cifar10.1_v4_data.npy); this newer test set contains 2,031 images from TinyImages [54].

We implement all methods in PyTorch 1.1 (Python 3.8) on an NVIDIA Quadro RTX 8000 GPU, and set up our experiments according to the protocol proposed by Liu et al. [16]. In the following, we describe our configurations in detail. We run ME and SCF using the official code [5]. We set $\alpha = 0.05$ for all experiments. We use a deep neural network $g \circ \phi$ as the classifier in C2ST-S and C2ST-L, where $g$ is a two-layer fully-connected binary classifier, and $\phi$ is the feature-extraction architecture also used in the deep kernels of MMD-D, AGT-KL, Meta-KL, Meta-MKL, and the methods in Tables 3 and 4. For HDGM, $\phi$ is a five-layer fully-connected neural network; the number of neurons in the hidden and output layers of $\phi$ is set to $3 \times d$, where $d$ is the dimension of the samples, with softplus activations, $\log(1 + \exp(x))$. For CIFAR, $\phi$ is a convolutional neural network (CNN) with four convolutional layers and one fully-connected layer; its structure follows the feature extractor in the discriminator of DCGAN [55] (see Figures 6 and 7 for the structure of $\phi$ in our tests, MMD-D, C2ST-S and C2ST-L). We randomly select data from two different classes to form the two samples ($n_i = 100$) as meta-samples in CIFAR-10/CIFAR-100; thus, there are $\binom{10}{2}$ and $\binom{100}{2}$ tasks when running Algorithm 1 on the training sets of CIFAR-10 and CIFAR-100, respectively. For each task, we have 200 instances. Note that, for results on synthetic data, we repeat experiments 20 times to avoid effects caused by generation noise. The DCGAN code is from https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/dcgan/dcgan.py.
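Based on that description, a sketch of the HDGM feature extractor $\phi$ might look as follows; reading "five-layer fully-connected" as five Linear layers is our interpretation of the description.

```python
import torch.nn as nn

def make_phi_hdgm(d):
    """Feature extractor phi for HDGM per Appendix B.1: fully-connected layers
    with 3*d units in each hidden and output layer, softplus activations."""
    h = 3 * d
    return nn.Sequential(
        nn.Linear(d, h), nn.Softplus(),
        nn.Linear(h, h), nn.Softplus(),
        nn.Linear(h, h), nn.Softplus(),
        nn.Linear(h, h), nn.Softplus(),
        nn.Linear(h, h),
    )
```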
We use the Adam optimizer [56] to optimize network and/or kernel parameters. Hyperparameter selection for ME, SCF, C2ST-S, C2ST-L, MMD-O and MMD-D follows Liu et al. [16]. In Algorithm 1, $\lambda$ is set to $10^{-8}$, the inner update learning rate $\eta$ (line 2) is set to 0.8, and the meta-update learning rate is set to 0.01; the batch size is set to 10, and the maximum number of epochs is set to 1,000. In line 6 of Algorithm 1, we use the Adam optimizer with default hyperparameters. In line 1 of Algorithm 3, we adopt the Adam optimizer with default hyperparameters and set the learning rate to 0.01; we use the algorithm output by Algorithm 1 to initialize the parameters in that optimization. To avoid the computational cost caused by the large number of meta-tasks, we randomly select 10 tasks in Meta-MKL rather than using all $N$ tasks. Meanwhile, to ensure that we still get help from all tasks, we use the algorithm output by Meta-KL to optimize the deep kernels (line 1 in Algorithm 3) on the selected 10 tasks; the algorithm output by Meta-KL is helpful for finding the best deep kernel for each task. Note that we do not use dropout.

We report the test power ± standard error of Meta-KL and Meta-MKL when increasing the number of tasks $N$ from 20 to 150 in this subsection. Tables 5 and 6 show that the test power generally increases with $N$. When $m^{te} = 250$, the lowest test power appears at $N = 20$ (0.333 for Meta-KL and 0.459 for Meta-MKL), and the highest test power appears at $N = 150$ (0.771 for Meta-KL and 0.907 for Meta-MKL). This means that increasing the number of meta-tasks helps improve the test power on the target task.

B.3 Distinguishing CIFAR-10 or -100 from CIFAR-10.1 using CIFAR-100-based meta-tasks. In this subsection, we report results when meta-samples are generated from the training set of the CIFAR-100 dataset, shown in Tables 7 and 8. Our methods still achieve high test power compared to previous methods. Moreover, we obtain higher test power on the task CIFAR-100 vs CIFAR-10.1 compared to the results in Table 2, since the meta-samples used here are closer to the target task. This phenomenon also appears in Section 5.3.

In this subsection, we describe how to estimate the γ-relatedness between the target task $T = (P, Q)$ and the meta-tasks $T_i = (P_i, Q_i)$.

Estimation of γ-relatedness. Let $S_P$ and $S_Q$ be samples drawn from $P$ and $Q$, respectively, and let $S_{P_i}$ and $S_{Q_i}$ be samples drawn from $P_i$ and $Q_i$, respectively. We split $S_P$ into $S_P^{tr} \cup S_P^{te}$, $S_Q$ into $S_Q^{tr} \cup S_Q^{te}$, $S_{P_i}$ into $S_{P_i}^{tr} \cup S_{P_i}^{te}$, and $S_{Q_i}$ into $S_{Q_i}^{tr} \cup S_{Q_i}^{te}$. Let the deep kernel $k$ have the form (8). Following Definition 4 and [16], we find a kernel trying to achieve the maximum in γ as

    $\hat k = \arg\max_k \big| \hat J(S_P^{tr}, S_Q^{tr}; k) - \hat J(S_{P_i}^{tr}, S_{Q_i}^{tr}; k) \big|.$

Based on Definition 4, we can then estimate the γ between $T$ and $T_i$ on the held-out splits as

    $\hat\gamma_i = \big| \hat J(S_P^{te}, S_Q^{te}; \hat k) - \hat J(S_{P_i}^{te}, S_{Q_i}^{te}; \hat k) \big|.$

To try to avoid local maxima during the above maximization, we repeat the optimization procedure 10 times when estimating $\hat\gamma_i$; namely, we obtain 10 values $\{\hat\gamma_{it}\}_{t=1}^{10}$ for $\hat\gamma_i$. Hence, the estimated γ between $T$ and $\{T_i\}_{i=1}^N$ is set to $\hat\gamma = \min_i \max_t \hat\gamma_{it}$.

Closeness vs γ-relatedness. Given the target task $T$ in the synthetic datasets, in this experiment we set $|S_P^{tr}| = |S_P^{te}| = |S_{Q_i}^{tr}| = |S_{Q_i}^{te}| = 4{,}000$ and define tasks $\mathcal{T}$ with closeness $C$ as $\mathcal{T}(C) = \{T_i := (P, Q((0.6 - C) + 0.1 \times i/N))\}_{i=1}^N$.
It is clear that $\mathcal{T}(0)$ will contain our target task $T$ (i.e., the closeness is zero). We then estimate the γ-relatedness between the target task and $\mathcal{T}(C)$ for $C \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$; the results show that $\hat\gamma$ grows roughly in proportion to $C$. Specifically, for $C$ equal to 0.1, 0.2, 0.3, 0.4, 0.5, the estimated $\hat\gamma$ is 0.035, 0.067, 0.076, 0.104, 0.134, respectively.

References
[1] A kernel two-sample test.
[2] Optimal kernel choice for large-scale two-sample tests.
[3] Energy statistics: A class of statistics based on distances.
[4] Multivariate tests of association based on univariate tests.
[5] Interpretable distribution features with maximum testing power.
[6] A new graph-based two-sample test for multivariate and object data.
[7] Two-sample tests for large random graphs using network statistics.
[8] Revisiting classifier two-sample tests.
[9] On Wasserstein two-sample testing and related families of nonparametric tests.
[10] Generative models and model criticism via optimized maximum mean discrepancy.
[11] Robust hypothesis testing using Wasserstein uncertainty sets.
[12] Practical methods for graph two-sample testing.
[13] Fully distributed sequential hypothesis testing: Algorithms and asymptotic analyses.
[14] Two-sample testing using deep learning.
[15] Learning kernel tests without data splitting.
[16] Learning deep kernels for non-parametric two-sample tests.
[17] Neural tangent kernel maximum mean discrepancy.
[18] An optimal witness function for two-sample testing.
[19] Domain adaptation with conditional transferable components.
[20] Demystifying MMD GANs.
[21] Accumulating regional density dissimilarity for concept drift detection in data streams.
[22] Semantic-transferable weakly-supervised endoscopic lesions segmentation.
[23] Data-driven approach to multiple-source domain adaptation.
[24] Kappa updated ensemble for drifting data stream mining.
[25] CSCL: Critical semantic-consistent learning for unsupervised domain adaptation.
[26] Exploiting MMD and Sinkhorn divergences for fair and transferable representation learning.
[27] Learning bounds for open-set learning.
[28] Open set domain adaptation: Theoretical bound and algorithm.
[29] Maximum mean discrepancy is aware of adversarial attacks.
[30] A segment-based drift adaptation method for data streams.
[31] What and how: Generalized lifelong spectral clustering via dual memory.
[32] Drift-Surf: Stable-state/reactive-state learning under concept drift.
[33] Sequential domain adaptation by synthesizing distributionally robust experts.
[34] How does the combined risk affect the performance of unsupervised domain adaptation approaches.
[35] Equivalence of distance-based and RKHS-based statistics in hypothesis testing.
[36] Model-agnostic meta-learning for fast adaptation of deep networks.
[37] Prototypical networks for few-shot learning.
[38] Learning to propagate labels: Transductive propagation network for few-shot learning.
[39] Attribute propagation network for graph zero-shot learning.
[40] Isometric propagation network for generalized zero-shot learning.
[41] A multi-mode modulator for multi-domain few-shot classification.
[42] Noise contrastive meta-learning for conditional density estimation using kernel mean embeddings.
[43] Multiple kernel learning algorithms.
[44] TAEML: Task-adaptive ensemble of meta-learners.
[45] Selecting relevant features from a multi-domain representation for few-shot classification.
[46] Modified randomization tests for nonparametric hypotheses.
[47] A test for the two-sample problem based on empirical characteristic functions.
[48] Classification accuracy as a proxy for two-sample testing.
[49] A linear-time kernel goodness-of-fit test.
[50] Adaptivity and computation-statistics tradeoffs for kernel and distance based high dimensional two sample testing.
[51] Fast two-sample testing with analytic representations of probability measures.
[52] Learning multiple layers of features from tiny images.
[53] Do ImageNet classifiers generalize to ImageNet?
[54] 80 million tiny images: A large data set for nonparametric object and scene recognition.
[55] Unsupervised representation learning with deep convolutional generative adversarial networks.
[56] Adam: A method for stochastic optimization.