Multiple Testing for Composite Null with FDR Control Guarantee
Ran Dai; Cheng Zheng
2021-06-24

False discovery rate (FDR) controlling procedures provide important statistical guarantees for reproducibility in signal identification experiments with multiple hypothesis testing. In many recent applications, the same set of candidate features is studied in multiple independent experiments; for example, experiments repeated at different facilities and with different cohorts, or association studies with the same candidate features but different outcomes of interest. These studies provide us opportunities to identify signals by considering the experiments jointly. We study the question of how to provide reproducibility guarantees when we test composite null hypotheses on multiple features. Specifically, we test the unions of the null hypotheses from multiple experiments. We present a knockoff-based variable selection method to identify mutual signals from multiple independent experiments, with a finite-sample FDR control guarantee. We demonstrate the performance of this method with numerical studies and applications in analyzing crime data and TCGA data. Multiple testing has become a prevailing subject in many scientific areas, where multiple hypotheses are tested at once in a single experiment. It is very useful when the candidate feature collection is large and the signals are relatively rare. One important challenge is to provide reproducibility guarantees while still keeping the power to select the sparse signals. During the last several decades, a new measure of type-I error, namely the False Discovery Rate (FDR) (Benjamini & Hochberg, 1995), opened a new area in multiple hypothesis testing, and methods controlling FDR have been developed rapidly ever since.
Powerful knockoff-based methods have been developed for exact FDR control in feature selection problems in linear (Barber & Candès, 2015) and nonlinear models (Candès, Fan, Janson, & Lv, 2018) with very mild model assumptions. Variants of knockoff methods have become useful tools in scientific research. For example, for genetic studies aiming at detecting the variations across the whole genome during a certain disease process to identify the variations associated with the disease, Sesia, Sabatti, and Candès (2018) developed hidden Markov model knockoffs for FDR control in genome-wide association studies (GWAS). With recent advances in data collection, the same set of candidate features is often studied in multiple independent experiments. For example, some biological experiments are repeated at different facilities, with different devices, or at different times. In sociology studies, the same experiments are done with populations from different social groups. In GWAS, associations of genome features with multiple different outcomes of interest are studied in multiple experiments. These provide us opportunities to identify signals by testing hypotheses on the same features jointly. Although most of the literature focuses on feature selection from a single regression problem, in real applications there is a need for selecting features from multiple experiments jointly. One direction for combining the feature selection information from multiple experiments is to aggregate the information to improve the selection power. Along this direction, the group and multitask knockoff (Dai & Barber, 2016) and the prototype group knockoff (Chen, Hou, & Hou, 2019) have been proposed. These methods select features that are signals in at least one of the experiments; in other words, they test the composite nulls that are the intersections of the nulls from the individual experiments.
There is another direction of utilizing the information from the multiple experiments jointly, where only features that are signals in all the experiments are selected. In other words, we are testing the composite nulls that are unions of the nulls from the individual experiments. This type of test is of potential interest in multiple applications, for example, reproducible research, mutual feature identification from heterogeneous populations, and mediation analysis. We will elaborate on these applications below in Section 1.1. In this paper, we present a knockoff-based method with exact FDR control for feature selection based on hypotheses with composite nulls being unions of nulls from independent studies on the same features. In particular, we are interested in testing the composite null hypotheses on p mutual features from K independent experiments X^k_1, ..., X^k_p, k = 1, ..., K, for their associations with the outcomes Y^k, k = 1, ..., K. Notice that the outcome variables Y^k can be of different data types (continuous, count, nominal, ordinal, mixed, etc.), and (X^k_1, ..., X^k_p) can have different distributions among different experiments. For example, the Y^k's can be different disease outcomes and the X^k's can be gene expression data measured on different scales. Define H^k_0j as the null hypothesis that the j-th feature is null in the k-th experiment (i.e., X^k_j ⊥ Y^k | X^k_-j, where X^k_-j := {X^k_1, ..., X^k_p} \ X^k_j) and denote H^k = {j ∈ [p] : H^k_0j is true}, where [p] = {1, ..., p}. Instead of testing the H^k_0j's, we are interested in testing the composite null H_0j = ∪_{k=1}^K H^k_0j, where [K] = {1, ..., K}. The set of signals we are interested in is S = {j ∈ [p] : H_0j is not true}, and H = S^c = {j ∈ [p] : H_0j is true}.
The FDR for a selection Ŝ is defined as FDR(Ŝ) = E[ |Ŝ ∩ H| / max(|Ŝ|, 1) ]. The main contributions of this paper are summarized below: • We construct a knockoff-based FDR control procedure for testing unions of null hypotheses on multiple features, namely the simultaneous knockoff, under mild assumptions similar to the Fixed-X or Model-X knockoff frameworks. Specifically, our proposed framework does not require independence among the p candidate features, but only assumes the K experiments are independent. • We prove that using the simultaneous knockoff procedure, FDR can be exactly controlled for testing unions of null hypotheses in finite sample settings. • We demonstrate the FDR control performance of our method with extensive simulations and a real data application. The problem of testing multiple hypotheses with composite nulls is related to many important scientific areas, for example, reproducible research (Bogomolov & Heller, 2013; Heller, Bogomolov, & Benjamini, 2014; Heller & Yekutieli, 2014), comparative research in genomics studies (Rittschof et al., 2014), and mediation analysis (Sampson, Boca, Moore, & Heller, 2018). In this section, we give some examples that can be formulated as multiple testing problems with composite nulls. Replicability is an important subject in many scientific areas. Replicability analysis aims at identifying findings independently discovered across multiple experiments. These experiments could be slightly different because they are conducted in different institutions, by different experimenters, or at different times. For example, in genetic studies, the association between single nucleotide polymorphisms (SNPs) and a phenotype is recognized as a scientific finding only if it has been discovered in different independent studies with the same features and different cohorts. In other words, for a feature (or SNP in this example) to be selected as a signal, it has to be selected in all the independent studies.
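To make the union-null and FDR notions above concrete, here is a small sketch (the helper names are ours; per-experiment null status is known only in simulations, where it can be used to evaluate a procedure):

```python
import numpy as np

def composite_null_sets(null_indicator):
    """null_indicator: (K, p) boolean array; entry (k, j) is True when
    H^k_0j holds, i.e. feature j is null in experiment k."""
    null_indicator = np.asarray(null_indicator, dtype=bool)
    # The union null H_0j holds if feature j is null in ANY experiment.
    H = np.where(null_indicator.any(axis=0))[0]   # composite null set
    S = np.where(~null_indicator.any(axis=0))[0]  # simultaneous signals
    return S, H

def fdp(selected, H):
    """False discovery proportion |selected ∩ H| / max(|selected|, 1);
    the FDR is its expectation over repeated experiments."""
    selected, H = set(selected), set(H)
    return len(selected & H) / max(len(selected), 1)
```

For instance, with K = 2, a feature that is null in either experiment belongs to the composite null set H, and any selection intersecting H incurs false discoveries.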
Mathematically, in reproducibility research with K independent experiments, we denote the K data sets as (Y^1, X^1), ..., (Y^K, X^K), where Y^k ∈ R^{n_k} and X^k ∈ R^{n_k × p} for k = 1, ..., K. The i-th row of (Y^k, X^k) is an i.i.d. sample from the joint distribution of (Y^k, X^k_1, ..., X^k_p), which in this case does not depend on k and thus can simply be denoted as D: (Y^k_i, X^k_i1, ..., X^k_ip) iid~ D for k = 1, ..., K and i = 1, ..., n_k. To identify the j-th feature as a signal, we need to test the composite null H_0j, which is the union of the null hypotheses H^k_0j from the K independent studies. As a remark, for reproducibility research, when we assume the (Y^k_i, X^k_i1, ..., X^k_ip) are i.i.d. for all k and i, the composite null hypothesis set H is identical to the null sets from the individual experiments H^k for all k ∈ [K]. In this case, we can alternatively merge data from the K experiments and do variable selection on the merged data set. However, in many cases, this is not practical because of privacy and data ownership issues. Testing the union composite nulls provides an opportunity to combine information from different sources while protecting privacy. On the other hand, when the joint distributions of (Y^k_i, X^k_i1, ..., X^k_ip) are heterogeneous across different experiments, the two composite null definitions (union vs. intersection) diverge, and testing the union composite nulls provides more meaningful information for the purpose of reproducible research. We expand on this point in our second motivating example below. When identifying risk factors, designing new treatments, or making policies, there is a need to guarantee consistent results across sub-populations. The sub-populations may have different demographic distributions; therefore there is potential population heterogeneity.
To give a concrete data example, the National COVID Cohort Collaborative (N3C) Enclave contains electronic health record (EHR) data for patients with a positive test for COVID-19 since January 1, 2020, from 53 sites (Haendel et al., 2020). It provides opportunities to study the associations between different comorbidities and COVID-19 severity in this large cohort. The population heterogeneity across sites brings challenges for information aggregation: patients from different sites have different distributions in their demographic information, comorbidities, and economic status; and the potential predictors are recorded differently at different hospitals (for example, at some locations comorbidities are recorded as binary variables, whereas at others they are recorded as categorical variables with more levels, or as continuous variables). The dependence of COVID-19 severity on the predictors at different sites can also be slightly different. In combining the information from different sites, we aim at identifying features that are selected within all the sites studied. To formalize this mathematically, we work with K independent sub-populations with a mutual response Y (COVID-19 severity), but the candidate features X from different sub-populations can have different distributions. We aim at selecting features that are signals in all the sub-populations. We can build K studies for the K sub-populations, (Y^k_i, X^k_i1, ..., X^k_ip) iid~ D^k for k = 1, ..., K, and test the conditional independence Y^k ⊥ X^k_j | X^k_-j for k = 1, ..., K and j = 1, ..., p, where X^k_-j := {X^k_1, ..., X^k_p} \ X^k_j. The j-th feature is a mutual signal if and only if the null hypotheses for feature X^k_j in all K studies do not hold. Therefore this is a problem of testing the composite null H_0j. In many scientific fields, it is important to identify features that are associated with multiple responses.
In particular, mediators can be discovered from simultaneous associations with the treatment and the outcome. Our proposed FDR controlling procedure for composite nulls can potentially be used for high-dimensional mediator selection problems. For example, suppose we aim at identifying genome regions that are associated with both the treatment and the outcome of a certain disease. We can select genome regions as mediators if they are associated with both the treatment and the outcome. To do this, we can combine information from two independent studies: one on the associations between the genome and the treatment, the other on the associations between the genome and the disease outcome with treatment fixed. The mediators will be defined as features that are associated with both the treatment and the outcome. Therefore the selection of mediators from high-dimensional features can be framed as a problem of testing the composite null with K = 2 studies on p candidate features. Here we have (Y^k_i, X^k_i1, ..., X^k_ip) iid~ D^k for k = 1, 2, where Y^1 ∈ R^{n_1} and Y^2 ∈ R^{n_2} are the treatment and outcome respectively; the signals for Y^1 and Y^2 are not necessarily identical. We test the conditional independence Y^k ⊥ X^k_j | X^k_-j for k = 1, 2 and j = 1, ..., p. The j-th feature is a mediator if and only if the null hypotheses in both models, with the treatment (H^1_0j) and the outcome (H^2_0j), do not hold. For reproducibility research, Bogomolov and Heller (2018) proposed a method based on the Benjamini-Hochberg (BH) procedure, selecting features that are commonly selected among all the experiments. There are multiple works based on computing the local false discovery rate as the optimal scalar summary of the multivariate test statistics (Chi, 2008; Heller & Yekutieli, 2014).
Recently, Xiang, Zhao, and Tony Cai (2019) presented the signal classification problem for multiple sequences of multiple tests, where the identification of simultaneous signals is a special case. All these methods require some knowledge of the null distribution. More recently, Zhao and Nguyen (2020) proposed a nonparametric method for asymptotic false discovery rate control in identifying simultaneous signals. This method does not require knowledge of the null distribution. However, all the methods above assume not only the independence of the experiments but also the independence (or weak dependence) of the features within each experiment, which is not realistic in complex high-dimensional data applications such as "omics" data. For multiple testing problems with a single experiment, there are recent advances in relaxing the assumption of independence among the features, for example, the knockoff filters (Barber & Candès, 2015; Candès et al., 2018). The original knockoff filter proposed by Barber and Candès (2015) works in a statistical linear model, requires no knowledge of the design or covariates, the signal amplitude, or the noise level, and achieves exact FDR control in finite sample settings. It can also work in high-dimensional settings (Barber & Candès, 2019). Later, Candès et al. (2018) proposed the Model-X knockoff, extending the knockoff filter to achieve exact FDR control for nonlinear models, where the conditional distribution of the response is allowed to be arbitrary and completely unknown. Model-X knockoff requires that the samples are i.i.d. and the covariate distribution is known. Barber, Candès, and Samworth (2020) further showed that Model-X knockoff is robust to errors in the underlying assumptions on the distribution of X. In addition, Huang and Janson (2020) relaxed the assumptions of Model-X knockoff so that the FDR can be controlled as long as the distribution of X is known up to a parametric model.
There are also abundant publications on the construction of knockoffs with an approximate X distribution. Romano, Sesia, and Candès (2020) developed a Deep Knockoff machine using deep generative models. Liu and Zheng (2019) developed a Model-X knockoff generating method using deep latent variable models. More recently, Bates, Candès, Janson, and Wang (2020) proposed an efficient general metropolized knockoff sampler. Spector and Janson (2020) proposed constructing knockoffs that minimize reconstructability. In this paper, we propose a knockoff-based procedure to establish exact FDR control for multiple hypothesis testing with composite nulls as defined in (1). This method enjoys two advantages: theoretical guarantees of exact FDR control, and mild model assumptions. The rest of the paper is organized as follows. In Section 2, we present the simultaneous knockoff framework. In Section 3, we give the theoretical guarantees for the exact FDR control of the simultaneous knockoff in the finite sample setting and a robustness result for potential mis-specification of the X^k distributions. In Section 4, we show the empirical performance of the simultaneous knockoff with different data structures on (Y^k, X^k). Finally, in Section 5, we apply the simultaneous knockoff procedure to two real data examples. In this section, we present the simultaneous knockoff procedure, which is an extended knockoff framework for FDR control in selecting features that show simultaneous associations in multiple independent experiments. This framework can be paired with both the Fixed-X knockoff and the Model-X knockoff to allow for various data structures in real applications. We first define our problem of interest mathematically and present the relevant notation. Then we provide a brief review of the Fixed-X knockoff and Model-X knockoff. Finally, we present the simultaneous knockoff method.
In this section, we introduce the notation used for testing multiple unions of null hypotheses. Suppose we have K independent experiments sharing the same p candidate predictors, where p can be large. Within the k-th experiment, there are n_k i.i.d. samples drawn from (Y^k_i, X^k_i1, ..., X^k_ip) ~ D^k. The outcomes Y^k are allowed to be of different data types (continuous/count/nominal/ordinal/mixed) among the K experiments. We denote the feature matrix X^k ∈ R^{n_k × p}. Here the X^k's represent the same set of candidate features, but are allowed to have different joint distributions among the K experiments. Now for experiment k ∈ [K], we can construct null hypotheses H^k_0j for j ∈ [p] to represent the conditional independence between Y^k and X^k_j conditioning on X^k_-j. For the j-th feature to be selected, it has to be identified as a signal in all K experiments. We denote the null set for experiment k as H^k = {j ∈ [p] : H^k_0j is true}, and the composite null set is defined as H = {j ∈ [p] : H_0j is true}. Our goal is to develop a procedure to select the largest possible set Ŝ ⊂ [p] such that the false discovery rate (FDR) is under control. The FDR we are interested in is defined as FDR(Ŝ) = E[ |Ŝ ∩ H| / max(|Ŝ|, 1) ]. Both the Fixed-X knockoff (Barber & Candès, 2015) and the Model-X knockoff (Candès et al., 2018) procedures can be used for exact FDR control in variable selection in single experiments, depending on different assumptions. The Fixed-X knockoff filter, or the original knockoff, was proposed by Barber and Candès (2015); its setup is a Gaussian linear model, Y = Xβ + ε with ε ~ N(0, σ² I). The Model-X knockoff does not require the dependence Y | X to be known, by assuming knowledge of the distribution of X. It can work with many more models and allows more flexible usage of test statistics for each feature. It tests the multiple hypotheses H_0j : Y ⊥ X_j | X_-j for j = 1, ..., p; the null set is defined as H = {j ∈ [p] : H_0j is true}.
To run the knockoff method, there are three main steps: constructing knockoffs, calculating test statistics, and filtering the results.
• For the first step, a set of p knockoff features X̃_j, j = 1, ..., p, needs to be constructed for the features X_j. For the Fixed-X knockoff, the matrix of knockoffs X̃ = [X̃_1 ... X̃_p] needs to satisfy, for some vector s ≥ 0, X̃'X̃ = X'X =: Σ and X'X̃ = Σ − diag{s}; X̃ can be computed using an efficient semidefinite programming (SDP) algorithm. For the Model-X knockoff, the knockoff copy X̃ needs to satisfy the pairwise exchangeability condition [X, X̃]_Swap(j) =d [X, X̃], where Swap(j) stands for exchanging the j-th column and the (j+p)-th column. X̃ can be sampled using various algorithms (Bates et al., 2020; Liu & Zheng, 2019; Romano et al., 2020; Spector & Janson, 2020).
• For the second step, appropriate test statistics need to be selected and calculated for each feature as well as its knockoff copy. For the Fixed-X knockoff, the statistics Z ∈ R^{2p} need to have the form Z = z([X X̃]'[X X̃], [X X̃]'Y), and for the Model-X knockoff, Z = z([X X̃], Y) needs to satisfy that if we swap a feature X_j with its corresponding knockoff X̃_j, then the statistics Z_j and Z_{j+p} get swapped. As an example, we can choose the lasso variable selection procedure and construct the statistics from β̂(λ), where β̂(λ) = argmin_b { (1/2) ||Y − [X X̃]b||²_2 + λ ||b||_1 }. We can run over a range of λ values decreasing from +∞ (a fully sparse model) to 0 (a fully dense model) and define Z_j as the largest λ such that β̂_j(λ) ≠ 0, or use |β̂_j(λ)| at a specific λ value. For the Model-X knockoff, we can use a penalized generalized linear regression model to replace the linear model above, and we can also use variable importance from a random forest.
• For the last step, we construct W_j = f(Z_j, Z_{j+p}) with an antisymmetric function f, e.g., W_j = Z_j − Z_{j+p}. If X_j is a true signal, we would expect P{Z_j > Z_{j+p}} > 0.5, while if X_j is null, we would expect Z_j and Z_{j+p} to have the same distribution. Thus, we expect W_j to have a positive sign with probability > 0.5 if X_j is a signal, while W_j has a random sign if X_j is null.
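As a rough illustration of the statistic construction above (not the authors' implementation; a plain proximal-gradient (ISTA) lasso solver at a fixed λ stands in for a full lasso path, and the helper names are ours):

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=2000):
    """Minimize 0.5*||y - A b||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part
    b = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ b - y)               # gradient step
        z = b - g / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

def knockoff_statistics(X, X_tilde, y, lam):
    """Fit the lasso on the augmented design [X, X_tilde] and return the
    difference statistics W_j = |beta_j| - |beta_{j+p}|."""
    A = np.hstack([X, X_tilde])
    b = lasso_ista(A, y, lam)
    p = X.shape[1]
    return np.abs(b[:p]) - np.abs(b[p:])
```

A true signal X_j tends to get a larger lasso coefficient than its knockoff, so W_j is large and positive; a null feature and its knockoff are exchangeable, so W_j is symmetric around zero.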
This allows us to estimate the proportion of false discoveries in S(t) = {j : W_j ≥ t} as FDP̂(t) = #{j : W_j ≤ −t} / max(#{j : W_j ≥ t}, 1), which supports the usage of the following stopping rules from the SeqSelect procedure of Barber and Candès (2015) to determine the cut-off points for finite-sample FDR control. The cut-off point τ = min{ t ∈ W_+ : #{j : W_j ≤ −t} / max(#{j : W_j ≥ t}, 1) ≤ q }, where W_+ = {|W_j| : |W_j| > 0}, controls the modified FDR (defined as E[ |Ŝ ∩ H| / (|Ŝ| + 1/q) ]), and the cut-off point τ_+ = min{ t ∈ W_+ : (1 + #{j : W_j ≤ −t}) / max(#{j : W_j ≥ t}, 1) ≤ q } controls the FDR. The Fixed-X and Model-X knockoff procedures have mild model assumptions in different aspects. With the Fixed-X knockoff, although the dependence E[Y | X] has to be linear and Y | X has to be normally distributed, the distribution of X can be arbitrary and unknown. The Model-X knockoff, on the other hand, can work with an arbitrary conditional distribution of Y | X, but requires knowledge of the distribution of X. These knockoff-based frameworks cannot be directly applied to identify simultaneous signals from independent experiments unless the joint distributions of (Y^k, X^k) are identical across the K experiments so that the data can be merged. There are some extensions of the knockoff methods to work with group-wise variable selection (Chen et al., 2019; Dai & Barber, 2016). These methods allow us to handle group hypotheses within each individual experiment. However, for notational simplicity, we focus on the case where the H^k_0j's are individual hypotheses rather than group hypotheses in the presentation of the simultaneous knockoff. The extension to group hypotheses is straightforward. Before describing the simultaneous knockoff procedure in general, we need to introduce several definitions. Definition 1. (Swapping) For a set S ⊂ {1, ..., p} and Z ∈ R^{2p}, Z_Swap(S) means swapping the entries Z_j with Z_{j+p} for all j ∈ S. Definition 2. (Flip sign function) A function f : R^{2p} → R^p is a flip sign function if f(Z_Swap(S)) = ε ∘ f(Z), where ε_j = −1 for all j ∈ S and ε_j = 1 otherwise, and ∘ denotes pointwise multiplication. As a remark, entrywise the flip sign function is equivalent to the antisymmetric function introduced in Candès et al. (2018).
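The FDP estimate and the two stopping rules translate directly into code; a minimal sketch (the helper names are ours):

```python
import numpy as np

def knockoff_threshold(W, q, plus=False):
    """Smallest t in W_+ = {|W_j| : |W_j| > 0} with estimated FDP <= q.
    plus=False gives the cut-off controlling the modified FDR;
    plus=True gives the Knockoff+ cut-off (adds 1 to the numerator),
    which controls the FDR."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    offset = 1.0 if plus else 0.0
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # the candidate set is empty: select nothing

def knockoff_select(W, q, plus=False):
    """Selection set {j : W_j >= threshold}."""
    t = knockoff_threshold(W, q, plus)
    return [j for j, w in enumerate(W) if w >= t]
```

When no threshold satisfies the bound, τ = +∞ and the selection is empty, mirroring the convention in the text.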
Examples of flip sign functions include the difference function W_j = Z_j − Z_{j+p} and the signed max function W_j = max(Z_j, Z_{j+p}) · sign(Z_j − Z_{j+p}). Definition 3. (One swap combining function) A function f_1 mapping the per-experiment statistics ([Z^1, Z̃^1], ..., [Z^K, Z̃^K]), where Z^k, Z̃^k ∈ R^p for k ∈ {1, ..., K}, to [Z, Z̃] ∈ R^{2p} is a one swap combining function if swapping Z^k_j with Z̃^k_j for any single experiment k (holding everything else fixed) results in swapping Z_j with Z̃_j. One example of a one swap combining function can be defined as below. Denote A = {0, 1}^K. We separate A into two sets: the even set A_e = {a : mod(||a||_1, 2) = mod(K, 2)} and the odd set A_o = {a : mod(||a||_1, 2) = mod(K + 1, 2)}. Then we define [Z, Z̃] as Z_j = max_{a ∈ A_e} min_{k ∈ [K]} Z^{(a_k)}_{jk} and Z̃_j = max_{a ∈ A_o} min_{k ∈ [K]} Z^{(a_k)}_{jk}, where Z^{(0)}_{jk} = Z_{jk}, Z^{(1)}_{jk} = Z̃_{jk}, and Z_{jk}, Z̃_{jk} are the j-th entries of Z^k and Z̃^k respectively. In particular, when K = 2, this construction can be written as Z_j = max( min(Z_{j1}, Z_{j2}), min(Z̃_{j1}, Z̃_{j2}) ) and Z̃_j = max( min(Z_{j1}, Z̃_{j2}), min(Z̃_{j1}, Z_{j2}) ). The simultaneous knockoff procedure allows testing the union of nulls from multiple independent experiments, as long as every single experiment satisfies the model assumptions for either the Fixed-X knockoff or the Model-X knockoff. The simultaneous knockoff procedure is as described below. Step 1: Construct knockoff matrices for the individual experiments X^1, ..., X^K. Denote the knockoff matrices as X̃^1, ..., X̃^K. Remark 1. The knockoff matrix for each experiment can be either a Fixed-X knockoff (Barber & Candès, 2015) or a Model-X knockoff (Candès et al., 2018). We allow different responses Y^k and X^k distributions in different experiments; we also allow using different knockoff construction methods for different experiments. Step 2: Calculate the test statistics [Z^k, Z̃^k] ∈ R^{2p} for each experiment k = 1, ..., K. Remark 2. We allow the use of different kinds of test statistics for different experiments. Step 3: Construct the simultaneous statistics [Z, Z̃] = f_1([Z^1, Z̃^1], ..., [Z^K, Z̃^K]), where f_1(·) is a one swap combining function (Definition 3). Step 4: Construct statistics W ∈ R^p for the p features with W = f_2([Z, Z̃]), where f_2(·) is a flip sign function (Definition 2). Step 5: Construct the selection set Ŝ(q) ⊂ {1, ..., p} with desired FDR level q. We can use the Knockoff filter or the Knockoff+ filter as defined later in (5) and (7). Theorem 1. Choose a threshold τ > 0 by setting τ = min{ t ∈ W_+ : #{j : W_j ≤ −t} / max(#{j : W_j ≥ t}, 1) ≤ q }, where q is the target FDR level (or τ = +∞ if the set above is empty) and W_+ = {|W_j| : |W_j| > 0}.
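Assuming the max-min reading of the even/odd-set construction above, the K = 2 combination can be sketched as follows (helper names are ours):

```python
import numpy as np

def combine_two_experiments(Z1, Zt1, Z2, Zt2):
    """One swap combining for K = 2: the even set {(0,0), (1,1)} builds Z,
    the odd set {(0,1), (1,0)} builds Z-tilde.  Inputs are the length-p
    statistic vectors Z^k and their knockoff counterparts Z-tilde^k."""
    Z = np.maximum(np.minimum(Z1, Z2), np.minimum(Zt1, Zt2))
    Zt = np.maximum(np.minimum(Z1, Zt2), np.minimum(Zt1, Z2))
    return Z, Zt
```

Swapping Z^1_j with Z̃^1_j (a swap in a single experiment) exchanges the returned Z_j and Z̃_j, which is exactly the one swap property; the combined W_j = Z_j − Z̃_j then flips sign whenever any one experiment's statistics for feature j are swapped.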
Then the simultaneous knockoff procedure selecting Ŝ = {j : W_j ≥ τ} controls the modified FDR, defined as E[ |Ŝ ∩ H| / (|Ŝ| + 1/q) ] ≤ q. The slightly modified Knockoff+ procedure selects Ŝ = {j : W_j ≥ τ_+}, where τ_+ = min{ t ∈ W_+ : (1 + #{j : W_j ≤ −t}) / max(#{j : W_j ≥ t}, 1) ≤ q }, and controls the FDR, E[ |Ŝ ∩ H| / max(|Ŝ|, 1) ] ≤ q. The key step in the proof of Theorem 1 is to show that the signs of the W_j's for the composite nulls are i.i.d. following a Bernoulli(1/2) distribution, and independent of |W_j| for all j. As in other knockoff-based methods, this property effectively guarantees that for all j ∈ H there is equal probability of selecting the feature and its knockoff copy, which allows the knockoff procedure to estimate an upper bound of the true false discovery proportion. We show this in the following Lemma 1; the details of the proof can be found in the supplementary material. Lemma 1. Let W = f_2([Z, Z̃]), where f_2 is a flip sign function. Let ε ∈ {±1}^p be a sign sequence independent of W, with ε_j = +1 for all non-null j and ε_j i.i.d. uniform on {±1} for null j. Then (W_1, ..., W_p) =d (ε_1 W_1, ..., ε_p W_p). When the model is nonlinear, we need to use the Model-X approach to create knockoffs in each subsample. In real applications the distribution of the candidate features might not be known exactly. In this case it is important to establish robustness against mis-specification of the X distribution. The following theorem shows the result. Theorem 2. Under the definitions in Section 2, for any ε ≥ 0, consider the null variables for which min_{k : j ∈ H^k} KL^k_j ≤ ε, where KL^k_j denotes the observed Kullback-Leibler divergence for feature j in experiment k between the true distribution P and the misspecified distribution Q. If we use the knockoff+ filter, then the fraction of the rejections corresponding to such nulls obeys E[ |{j ∈ Ŝ ∩ H : min_{k : j ∈ H^k} KL^k_j ≤ ε}| / max(|Ŝ|, 1) ] ≤ q e^ε. In particular, this implies that the false discovery rate is bounded as FDR ≤ min_{ε ≥ 0} { q e^ε + P( max_{j ∈ H} min_{k : j ∈ H^k} KL^k_j > ε ) }. Similarly, if we use the knockoff filter, for any ε ≥ 0, a slightly modified fraction of the rejections corresponding to nulls with min_{k : j ∈ H^k} KL^k_j ≤ ε obeys E[ |{j ∈ Ŝ ∩ H : min_{k : j ∈ H^k} KL^k_j ≤ ε}| / (|Ŝ| + 1/q) ] ≤ q e^ε, and from this we obtain a bound on the modified FDR: mFDR ≤ min_{ε ≥ 0} { q e^ε + P( max_{j ∈ H} min_{k : j ∈ H^k} KL^k_j > ε ) }. When we have an additional large sample of X (for estimating the X distribution), we will be able to make ε small enough.
Otherwise, it has been proposed to evaluate the potential inflation of FDR using simulation (Romano et al., 2020). In Theorem 2, we show a robustness result for the simultaneous knockoff similar to the results in Barber et al. (2020) for the Model-X knockoff, to build statistical foundations for such a simulation approach. In this section, we present numerical studies on the performance of our proposed simultaneous knockoff method. We let K = 2 and consider three types of settings. In Setting 1 (continuous), in both experiments, the Y^k_i's are continuous and Y^k_i | X^k_i follows a linear model. In Setting 2 (binary), the Y^k_i's are binary and Y^k_i | X^k_i follows a logistic model. In Setting 3 (mixed), Y^1_i is continuous and Y^1_i | X^1_i follows a linear model; Y^2_i is binary and Y^2_i | X^2_i follows a probit model. In Step 1 of our method, for the linear model, we use the Fixed-X knockoff method (Barber & Candès, 2015) for the knockoff construction for single experiments. For the logistic model, we use the Model-X knockoff method (Candès et al., 2018) with second-order Gaussian knockoff construction for single experiments. In Step 2, we choose the absolute value of the coefficient as our statistic. In Step 3, we use the one swap combining function as defined in (4). In Step 4, we use the difference function in (3) as the flip sign function. In Step 5, we control the FDR at level q = 0.2 and test the performance of the knockoff filter. For Setting 1 (continuous) and Setting 2 (binary), we compare our proposed simultaneous knockoff method with two alternative methods. The first method, pooling, directly pools all data together and tests the conditional association between X_j and Y using the corresponding knockoff method for a single experiment. The second method, intersection, selects features from each of the K experiments separately using knockoff methods for a single experiment and then takes the intersection of the selected sets to obtain the final selection.
For Setting 3 (mixed), using the pooling method is unnatural since the outcomes Y^1 and Y^2 are of different data types; therefore we only compare the simultaneous knockoff with intersection. Data generation. We first describe the data generation for a single experiment (Y^k, X^k). We first consider Setting 1 (continuous). We sample outcomes from the following linear model: Y^k_i = X^k_i β^k + ε^k_i, with ε^k_i ~ N(0, σ²_k), for k = 1, 2 and i = 1, ..., n_k. Here σ_k controls the signal-to-noise ratio in each experiment. We simulate data with K = 2 independent experiments with sample sizes n_1 = n_2 = 1000 and number of covariates p = 200. For each setting, we run m = 1000 simulations to obtain power (the expected proportion of true signals being selected) and FDR. We first generate covariates X^1_i ~ N(0, Σ(ρ_1)), ..., X^K_i ~ N(0, Σ(ρ_K)) independently, where Σ(ρ) is an autocorrelation matrix with its (i, j)-th element equal to ρ^{|i−j|}. Next we generate coefficients β^1, ..., β^K for the K experiments. We use s_0 to denote the number of simultaneous signals among the K experiments, and s_k the number of signals only present in experiment k. We consider two scenarios for the simultaneous signals: 1. the effect strengths are identical among the K experiments and the directions (positive or negative) are the same; 2. only the directions of the simultaneous signals are the same, but the signal strengths are independent among the K experiments. For Scenario 1, we sample η_j ∈ R^{s_j}, j ∈ {0, ..., K}, where the η_{ji} ~ Uniform[0, 1] are independent for j = 0, ..., K and i = 1, ..., s_j. Then we sample a sign vector ξ ∈ {−1, 1}^p, where the ξ_l are independently sampled from the Rademacher distribution for l = 1, ..., p. With K = 2, the coefficients β^1, β^2 are determined by setting the first s_0 coordinates of both β^1 and β^2 to the Hadamard (entrywise) product of η_0 with the corresponding entries of ξ, the next s_1 coordinates of β^1 to η_1 with the corresponding signs from ξ, the following s_2 coordinates of β^2 to η_2 with the corresponding signs from ξ, and zeros elsewhere. For Scenario 2, we generate η_{0k} ∈ R^{s_0} for k = 1, ..., K independently; i.e., we sample η_{0ki} ~ Uniform[0, 1] independently for k = 1, ..., K and i = 1, ..., s_0.
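A minimal sketch of the Setting 1 generator (helper names are ours; the coefficient vector β is supplied by the caller, so the sign/strength construction above is left out):

```python
import numpy as np

def ar1_cov(p, rho):
    """Autocorrelation matrix Sigma(rho) with (i, j) entry rho**|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_experiment(n, p, beta, rho, sigma, rng):
    """Draw rows X_i ~ N(0, Sigma(rho)) and Y = X beta + N(0, sigma^2) noise."""
    L = np.linalg.cholesky(ar1_cov(p, rho))   # Sigma = L L^T
    X = rng.standard_normal((n, p)) @ L.T
    Y = X @ beta + sigma * rng.standard_normal(n)
    return X, Y
```

Calling this twice with independent generators and (possibly different) ρ_k, σ_k, β^k produces the two experiments of a simulation replicate.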
We generate η_1, η_2, and ξ the same way as described in Scenario 1. The coefficients β^1, β^2 are determined in the same way, except that the first s_0 coordinates of β^k use η_{0k} (with the corresponding signs from ξ), so that the simultaneous signal strengths are independent between the two experiments. Next, for Setting 2 (binary), we consider the following logistic model: logit P(Y^k_i = 1 | X^k_i) = α_k X^k_i β^k, for k = 1, 2 and i = 1, ..., n_k, where X^1_i, X^2_i and β^1, β^2 are generated the same way as in the continuous setting. Next, for Setting 3 (mixed), we first generate (Ȳ^k, X^k) following the same procedure as Setting 1. Then we let Y^1 = Ȳ^1 and dichotomize Ȳ^2 to construct Y^2: Y^2_i = 1{Ȳ^2_i > 0}. We vary the parameters s_0, s_1, s_2, ρ_1, ρ_2, σ_1, σ_2, α_1, α_2 in our simulation studies. Simulation settings. For the continuous and binary settings, we study the performance of the three methods (simultaneous, pooling, intersection). For the mixed setting, we consider two methods (simultaneous, intersection). We first vary s_0, s_1, s_2 to study the effect of the signal sparsity level and the overlap of signals between the two experiments. In particular, we study the following three cases: 1. s_1 = s_2 = 0; 2. s_1 = 0, s_2 ≠ 0; 3. s_1 = s_2 ≠ 0. Next, we vary ρ_1 and ρ_2 to study the effect of the correlations among the covariates. Third, we vary σ_1 and σ_2 (or α_1 and α_2 for the logistic regression) to understand the effect of differences in signal strengths between the two experiments. Detailed settings are deferred to the supplementary material. We first discuss the effect of the non-simultaneous signals. Figure 1 shows the power and FDR for the three settings when we vary s_1 = s_2. As s_1 = s_2 increases, only the simultaneous knockoff method controls FDR. The simultaneous knockoff has slightly lower power than pooling, and the power gap is larger when the signals in the two experiments have different strengths (Scenario 2, right panel). This tells us that when non-simultaneous signals exist, only the simultaneous knockoff method controls the FDR, as expected from the theoretical guarantees. More simulation results can be found in the Supplemental Materials.
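The binary and dichotomized outcomes can be generated as follows (the placement of α_k inside the linear predictor and the zero cutoff are assumptions of this sketch; helper names are ours):

```python
import numpy as np

def simulate_binary(X, beta, alpha, rng):
    """Logistic outcomes: P(Y=1 | X) = expit(alpha * X beta).
    (Scaling the linear predictor by alpha is an assumption here.)"""
    prob = 1.0 / (1.0 + np.exp(-alpha * (X @ beta)))
    return (rng.uniform(size=X.shape[0]) < prob).astype(int)

def dichotomize(Y_bar, cutoff=0.0):
    """Binary outcome from a latent continuous draw, Y = 1{Y_bar > cutoff};
    dichotomizing a linear-Gaussian latent this way yields a probit model."""
    return (np.asarray(Y_bar) > cutoff).astype(int)
```

In the mixed setting, the first experiment keeps the continuous Ȳ^1 and the second applies `dichotomize` to Ȳ^2.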
For the continuous settings (Figures S1 and S2), when there are only simultaneous signals present (s_1 = s_2 = 0) and σ_1 = σ_2, the pooling method also controls FDR. When s_1 = s_2 = 0 but σ_1 ≠ σ_2, in general the pooling method does not control FDR. When there exist non-simultaneous signals in one group (s_1 = 0, s_2 ≠ 0), the pooling method fails to control FDR, and the FDR increases as s_2 increases. The intersection method still controls FDR and has slightly lower power than the proposed simultaneous method. When non-simultaneous signals are present in both groups (s_1 = s_2 ≠ 0), we observe that as s_1 = s_2 increases, the intersection method does not control FDR. In this case, only the simultaneous method successfully controls FDR, with very small power loss compared with the other two methods. From the experiments with varying ρ_1, ρ_2, the power of both the simultaneous and intersection methods decreases slightly as |ρ_2 − ρ_1| increases. The power does not change as |σ²_2 − σ²_1| increases. In general, for Scenario 1, where both the effect strengths and the directions (positive or negative) are the same, the simultaneous and intersection methods have similar power, which is only slightly lower than that of the pooling method. For the binary settings (Figures S3 and S4), we see similar results as in the continuous setting. When there are only simultaneous signals present (s_1 = s_2 = 0), all three methods control FDR; the pooling method has the best power, and our proposed simultaneous method has better power than the intersection method. When there exist non-simultaneous signals in one group (s_1 = 0, s_2 ≠ 0), the pooling method fails to control FDR; the intersection method still controls FDR, but it has lower power than the proposed simultaneous method. When non-simultaneous signals are present in both groups (s_1 = s_2 ≠ 0), we observe that as s_1 = s_2 increases, the intersection method does not control FDR.
In this case, only the simultaneous method successfully controls FDR. The power loss of the simultaneous method relative to the pooling method is larger than in the continuous settings, but still acceptable. The performance does not change much with changing ρ_1, ρ_2 or α_1, α_2. For the mixed setting (Figures S5 and S6), the results are similar to the continuous setting. The intersection method does not control FDR when s_1 = s_2 ≥ 50. The power of the simultaneous knockoff method is slightly better than that of the intersection method. The simulation results are in agreement with our theoretical expectations. First, in terms of FDR control, our proposed simultaneous knockoff method controls FDR across all of our designed settings. The pooling method only controls FDR when s_1 = s_2 = 0 and σ_1 = σ_2, because in this case all samples from the two experiments are i.i.d. The intersection method controls FDR when s_1 = 0 or s_2 = 0, but it fails when s_1 = s_2 ≠ 0. In terms of power, there is some gap between the simultaneous and pooling methods, because the composite null tests are more stringent; however, the gap is moderate. The powers of the simultaneous and intersection methods are similar. In this section, we demonstrate the application of the simultaneous knockoff to fairness in variable selection with FDR control in a crime rate study. Our goal for this study is to identify features that are universally associated with the crime rate, regardless of the race distribution in the community. This is potentially useful in guiding unbiased policymaking based on race-blinded findings. To achieve that, we select features that are simultaneously associated with the crime rate in different race distribution groups. We use the publicly available Communities and Crime data set from the University of California Irvine (UCI) machine learning repository (https://archive.ics.uci.edu/ml/datasets/communities+and+crime).
The data set contains crime information on n = 1994 communities with different race distributions in the US. For each community, it records the crime rate, as well as 122 other variables that are potentially related to the crime rate. All continuous variables are normalized to the 0-1 range. Our primary outcome of interest is the normalized crime rate, and our candidate features are the p = 98 features with no missing values. We split the data into two subsets with approximately equal sample sizes according to the proportion of the Caucasian population within the community. We fit a linear regression model on each subset of the data and aim at identifying simultaneous signals from both subset models with FDR control, where the signals are independently associated with the outcome after adjusting for all the other features. We compare the variable selection procedures simultaneous, pooling, and intersection; the details of these methods are the same as described in Section 4. For the knockoff filter, we use the Knockoff version (5) instead of Knockoff+ (7). We also compare with an additional method, repFDR, which was developed for replicability studies and requires that the Z-scores be normally distributed under the null. Table 1 shows the features identified by the different algorithms at the targeted FDR level q = 0.1. Our proposed simultaneous method selected the following variables: percentage of households with public assistance income in 1989; percentage of kids born to never married; and percent of persons in dense housing. The pooling method selected percentage of kids born to never married; percent of persons in dense housing; and number of vacant households. The intersection method selected percentage of kids born to never married; and percent of persons in dense housing.
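The two-subset construction used earlier in this analysis can be sketched as follows. This is a minimal numpy sketch with hypothetical variable names; the actual analysis splits on the community's Caucasian population proportion.

```python
import numpy as np

def split_by_proportion(X, y, prop):
    # Split samples into two roughly equal-size subsets at the median of
    # `prop` (here, a stand-in for the Caucasian population proportion of
    # each community); each subset then gets its own regression model.
    cutoff = np.median(prop)
    lo = prop <= cutoff
    return (X[lo], y[lo]), (X[~lo], y[~lo])

# Toy illustration with 10 communities and 3 candidate features.
rng = np.random.default_rng(1)
X = rng.uniform(size=(10, 3))
y = rng.uniform(size=10)
prop = np.linspace(0.0, 1.0, 10)   # hypothetical race-proportion column
(X_lo, y_lo), (X_hi, y_hi) = split_by_proportion(X, y, prop)
```

With distinct proportion values and an even sample size, the two halves are equal; ties at the median make them only approximately equal, matching the description above.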
The repFDR method selected many more variables; these included percentage of kids born to never married, percent of persons in dense housing, and number of vacant households, which overlap with the selections of the simultaneous method. To verify the robustness of our proposed method, we also added a set of 98 fake features obtained by permuting the covariates. The feature selection results are shown in Table 2. The variable selections of the simultaneous and intersection methods are relatively stable, while the repFDR method selected permuted fake features and did not control the FDR at the desired level (q = 0.1), which might be due to violations of its assumptions by this data set. In this section, we demonstrate the use of the simultaneous knockoff to identify gene expressions that are associated with glioblastoma multiforme (GBM) in both the male and female sub-populations. GBM is known as a hallmark of malignant processes; however, the molecular mechanisms that dictate its locally invasive progression remain an active research area. In this example we use the male and female sub-populations to demonstrate variable selection using our proposed simultaneous knockoff method to combine data from independent sources, potentially with heterogeneity. In real applications the sub-populations can be much more complicated (e.g., from different sources, collected at different places and times, and with different technologies); therefore, the fact that the simultaneous method does not require the data to be pooled makes it useful. Our GBM gene expression data are from The Cancer Genome Atlas (TCGA). The data contain 501 subjects with an overall survival outcome (in days) and level-3 gene-level expression data for 17813 genes. There are 71 censored observations; since the censoring times are in general later than the event times, we remove the censored samples and only consider the n = 430 samples with observed death. We use marginal screening to pre-screen the genes with a p-value cutoff of 0.0001.
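The text does not spell out the marginal screening test. One plausible implementation, sketched below under our own assumptions, screens each gene by its marginal Pearson correlation with the observed survival time, using Fisher's z-transform for an approximate two-sided p-value.

```python
import numpy as np
from math import atanh, erfc, sqrt

def marginal_screen(X, y, alpha=1e-4):
    # Keep column j if the two-sided p-value for its marginal Pearson
    # correlation with y falls below alpha (Fisher z approximation).
    n = len(y)
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        z = atanh(r) * sqrt(n - 3)       # approximately N(0, 1) under the null
        pval = erfc(abs(z) / sqrt(2))    # two-sided normal tail probability
        if pval < alpha:
            keep.append(j)
    return keep

# Toy check: only "gene" 0 is associated with the outcome.
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 6))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(300)
kept = marginal_screen(X, y)
```

An exact t-test for the correlation coefficient would be an equally valid choice; the z-approximation keeps the sketch dependency-free.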
After the pre-screening, there are p = 123 genes left. We log-transform the gene expression values and apply the simultaneous knockoff to identify genes associated with the outcome in both the male and female sub-populations. We also compare with the other methods; the variable selection results are shown in Table 3. With the simultaneous knockoff method, two genes (FBXO18 and NMI) are selected. Both genes have been frequently studied for their relationship with cancer, including GBM (Du et al., 2018; Meng et al., 2015; Yan et al., 2020). The pooling method selected four genes (FBXO18, FUCA2, MYBL2 and RDM1), and the intersection method selected none. The simultaneous knockoff method is a general framework for testing unions of null hypotheses on conditional associations between candidate features and outcomes. Its assumptions are mild: it only requires that the experiments be independent and that, for each experiment, the model assumptions of either the fixed-X knockoff or the model-X knockoff hold. This gives us opportunities to combine information from experiments with heterogeneous covariate X data structures, different dependencies of Y | X, and even different outcomes Y. The FDR control guarantee is exact at finite sample sizes. There are many open questions in multiple testing that are related to hypothesis testing for composite nulls. Although our simultaneous knockoff method provides solutions for the reproducibility of studies, feature selection across heterogeneous populations, and mediation analysis, there are still more challenges arising from real applications. For example, we can further explore methods that allow combining information from different data splits. We can also develop methods to handle experiments with different sets of candidate features. Proof. The proof of Theorem 1 follows the proof idea in Barber and Candès (2015). Let m = #{j : W_j ≠ 0} and assume without loss of generality that |W_1| ≥ |W_2| ≥ · · · ≥ |W_m| > 0.
Define p-values p_j = 1 if W_j < 0 and p_j = 1/2 if W_j > 0; then Lemma 1 implies that the null p-values are i.i.d. with p_j ≥ Unif[0, 1] stochastically, and are independent of the non-null p-values. We first show the result for the knockoff+ threshold. Define V = #{j ≤ k̂_+ : p_j ≤ 1/2, j ∈ H} and R = #{j ≤ k̂_+ : p_j ≤ 1/2}, where k̂_+ satisfies |W_{k̂_+}| = τ_+. Then
FDR = E[V / (R ∨ 1)]
    = E[ (#{j ≤ k̂_+ : p_j ≤ 1/2, j ∈ H} / (1 + #{j ≤ k̂_+ : p_j > 1/2, j ∈ H}))
        · ((1 + #{j ≤ k̂_+ : p_j > 1/2, j ∈ H}) / (R ∨ 1)) ]
    ≤ q · E[ #{j ≤ k̂_+ : p_j ≤ 1/2, j ∈ H} / (1 + #{j ≤ k̂_+ : p_j > 1/2, j ∈ H}) ]
    ≤ q,
where the first inequality holds by the definition of k̂_+ and the second inequality holds by Lemma 2. Similarly, for the knockoff threshold, we have V = #{j ≤ k̂_0 : p_j ≤ 1/2, j ∈ H} and R = #{j ≤ k̂_0 : p_j ≤ 1/2}, where k̂_0 satisfies |W_{k̂_0}| = τ with τ defined as in Theorem 1. Then
E[V / (R + q^{-1})]
    = E[ (#{j ≤ k̂_0 : p_j ≤ 1/2, j ∈ H} / (1 + #{j ≤ k̂_0 : p_j > 1/2, j ∈ H}))
        · ((1 + #{j ≤ k̂_0 : p_j > 1/2, j ∈ H}) / (R + q^{-1})) ]
    ≤ q · E[ #{j ≤ k̂_0 : p_j ≤ 1/2, j ∈ H} / (1 + #{j ≤ k̂_0 : p_j > 1/2, j ∈ H}) ]
    ≤ q,
where the first inequality holds by the definition of k̂_0 and the second inequality holds by Lemma 2. A.2 Additional proofs for Theorem 1. Proof of Lemma 1. Using the swap-combining property of f_1, we have that [Z, Z̃] satisfies [Z, Z̃]_{Swap(S)} =_d [Z, Z̃] for any set S ⊆ H. For any S ⊆ H, we can write it as the union of K disjoint subsets, S = ∪_{k=1}^K S_k, where S_k ⊆ H_k for k = 1, ..., K and S_{k1} ∩ S_{k2} = ∅ for all k_1 ≠ k_2; in particular, we can let S_k = S ∩ H_k. By the mutual independence of [Z^1, Z̃^1], ..., [Z^K, Z̃^K], we have [Z, Z̃]_{Swap(S_1)} =_d [Z, Z̃] and [Z, Z̃]_{Swap(S_1 ∪ S_2)} =_d [Z, Z̃]_{Swap(S_1)}. We can then similarly apply the swapping operator to S_3, ..., S_K to show [Z, Z̃]_{Swap(S)} =_d [Z, Z̃]. Therefore, W_{Swap(S)} := f_2([Z, Z̃]_{Swap(S)}) =_d f_2([Z, Z̃]) = W for all S ⊆ H. Next, because f is a flip-sign function, for a sign vector δ with S = {j : δ_j = −1}, we have (W_1 · δ_1, ..., W_p · δ_p) = W_{Swap(S)}. By the definition of δ, we know that S ⊆ H. This finishes the proof of the lemma. Lemma 2. For k = m, m − 1, ..., 1, 0, put V^+(k) = #{j : 1 ≤ j ≤ k, p_j ≤ 1/2, j ∈ H} and V^−(k) = #{j : 1 ≤ j ≤ k, p_j > 1/2, j ∈ H}, with the convention that V^±(0) = 0. Let F_k be the filtration defined by knowing all the non-null p-values, as well as V^±(k′) for all k′ ≥ k.
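The thresholds τ and τ_+ referenced in this proof are the standard knockoff and knockoff+ stopping rules. As a concrete illustration (a sketch, not the paper's code), they can be computed from a vector of W statistics as follows:

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    # Smallest t in {|W_j| : W_j != 0} such that
    #   (offset + #{j : W_j <= -t}) / max(#{j : W_j >= t}, 1) <= q,
    # where offset = 1 gives knockoff+ (tau_plus) and offset = 0 gives
    # the knockoff threshold (tau).
    offset = 1.0 if plus else 0.0
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

# Toy example: select features with W_j >= tau_plus.
W = np.array([5.0, 4.0, -3.0, 2.5, 2.0, -1.0, 0.5, 0.2])
tau_plus = knockoff_threshold(W, q=0.5, plus=True)
selected = np.flatnonzero(W >= tau_plus)
```

Under the p-value correspondence used in the proof, #{j ≤ k̂ : p_j > 1/2} is exactly #{j : W_j ≤ −t} at the chosen threshold, which is the numerator of the estimated FDP above.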
Then the process M(k) = V^+(k) / (1 + V^−(k)) is a super-martingale running backward in time with respect to F_k. For any fixed q, k̂ = k̂_+ or k̂ = k̂_0 as defined in the proof of Theorem 1 is a stopping time, and as a consequence,
E[ #{j ≤ k̂ : p_j ≤ 1/2, j ∈ H} / (1 + #{j ≤ k̂ : p_j > 1/2, j ∈ H}) ] ≤ 1.
Proof. The filtration F_k contains the information of whether index k is null, since the non-null process is known exactly. If k is non-null, then M(k − 1) = M(k). If k is null, then M(k − 1) = (V^+(k) − 1)/(1 + V^−(k)) when p_k ≤ 1/2, and M(k − 1) = V^+(k)/V^−(k) when p_k > 1/2. Given that the nulls are i.i.d., exchangeability gives P(p_k ≤ 1/2 | F_k) = V^+(k) / (V^+(k) + V^−(k)). So when k is null,
E[M(k − 1) | F_k] = (V^+(k)/(V^+(k) + V^−(k))) · ((V^+(k) − 1)/(1 + V^−(k))) + (V^−(k)/(V^+(k) + V^−(k))) · (V^+(k)/V^−(k)) = V^+(k)/(1 + V^−(k)) = M(k).
This finishes the proof of the super-martingale property. k̂ is a stopping time with respect to {F_k} since {k̂ ≥ k} ∈ F_k. So we have E[M(k̂)] ≤ E[M(m)] = E[ #{j : p_j ≤ 1/2, j ∈ H} / (1 + #{j : p_j > 1/2, j ∈ H}) ]. Let X = #{j : p_j ≤ 1/2, j ∈ H}; given that p_j ≥ Unif[0, 1] independently for all nulls, we have X ≤_d Binomial(N, 1/2), where N is the total number of nulls. Let Y ∼ Binomial(N, 1/2). Given that f(x) = x/(1 + N − x) is increasing in x, E[M(m)] ≤ E[Y/(1 + N − Y)] ≤ 1. Proof. The proof of Theorem 2 follows the proof idea in Barber et al. (2020). For the knockoff+ filter with threshold τ_+, we bound the contribution of the null features in |{j : j ∈ Ŝ ∩ H and min_{k: j∈H_k} KL_kj ≤ ε}|, where the inequality holds by the construction of knockoff+ under FDR level q. For the knockoff filter at threshold τ, we have the analogous result for the modified FDR with |{j : j ∈ Ŝ ∩ H and min_{k: j∈H_k} KL_kj ≤ ε}|. Next we show that E[R(T)] ≤ e^ε for both T = τ and T = τ_+ to complete the proof. For the events E_j = {min_{k: j∈H_k} KL_kj ≤ ε}, we apply the bound from Lemma 3. The last step holds because, if W_j ≥ −T_j for all j, the summation within the expectation is 0; otherwise, the summation is 1, using Lemma 6 of Barber et al. (2020). This completes the proof of Theorem 2. A.4 Additional proofs for Theorem 2. Lemma 3. For E_j := {min_{k: j∈H_k} KL_kj ≤ ε}, the bound used in the proof of Theorem 2 holds. Proof of Lemma 3. For any j ∈ H, find k with j ∈ H_k such that KL_kj = min_{k′: j∈H_{k′}} KL_{k′j}. Define X_{−(k,j)} = {X^1_1, ..., X^1_p, ..., X^K_1, ..., X^K_p} \ {X^k_j}.
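The final bound in the proof of Lemma 2 rests on the fact that, for Y ∼ Binomial(N, 1/2), E[Y/(1 + N − Y)] = 1 − 2^{−N} ≤ 1. This identity is easy to verify numerically from the binomial pmf (a quick stdlib check, not part of the paper):

```python
from math import comb

def expected_ratio(N):
    # E[Y / (1 + N - Y)] for Y ~ Binomial(N, 1/2), computed exactly from the pmf.
    return sum(comb(N, y) * 0.5 ** N * y / (1 + N - y) for y in range(N + 1))

# The closed form 1 - 2^{-N} matches for every N, so the expectation is < 1.
vals = {N: expected_ratio(N) for N in (1, 2, 5, 10, 20)}
```

Because f(x) = x/(1 + N − x) is increasing and X is stochastically dominated by Binomial(N, 1/2), this closed form bounds E[M(m)] as claimed.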
We will be conditioning on X_{−(k,j)}, X̃_{−(k,j)}, Y and observing the unordered pair {X^k_j, X̃^k_j} = {X^{k(0)}_j, X^{k(1)}_j} (we do not know which is which). Without loss of generality, let W_j ≥ 0. This proves the lemma.
For Setting 1 and Setting 3, we change the parameters as follows:
• Varying s_0, s_1, s_2, fixing ρ_1 = ρ_2 = 0.5 and σ²_1 = 1, σ²_2 = 4:
1. Fixing s_1 = s_2 = 0, we vary s_0 = 10, 20, 30, 40, 50, 60;
2. Fixing s_0 = 40, we vary s_1 = s_2 = 10, 20, 30, 40, 50, 60;
3. Fixing s_0 = 40 and s_1 = 0, we vary s_2 = 10, 20, 30, 40, 50, 60.
• Varying ρ_1, ρ_2, fixing σ²_1 = 1, σ²_2 = 4: we let ρ_1 = 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and ρ_2 = 1 − ρ_1, for the following three choices of s_0, s_1, s_2: s_0 = 40, s_1 = 0, s_2 = 0; s_0 = 40, s_1 = 40, s_2 = 40; s_0 = 40, s_1 = 0, s_2 = 40.
• Varying σ_1, σ_2, fixing ρ_1 = ρ_2 = 0.5: we let σ²_1 + σ²_2 = 5 and σ²_1 = 0.5, 1, 1.5, 2, 2.5, for the following three choices of s_0, s_1, s_2: s_0 = 40, s_1 = 0, s_2 = 0; s_0 = 40, s_1 = 40, s_2 = 40; s_0 = 40, s_1 = 0, s_2 = 40.
For Setting 2 (binary), we vary the following parameters:
• Varying s_0, s_1, s_2, fixing ρ_1 = ρ_2 = 0.5 and α_1 = 1, α_2 = −1:
1. Fixing s_1 = s_2 = 0, we vary s_0 = 10, 20, 30, 40, 50, 60;
2. Fixing s_0 = 40, we vary s_1 = s_2 = 10, 20, 30, 40, 50, 60;
3. Fixing s_0 = 40 and s_1 = 0, we vary s_2 = 10, 20, 30, 40, 50, 60.
• Varying ρ_1, ρ_2, fixing α_1 = 1, α_2 = −1: we let ρ_1 = 0.25, 0.3, 0.35, 0.4, 0.45, 0.5 and ρ_2 = 1 − ρ_1, for the following three choices of s_0, s_1, s_2: s_0 = 40, s_1 = 0, s_2 = 0; s_0 = 40, s_1 = 40, s_2 = 40; s_0 = 40, s_1 = 0, s_2 = 40.
• Varying α_1, α_2, fixing ρ_1 = ρ_2 = 0.5: we let α_1 + α_2 = 0 and α_1 = 0, 0.5, 1, 1.5, 2, 2.5, for the following three choices of s_0, s_1, s_2: s_0 = 40, s_1 = 0, s_2 = 0; s_0 = 40, s_1 = 40, s_2 = 40; s_0 = 40, s_1 = 0, s_2 = 40.
In the following figures we show the simulation results from all our simulation studies.
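For bookkeeping, the Setting 1 / Setting 3 parameter grids listed above can be written out programmatically (a transcription sketch; the dictionary keys are our own names):

```python
# Simulation grids for Setting 1 / Setting 3, transcribed from the text above.
vary_s = (
    [dict(s0=s0, s1=0, s2=0) for s0 in (10, 20, 30, 40, 50, 60)]      # case 1
    + [dict(s0=40, s1=s, s2=s) for s in (10, 20, 30, 40, 50, 60)]     # case 2
    + [dict(s0=40, s1=0, s2=s) for s in (10, 20, 30, 40, 50, 60)]     # case 3
)
vary_rho = [dict(rho1=r, rho2=1.0 - r) for r in (0.25, 0.3, 0.35, 0.4, 0.45, 0.5)]
vary_sigma = [dict(sig1_sq=v, sig2_sq=5.0 - v) for v in (0.5, 1.0, 1.5, 2.0, 2.5)]

# Each rho/sigma grid is run under three (s0, s1, s2) choices.
s_choices = [(40, 0, 0), (40, 40, 40), (40, 0, 40)]
```

The binary setting (Setting 2) uses the same structure with (α_1, α_2) in place of (σ²_1, σ²_2).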
In Figures 2 and 3 we show results for Setting 1 (continuous) with Scenario 1 (same signal strengths) and Scenario 2 (different signal strengths). In Figures 4 and 5 we show results for Setting 2 (binary) with Scenario 1 and Scenario 2. In Figures 6 and 7 we show results for Setting 3 (mixed) with Scenario 1 and Scenario 2. In each figure, column 1 includes the settings with s_1 = s_2 = 0, column 2 the settings with s_1 = s_2 ≠ 0, and column 3 the settings with s_1 = 0, s_2 ≠ 0; row 1 shows the experiments varying s_0, s_1, s_2, row 2 the experiments varying ρ_1, ρ_2, and row 3 the experiments varying σ_1, σ_2 (or α_1, α_2).
References
• Controlling the false discovery rate via knockoffs
• A knockoff filter for high-dimensional selective inference
• Robust inference with knockoffs
• Metropolized knockoff sampling
• Controlling the false discovery rate: A practical and powerful approach to multiple testing
• Discovering findings that replicate from a primary study of high dimension to a follow-up study
• Assessing replicability of findings across two studies of multiple features
• Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection
• A prototype knockoff filter for group selection with FDR control. Information and Inference: A Journal of the IMA
• False discovery rate control with multivariate p-values
• The knockoff filter for FDR control in group-sparse and multitask regression
• Clinical significance of FBXO17 gene expression in high-grade glioma
• The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment
• Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study
• Replicability analysis for genome-wide association studies
• Relaxing the assumptions of knockoffs by conditioning
• Deep latent variable models for generating knockoffs
• High expression of N-Myc (and STAT) interactor predicts poor prognosis and promotes tumor growth in human glioblastoma
• Neuromolecular responses to social challenge: Common mechanisms across mouse, stickleback fish, and honey bee
• Deep knockoffs