key: cord-0575725-fzfulrl1
authors: Ren, Jie; Zhou, Fei; Li, Xiaoxi; Ma, Shuangge; Jiang, Yu; Wu, Cen
title: Robust Bayesian variable selection for gene-environment interactions
date: 2020-06-09
journal: nan
DOI: nan
sha: dfc02f8055831518536675e7277c2ca99d0137f0
doc_id: 575725
cord_uid: fzfulrl1

Gene-environment (G$times$E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G$times$E studies have been commonly encountered, leading to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a fully Bayesian robust variable selection method for G$times$E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies and analysis of both the diabetes data with SNP measurements from the Nurses' Health Study and TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.

Deciphering the genetic architecture of complex diseases is a challenging task, as it demands the elucidation of the coordinated function of multiple genetic factors, their interactions, as well as gene-environment interactions. How the genetic contributions to influence the variations in the disease phenotypes are mediated by the environmental factors reveals a unique perspective of the disease etiology beyond the main genetic effects and their interactions (or epistasis) 1;2 . Till now, G×E interaction analyses have been extensively conducted, especially within the framework of genetic association studies 3;4 , to search for the important main and interaction effects that are associated with the disease trait 5 .

With the availability of a large amount of genetic factors, such as SNPs or gene expressions, G×E interactions are of high dimensionality even though the preselected environmental factors are usually low dimensional. Therefore, variable selection has emerged as a powerful tool to identify G×E interactions associated with the phenotypic traits 6;7 , and a surging amount of G×E studies have recently been conducted along this line, especially with regularization methods 8 .

A prominent trend among these studies is to incorporate robustness in regularized identification of main and interaction effects in order to accommodate data contamination and heavy-tailed distributions in the disease phenotypes. Take the datasets analyzed in this article for example. The disease outcomes of interest are weight from the Nurses' Health Study (NHS) and (log-transformed) Breslow's depth from The Cancer Genome Atlas (TCGA) Skin Cutaneous Melanoma (SKCM) data. We plot the two in Figure 1 , where the long tails can be clearly observed. In practice, such a heavy-tailed distribution is frequently encountered and arise due to multiple reasons. For instance, some phenotypes have skewness in nature. For the subjects' recruited for the NHS, their ages are in the range from 41 to 68 as the average age for the onset of type 2 diabetes is 45 9 . The subjects' weight among this age group does have a right-skewed tendency. In addition, in the study of complex diseases such as cancer, even patients of similar profiles may have different subtypes as rigorous accrual of patients is usually not affordable. The data from the major disease subtype can be viewed as being "contaminated" by other subtypes or outliers. As nonrobust approaches cannot efficiently accommodate data contamination and long tailed distributions, which inevitably leads to biased estimates and false identifications, the robust regularization methods have thus been extensively developed for G×E studies 7;8 . Nevertheless, within the Bayesian framework, robust variable selection methods have not been investigated for gene-environment interactions by far. In fact, our literature search indicates that only limited number of Bayesian variable selection methods have been developed for G×E studies, and none of them is robust 8 . Driven by the urgent need to conduct robust Bayesian analysis, we propose robust Bayesian variable selection methods tailored for interaction studies by adopting a Bayesian formulation of the least absolute deviation (LAD) regression to accommodate data contamination and long-tailed distributions in the phenotype. Such a formulation is a special case of the Bayesian quantile regression 10 . The LAD loss has been a very popular choice for developing robust regularization methods for data with structured sparsity, including networks 11;12 and sparse group structure 13 . Its computational convenience has been revealed within the Bayesian framework as efficient Gibbs sampler can be constructed when the loss is combined with LASSO, group LASSO and elastic net penalties 14 . Furthermore, following the strategy of bi-level selection from a nonrobust Bayesian setting 15 , we have developed the Bayesian LAD sparse group LASSO for robust G×E interaction studies. The spike-and-slab priors have been imposed on both the individual and group level to ensure the shrinkage of posterior estimates corresponding to unimportant main and interaction effects to zero exactly. Such a prior leads to the real sparsity and is superior over Laplacian types of shrinkage in terms of identification and prediction results [16] [17] [18] .

In this study, our objective is to tackle the challenging task of developing a fully Bayesian robust variable selection method for G×E interactions, which has been well motivated from the success of regularization methods (especially those robust ones) in G×E studies and a lack of robust interaction analysis within the Bayesian framework. The significance of the proposed study lies in the following aspects. First, it advances from existing Bayesian G×E studies by incorporating robustness to accommodate data contamination and heavytailed distributions in the disease phenotype. Second, on a broader scope, although robust Bayesian quantile regression based variable selection has been proposed under LASSO, group LASSO and elastic net, the more complicated sparse group (or bi-level) structure, which is of particular importance in high dimensional data analysis in general 19 , has not been fully understood yet. We are among the first to develop robust Bayesian sparse group LASSO for bi-level selection. Third, unlike existing Bayesian regularized quantile regression methods which build upon the priors under Laplacian type of shrinkage, we conduct efficient Bayesian regularization on both the individual and group levels by borrowing strength from the spikeand-slab priors, thus leading to better identification and prediction performance over the competing alternatives, as demonstrated in extensive simulation studies and case studies of NHS data with SNP measurements and TCGA melanoma data with gene expression measurements. To facilitate reproducible research and fast computation using our MCMC algorithms, we implement the proposed and alternative methods in C++, which are available from an open source R package roben 20 on CRAN.

Use subscript i to denote the ith subject. Let (X i , Y i , E i , W i ), (i = 1, . . . , n) be independent and identically distributed random vectors. Y i is a continuous response variable representing the phenotypic trait. X i is the p-dimensional vector of G factors. The environmental factors and clinical covariates are denoted as the k-and q-dimensional vectors E i and W i , respectively. Considering the following model:

where α t 's, θ m 's, γ j 's and ζ jm 's are the regression coefficients for the clinical covariates, environmental factors, genetic factors and G×E interactions, respectively. We define β j = (γ j , ζ j1 , . . . , ζ jk ) ≡ (β j1 , . . . , β jL ) and U ij = (X ij , X ij E i1 . . . , X ij E ik ) ≡ (U ij1 , . . . , U ijL ) , where L = k + 1. The coefficient vector β j represents all the main and interaction effects corresponding to the jth genetic measurement. The i 's are random errors. Without loss of generality, we assume that the data have been properly normalized so that the intercept can be omitted. Denote U i = (U i1 , . . . , U ip ) , α = (α 1 , . . . , α q ) , θ = (θ 1 , . . . , θ k ) and β = (β 1 , . . . , β p ) . The vector β is of length p × L. Then model (1) can be written in a more concise form as

The least absolute deviation (LAD) regression is well known for its robustness to long-tailed distributions in response. For a Bayesian formulation of LAD regression, we assume that i 's are i.i.d random variables from the Laplace distribution with density

where ν −1 is the scale parameter of the Laplace distribution. Let Y = (Y 1 , . . . , Y n ) . With clinical covariates W = (W 1 , . . . , W n ) , environment factors E = (E 1 , . . . , E n ) , and genetic main effects and G×E interactions U = (U 1 , . . . , U n ) , the likelihood function can be expressed as

Based on Kozumi and Kobayashi 21 , the Laplace distribution is equivalent to the mixture of an exponential and a scaled normal distribution. Specifically, let z andũ be the standard normal and exponential random variables, respectively. If a random variable follows the Laplace distribution with parameter ν, then it can be represented as follows

where κ = √ 8 is a constant. Therefore, the response Y i can be rewritten as

Then u follows the exponential distribution Exp(ν −1 ). We thus have the following hierarchical representation of the Laplace likelihood:

. This hierarchical representation allows us to express the likelihood function as a multivariate normal distribution, which is critical to construct a Gibbs sampler for efficient sampling of the regression coefficients corresponding to main and interaction effects robustly.

Remark : The Laplace distribution in Bayesian LAD regression can be treated as a special case of the asymmetric Laplace distribution (ALD) in Bayesian quantile regression 10; 22 . In Bayesian quantile regression, we assume that i follows the asymmetric Laplace distribution with density

where the check loss function is ρ τ ( i ) = i {τ − I( i < 0)} for the τ th quantile (0 < τ < 1). Note that, when τ = 0.5, the ALD in (6) reduces to a symmetric Laplace distribution defined in (3). Yu and Moyeed 10 have shown that maximizing a likelihood function under the asymmetric Laplace error distribution (6) is equivalent to minimizing the check loss function in quantile regression. Kozumi and Kobayashi 21 have proposed a Gibbs sampler for Bayesian quantile regression based on a location-scale mixture representation of the ALD. Specifically, withũ and z defined as above, the asymmetric Laplace error in (6) can be represented as

where

When τ = 0.5, we have ψ = 0 and κ = √ 8, and equation (7) reduces to the Laplace error in (5).

The proposed fully Bayesian sparse group variable selection is motivated by the following considerations. In model (1), the coefficient vector β j corresponds to the main and interaction effects with respect to the jth genetic variant. Whether the genetic variant is associated with the phenotype or not can be determined by whether β j = 0. A zero coefficient vector suggests that the variant does not have any effect on the disease outcome. If β j = 0, then a further investigation on the presence of the main effect, the interaction or both is of interest, which can be facilitated by examining the nonzero component in β j . Therefore, a tailored robust Bayesian variable selection method for G×E studies should accommodate the selection on both group (the entire vector of β j ) and individual (each component of β j ) levels at the same time.

In order to impose sparsity on both group and individual level to identify important main and interaction effects, we conduct the decomposition of β j by following the reparameterization from 15 . Specifically, β j is defined as

. . , ω jL } , ω jl ≥ 0 (l = 1, ..., L). To determine whether the jth genetic variant has any effect at all, we conduct group-level selection on b j by adopting the following multivariate spike-and-slab priors

where I L is an identity matrix, δ 0 (b j ) denotes a point mass at 0 L×1 and π 0 ∈ [0, 1]. We introduce a latent binary indicator variable φ b j for each group j(j = 1, . . . , p) to tackle the group-level selection. In particular, when φ b j = 0, the coefficient vector b j has a point mass density at zero and all predictors representing the main and interaction effects in the jth group are excluded from the model, indicating that the jth genetic factor is not associated with the phenotype. On the other hand, when φ b j = 1, the components in coefficient vector b j have non-zero values.

To further determine whether there is an important main genetic effect, G×E interaction or both, we impose sparsity within the group j by assigning the following spike-and-slab priors on each ω jl (j = 1, . . . , p and l = 1, . . . , L)

where N + (0, s 2 ) denotes a normal distribution, N (0, s 2 ), truncated below at 0. When the binary indicator variable φ w jl = 0, ω jl is set to zero by the point mass function δ 0 (ω jl ). Within the jth group, when the component ω jl = 0, we have β jl = 0 and the corresponding U jl is excluded from the model, even when b j = 0. This implies that the jth genetic variant does not have the main effect (if l=1) or the interaction effect with the (l − 1)th environment factor (if l > 1). The β jl is non-zero if and only if the vector b j = 0 and the individual element ω jl = 0.

In (8) and (9), π 0 and π 1 control the sparsity on the group and individual level, respectively. Their values should be carefully tuned. Fixing their values at 0.5 makes the prior essentially non-informative since equal prior probabilities are given to all the submodels. Instead of fixing π 0 and π 1 , we assign conjugate beta priors π 0 ∼ Beta(a 0 , b 0 ) and π 1 ∼ Beta(a 1 , b 1 ), which can automatically account for the uncertainty in choosing π 0 and π 1 . We fixed parameters a 0 = b 0 = a 1 = b 1 = 1, so that the priors are essentially non-informative. For computation convenience, we assign a conjugate Inverse-Gamma hyperprior on s 2 s 2 ∼ Inv-Gamma (1, η) η is estimated with the Monte Carlo EM algorithm 15;23 . For the gth EM update,

where the posterior expectation of 1 s 2 is estimated from the MCMC samples based on t (g−1) . To maintain conjugacy, we place a Gamma prior on ν,

where c and d are set to small values.

The joint posterior distribution of all the unknown parameters conditional on data can be expressed as

Define the coefficient vector without the jth group as β (j) = (β 1 , . . . , β j−1 , β j+1 , . . . , β p ) and the corresponding part of the design matrix as U (j) . Likewise, define the coefficient vector without the lth element in the jth group as β (jl) and the corresponding design matrix as U (jl) . Let l b j = p(b j = 0|rest), then the conditional posterior distribution of b j is a multivariate spike-and-slab distribution:

j can be derived as

The posterior distribution (10) is a mixture of a multivariate normal and a point mass at 0. Specifically, at the gth iteration of MCMC, b

j is set to 0, we have φ b(g) j = 0, which suggests that the jth genetic variant is not associated with the phenotype at the gth iteration. Otherwise, φ b(g) j = 1. In addition to the multivariate spike-and-slab distribution on the group level, on the individual level, the conditional posterior distribution of ω jl is also spike-and-slab. Let l w jl = p(ω jl = 0|rest), we have

. It can be shown that

where Φ(·) is the cumulative distribution function of the standard normal random variable. At the gth iteration, the value of φ w(g) jl can be determined by whether the ω (g)

jl is set to 0 or not. Recall that φ w(g) jl = 0 implies that the jth genetic variant does not have the main effect (if l=1) or the interaction effect with the (l − 1)th E factor (if l > 1).

The full conditional distribution for u i is Inverse-Gaussian:

where the shape parameter λ u i = 2ν, mean parameter

With the conjugate Inverse-Gamma prior, the posteriors of s 2 is still an Inverse-Gamma distribution

With conjugate Beta priors, π 0 and π 1 have beta posterior distributions

Last, the full conditional distribution for ν is Gamma distribution

where the shape parameter s ν = c + 3n 2 and the rate parameter

Under our prior setting, conditional posterior distributions of all unknown parameters have closed forms by conjugacy. Therefore, efficient Gibbs sampler can be constructed for the posterior distribution.

We term the proposed robust Bayesian sparse group variable selection with spike and slab priors as RBSG-SS, with direct competitors RBG-SS, RBL-SS and ones without spike and slab priors: RBSG, RBG and RBL. With the non-robust counterpart, there are 12 methods under comparison, which have all been implemented in the C++ based R package roben 20 available from CRAN. It is worth mentioning that besides RBSG-SS, RBG-SS, RBL-SS and RBSG have also been proposed for the first time. A summary of all the methods is provided below.

All the methods under comparison can be grouped according to three criteria: with or without robustness, with or without spike-and-slab priors, and the types of structured sparsity (individual-, group-and bi-level) accommodated through variable selection. We first describe the robust Bayesian methods with spike-and-slab priors: RBSG-SS, RBG-SS and RBL-SS, which have all been proposed for the first time. Among them, RBSG-SS is the "golden" method developed for conducting robust sparse group variable selection for G×E interactions with spike-and-slab priors on both the group and individual levels. Besides, RBG-SS and RBL-SS are robust Bayesian group level and individual level selection with spike-and-slab priors, respectively. The spike-and-slab prior has only been imposed on the group level in RBG-SS. Compared to RBSG-SS, it does not induce within group sparsity. On the other hand, RBL-SS conducts individual-level selection without accounting for group structure. An immediate family of robust methods related to the three are RBSG, RBG and RBL, which do not adopt spike-and-slab priors and cannot shrink coefficients corresponding to the main and interaction effects to zero exactly. While RBG and RBL can be directly derived based on Li et al. 14 , RBSG, robust Bayesian sparse group selection, has not been investigated in existing studies so far.

We have also included six non-robust methods for comparison. Among them, BSG-SS, BG-SS and BL-SS are the non-robust counterparts of RBSG-SS, RBG-SS and RBL-SS, respectively. In particular, the BSG-SS conducts (non-robust) Bayesian sparse group selection with spike-and-slab priors on group and individual level simultaneously, while variable selection has only been conducted on group (individual) level through RBG-SS (RBL-SS) under the spike-and-slab priors. In addition, BSG, BG and BL can be viewed as the benchmarks without incorporating spike-and-slab priors corresponding to BSG-SS, BG-SS and BL-SS. They can also be considered as the non-robust counterpart corresponding to RBSG, RBG and RBL. All the six non-robust alternatives can be readily derived based on existing studies.

For clarification, we list all the methods under comparison in Table 4 in the Appendix. Our contribution includes developing the 4 robust Bayesian variable selection approaches, RBSG-SS, RBG-SS, RBL-SS and RBSG among the first time. For all the rest of the approaches, a modification to the methods from the references provided in Table 4 by including clinical covariates is necessary. Otherwise, these methods cannot be adopted for a direct comparison with the four newly developed ones.

We comprehensively evaluate the proposed and alternative methods through simulation studies. Under all the settings, the responses are generated from model (1) with n = 500, q = 3, p = 100 and k = 5, which leads to a total dimension of 608 with 105 main effects, 500 interactions and 3 additional clinical covariates. The genetic main effects and G×E interactions form 100 groups with group size L = 6. We consider six error distributions for i 's: N(0, 1)(Error 1), Laplace(µ,b) with the mean µ = 0 and the scale parameter b = 2 (Error 2), 10%Laplace(0,1) + 90%Laplace(0, √ 5) (Error 3), 90%N(0,1) + 10%Cauchy(0, 1) (Error 4), t-distribution with 2 degrees of freedom (t(2)) (Error 4), LogNormal(0,1) (Error 5). All of them are heavy-tailed distributions except the first one.

We assess the performance in terms of identification and prediction accuracy. For methods incorporating spike-and-slab priors, we consider the median probability model (MPM) 15;24 to identify important effects. In particular, for the proposed RBSG-SS, we define φ jl = φ b j φ w jl for the lth predictor in the jth group. At the gth MCMC iterations, this predictor is included in the model if the indicator φ (g) jl is 1. Suppose we have collected G posterior samples from the MCMC after burn-ins, then the posterior probability of including the lth predictor from the jth group in the final model is

jl , j = 1, . . . , p and l = 1, . . . , L.

A higher posterior inclusion probability p jl can be interpreted as a stronger empirical evidence that the corresponding predictor has a non-zero coefficient and is associated with the phenotype. The MPM model is defined as the model consisting of predictors with at least 1 2 posterior inclusion probability. When the goal is to select a single model, Barbieri and Berger 24 recommend using MPM because of its optimal prediction performance. Meanwhile, the 95% credible interval (95%CI) 25 is adopted for methods without spike-and-slab priors. Prediction performance is evaluated using the mean prediction errors on an independently generated testing dataset under the same data generating model over 100 replicates. For all robust approaches, the prediction error is defined as mean absolute deviations (MAD). MAD can be computed as 1 n n i=1 |y i −ŷ i |. The prediction error for non-robust ones is defined as the mean squared error (MSE), i.e., 1

The G factors are simulated in the following 4 examples(settings). In the first example, a gene expression matrix with n = 500 and p = 100 has been generated from a multivariate normal distribution with marginal mean 0, marginal variance 1 and an auto-regression correlation structure (ρ = 0.3). In the second example, the single-nucleotide polymorphism (SNP) data are obtained by dichotomizing the gene expression values (from the first setting) at the 1st and 3rd quartiles, with the 3-level (0,1,2) for genotypes (aa,Aa,AA) respectively.

In the third setting, the SNP data are simulated under a pairwise linkage disequilibrium (LD) structure. Let the minor allele frequencies (MAFs) of two neighboring SNPs with risk alleles A and B be r 1 and r 2 , respectively. The frequencies of four haplotypes are as

where δ denotes the LD. Assuming Hardy-Weinberg equilibrium and given the allele frequency for A at locus 1, we can generate the SNP genotype (AA, Aa, aa) from a multinomial distribution with frequencies (r 2

. The genotypes at locus 2 can be simulated according to the conditional genotype probability matrix in Cui et al. 26 . We have δ = r p r 1 (1 − r 1 )r 2 (1 − r 2 ) with MAFs 0.3 and pairwise correlation r p = 0.6. In the last example, we consider a more practical correlation structure by extracting the first 100 SNPs from the NHS data analyzed in the case study, so the correlation is based on the real data. For each simulation replicate, we randomly sample 500 subjects from the dataset.

For E factors, five continuous variables are generated from a multivariate normal distribution with marginal mean 0, marginal variance 1 and AR correlation structure with ρ = 0.5. We then dichotomize one of them at 0 to create a binary E factor. Besides, we simulate three clinical covariates from a multivariate normal distribution and AR correlation structure with ρ = 0.5, and dichotomize one of them at 0 to create a binary variable.

For the clinical covariates and environmental main effects, the coefficients α t 's and θ m 's are generated from Uniform[0.8, 1.5]. For genetic main effect and G×E interactions, we randomly selected 25 β jl 's in 9 groups to have non-zero values that are generated from Uniform[0.3, 0.9]. All other β jl 's are set to zeros.

We have collected the posterior samples from the Gibbs sampler running 15,000 iterations while discarding the first 7,500 samples as burn-ins. The Bayesian estimates are calculated using the posterior medians. Simulation results for the gene expression data in Example 1 are tabulated in Table (1) and (2) . We can observe that the performance of methods that adopt spike-and-slab priors in Table ( 1) is consistently better than methods without spike-andslab priors in Table ( 2). Although, methods without spike-and-slab priors have slightly lower FPs than their counterparts with spike-and-slab priors under some error distributions, they tend to have much lower TPs and higher prediction errors under all the error distributions. For example, under Error2, RBSG identifies 14.48(SD 2.04) out of the 25 true positives, much lower than the true positives of 21.66(SD 1.72) from RBSG-SS. Meanwhile, its false positives 0.64(SD 0.85) is only slightly lower than the FP of RBSG-SS (1.32(SD 1.33)). The prediction error of RBSG, 2.57 with a SD of 0.11, is also inferior than that of the RBSG-SS (2.15(SD 0.10)). Such an advantage can also be observed by comparing other methods in Table ( 1) with their counterparts (without spike-and-slab priors) from Table (2) .

Among all the methods with spike-and-slab priors, as shown in Table ( 1), the proposed RBSG-SS has the best performance in both identification and prediction in the presence of data contamination and heavy-tailed errors. Under the mixture Laplace error (Error 3), RBSG-SS identifies 21.28(SD 2.24) true positives, with a small number of false positives, 1.48(SD 1.34). RBG-SS has a true positive of 24.80(SD 0.73), however, the number of false positives, 30.64(SD 4.23), is much higher. This is due to the fact that RBG-SS only conducts group level selection and does not impose the within-group sparsity. Compared to RBSG-SS, RBL-SS ignores the group structure, leading to fewer true positives of 18.14(SD2.68). In terms of prediction, RBSG-SS has the smallest L1 error, 2.29(0.12), among all the 3 robust methods with spike-and-slab priors. Although the difference in prediction error between RBSG-SS and RBG-SS is not distinct, considering the much smaller number of false positive main and interaction effects, we can fully observe the advantage of RBSG-SS over RBG-SS in prediction. Moreover, a cross-comparison between the robust and non-robust methods further demonstrates the necessity of developing robust Bayesian methods. For instance, under the error of t distribution with 2 degrees of freedom (Error 4), RBSG-SS has identified 23.80(SD 1.30) true main and interaction effects with only 0.53(SD 0.86) false positives. Its direct non-robust competitor, BSG-SS, leads to a true positive of 16.20(SD 6.45) with 3.73(SD 4.61) false effects. The superior performance of RBSG-SS over the other two non-robust methods, BG-SS and BL-SS, is also clear. Although a comparison between the prediction errors of robust and non-robust methods is not feasible as the two are computed under the L1 and least square errors, the identification results convincingly suggest the advantage of robust methods over non-robust ones, Similar patterns have been observed in Table 6 , 7, 8, 9, 10 and 11 for Examples 2, 3 and 4, respectively, in the Appendix. Overall, based on the investigations over all the methods through comprehensive simulation studies, we can establish the advantage of conducting robust Bayesian bi-level selection incorporating spike-and-slab priors.

We demonstrate the sensitivity of RBSG-SS for variable selection to the choice of the hyper-parameters for π 0 , and π 1 in the Appendix. The results are tabulated in Table 5, showing that the MPM model is insensitive to different specification of the hyper-parameters. Following Li et al. 25 , we assess the convergence of the MCMC chains by the potential scale reduction factor (PSRF) 27;28 . PSRF values close to 1 indicate that chains converge to the stationary distribution. Gelman et al. 29 recommend using PSRF≤ 1.1 as the cutoff for convergence, which has been adopted in our study. We compute the PSRF for each parameter and find the convergence of all chains after the burn-ins. For the purpose of demonstration, Figure 2 shows the pattern of PSRF the first five groups of coefficients in Example 1 under Error 2. The figure clearly shows the convergence of the proposed Gibbs sampler.

Nurses' Health Study (NHS) is one of the largest investigations into the risk factors for major chronic diseases in women. As part of the Gene Environment Association Studies initiative (GENEVA), the NHS provides SNP genotypes data as well as detailed information on dietary and lifestyle variables. Obesity level is one of the most important risk factors for Type 2 diabetes mellitus (T2D), a chronic disease due to both genetic and environmental factors. In this study, we analyze the NHS type 2 diabetes data to identify main and interaction effects associated with obesity. We use weight as the response and focus on SNPs on chromosome 10. We consider five environment factors, including the total physical activity (act), glycemic load (gl), cereal fiber intake (ceraf), alcohol intake (alcohol) and a binary indicator of whether an individual has a history of high cholesterol (chol). All these environmental exposures have been suggested to be associated with obesity and diabetes 30 . In addition, we include three clinical covariates: height, age and a binary indicator of whether an individual has a history of hypertension (hbp). In NHS study, about half of the subjects are diagnosed of type 2 diabetes and the other half are controls without the disease. We only use health subjects in this study. After cleaning the data through matching phenotypes and genotypes, removing SNPs with minor allele frequency (MAF) less than 0.05 or deviation from Hardy-Weinberg equilibrium, the working dataset contains 1732 subjects with 35099 SNPs. For computational convenience prescreening can be conducted to reduce For computational convenience prescreening can be conducted to reduce the feature space to a more attainable size for variable selection. For example, Li et al. 25 and Wu et al. 31 use the single SNP analysis to filter SNPs in a GWA study before downstream analysis. In this study, we use a marginal linear model with weight as the response variable to evaluate the penetrance effect of a variant under the environmental exposure. The marginal linear model uses a group of genetic main effect and G×E interactions corresponding to a SNP as the predictors, and test whether this SNP has any effect, main or G×E interaction. The SNPs with p-values less than a certain cutoff (0.001) for any effect, main or interaction, from the test are kept. 253 SNPs pass the screening.

The proposed approach RBSG-SS identifies 22 main SNP effects and 45 G×E interactions. The detailed estimation results are provided in Table 12 in the Appendix. We observe that the proposed method identifies main and interaction effects of SNPs with important implications in obesity. For example, two important SNPs, rs6482836 and rs10741150, that located within gene DOCK1 are identified. DOCK1 (Dedicator Of Cytokinesis 1) has been reported as a putative candidate for obesity related to adiponectin and triceps skinfold by previous studies 32; 33 . RBSG-SS identifies the main effect of rs6482836 and its interaction with the E factor act. Physical activity plays an important role in the prevention of overweight and obese 34 . This result suggests that the expression level of DOCK1 in an individual may influence the effect of physical activity in obesity prevention. RBSG-SS also identifies the interaction between rs10741150 and the E factor chol, suggesting that the effect of cholesterol level can be mediated by DOCK1. Interestingly, a previous study has shown that the expression level of DOCK5, an important paralog of DOCK1, is increased in individuals exposed to a diet high in saturated fatty acids 35 . Our results provide more evidence of the importance of DOCK1 in diet-induced obesity. Another example is the SNP rs11196539, located within gene NRG3. NRG3(Neuregulin 3) has been found to be associated with both the basal metabolic rate (BMR) and body mass index (BMI) 36 . RBSG-SS identifies its interaction with the E factors, gl and alcohol. Both glycemic load and alcohol intake are important dietary variables associated with obesity. The continued intake of high-glycemic load meals leads to an increased risk of obesity 37 . The increasing alcohol consumption is associated with a decline in body mass index in women 38 , however, heavy drinking can increase risk of the metabolic syndrome 39 . Our results suggest that further investigation of NRG3 may help explain the mechanism of the effects of glycemic load and alcohol intake on obesity. For the environment main effects, two E factors, chol and gl, have positive coefficients, and the other three, act, ceraf and alcohol, have negative coefficients, which are consistent with findings in the previous literature.

In addition to the proposed approach, we also conduct analysis using the alternatives RBL-SS, BSG-SS and BL-SS. As other alternative methods show inferior performance in simulation, they are not considered in real data analysis. Detailed estimation results are provided in Table 13 , 14 and 15 in the Appendix. In Table 3 , we provide the numbers of identified main and interaction effects with pairwise overlaps, to show the difference in terms of identification between the proposed method and the others. To further investigate the biological similarity of the identified genes, we conduct the Gene Ontology (GO) analysis. We can find an obvious difference between the proposed RBSG-SS and the three alternatives. The GO analysis results are provided in Figure 3 .

With real data, it is difficult to assess the selection accuracy objectively. The prediction performance may provide additional information to the selection results. Following Yan and Huang 40 and Li et al. 25 , we refit the models identified by RBSG-SS and RBL-SS using the robust Bayesian Lasso, and refit the models selected by BSG-SS and BL-SS using the Bayesian Lasso. For robust methods, the prediction mean absolute deviations (PMAD) are computed based on the posterior median estimates. The PMADs are 8.64 and 8.88 for RBSG-SS and RBL-SS, respectively. The proposed method outperforms the competitors. For non-robust methods, the prediction mean squared errors, or PMSEs, are 128.39 and 137.77 for BSG-SS and BL-SS, respectively. Overall, the superior performance of RNSG-SS over the alternatives can be observed.

In this case study, we analyze the Cancer Genome Atlas (TCGA) skin cutaneous melanoma (SKCM) data. TCGA is a collaborative effort supported by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), and has published high quality clinical, environmental, as well as multi-omics data. For this study, we use the level-3 gene expression data of SKCM downloaded from the cBio Cancer Genomics Portal 41 . Our goal is to identify genes that have genetic main effect or G×E interaction effects on the Breslow' thickness, an important prognostic variable for SKCM 42 . The log-transformed Breslow's depth is used as the response variable and four E factors are considered, age, AJCC pathologic tumor stage, gender and Clark level. Data are available on 294 subjects and 20,531 gene expressions. We adopt the same screening method used in the first case study to select 109 genes for further analysis.

The proposed approach RBSG-SS identifies 16 main SNP effects and 32 G×E interactions. The detailed estimation results are available from Table 16 in the Appendix. One important gene identified is CXCL6 (C-X-C Motif Chemokine Ligand 6), a chemokine with neutrophil chemotactic and angiogenic activities. It has been reported that CXCL6 plays an important role in melanoma growth and metastasis 43 . RBSG-SS identifies its main effect and its interactions with E factors, stage and Clark level. This suggests that CXCL6 can have different effects at different stages of melanoma. Another important finding is the gene MAGED4, one of member in MAGE(Melanoma-associated antigen) family. MAGE family contains genes that are highly attractive targets for cancer immunotherapy 44 . MAGED4 has been found to be an potential target for glioma immunotherapy 45 . RBSG-SS identifies the main effect of MAGED4 and its interaction with the E factor tumor stage, suggesting that MAGED4 may also play an important role in SKCM and its effect may change over different tumor stages. For the main effects of the E factors, Clark level and tumor stage have positive coefficients, and age and gender have negative coefficients, which match observations in the literature.

Analysis is also conducted via the three alternative methods, and the results are summarized in Table 3 . Detailed estimation results are provided in Table 17 , 18 and 19 in the Appendix. Again, the proposed RBSG-SS identifies different sets of main and interaction effects from the rest. We further investigate the biological similarity of the identified genes by GO analysis (Figure 3) , which suggests an obvious difference. Prediction performance is also evaluated. The PMADs are 0.69 and 0.83 for RBSG-SS and RBL-SS, respectively. The proposed approach again has better prediction performance than RBL-SS. The PMSEs are 0.93 and 1.05 for BSG-SS and BL-SS, respectively. Combined, the RBSG-SS outperforms the alternatives. 

In this study, we have developed robust Bayesian variable selection methods for geneenvironment interaction studies. The robustness of our methods comes from Bayesian formulation of LAD regression. In G×E studies, the demand for robustness arises in heavy-tailed distribution/ data contamination in both the response and predictors, as well as model misspecification. We have focused on the first case, which is frequently encountered in practice. Investigations of the robust Bayesian methods accommodating the other two cases are interesting and will be pursued in the future.

Recently, penalization has emerged as a power tool for dissecting G×E interactions 8 . Our literature review suggests that Bayesian variable selection methods, although tightly related to penalization, has not been fully explored for interaction analyses, let alone the robust ones. We are among the first to conduct robust G×E analysis within the Bayesian framework. The proposed Bayesian LAD sparse group LASSO are not only specifically tailored for G×E studies, and but also generally applicable for problems incorporating the bi-level structure in a broader context, such as simultaneously selection of prognostic genes and pathways 46;47 . The spike-and-slab priors have been incorporated to further improve identification and prediction performances. As a byproduct, the Bayesian LAD LASSO and group LASSO, both with spike-and-slab priors, have also been investigated for the first time. The computational feasibility of the Gibbs samplers is guaranteed by the R package roben, with the core modules of the MCMC algorithms developed in C++.

In G×E studies, the form of interaction effects can be linear, nonlinear, and both linear and nonlinear, resulting in parametric 13;48;49 , nonparametric 25;50;51 and semiparametric variable selection methods 31;52;53 to dissect G×E interactions, respectively. The proposed study can be potentially generalized to these studies within robust Bayesian framework. For example, variable selection for multiple semiparametric G×E studies can be formulated as a combination of individual and group level selection problem, where the robust Bayesian methods based on sparse group, group and individual level selection are directly applicable. The proposed robust Bayesian framework has paved the way for the future investigations. A Summary of methods. Note: The models in the references are modified to be applicable to G×E settings.

We demonstrate the sensitivity of RBSG-SS for variable selection to the choice of the hyperparameters for π 0 , and π 1 . We consider five different Beta priors: (1) Beta(0.5, 0.5) which is a U-shape curve between (0, 1); (2) Beta(1, 1) which is a essentially a uniform prior; (3) Beta(2, 2) which is a quadratic curve; (4) Beta(1, 5) which is highly right-skewed; (5) Beta(5, 1) which is highly left-skewed. As a demonstrating example, we use the same setting of Example 1 to generate data under the Error 2. Table 5 shows the identification performance of the median thresholding model (MPM) with different Beta priors. For all choices of Beta priors, the MPM model is very stable. Also, RBSG-SS correctly identifies most of the true effects with low false positives in all cases. Therefore, we simply use Beta(1, 1) as the prior for π 0 , and π 1 in this study. 

We carried out an examination of the Gene Ontology (GO) biological processes which provide us with a deeper insight on the differences of the markers identified by different methods. We totally identified 77 unique genes using our proposed method along with three other methods for the NHS data. We conducted the GO enrichment analysis using the R package GOSim and found these genes involve in a total of 158 GO biological processes, the p-values of which are smaller than 0.1 in the GO enrichment analysis. Then we divided the 158 processes into four categories: positive regulation (P), negative regulation (N), regulation (R, without a well-defined "direction") and other (O). We computed the proportions of genes that involve in the four categories of processes for each of the four methods. Similarly, for the TCGA SKCM data, 109 genes were identified by our method along with three other alternative methods. GO enrichment analysis showed that they involve in 183 biological processes, with p-values smaller than 0.1. The results for NHS and TCGA SKCM are provided in Figure 3 , which shows an obvious difference between our proposed method and the three alternatives in both datasets. 

• u i |rest ∼ Inverse-Gaussian(µ u i , λ u i ), where the shape parameter λ u i = 2ν, mean pa-

• ν|rest ∼ Gamma (s ν , r ν ), where the shape parameter s ν = c 1 + 3n 2 and the rate param-

• The posterior of s j is

• η|rest ∼ Gamma (s η , r η ), where s η = p+p×L 2 + d 1 and the rate parameter r η = p j=1 s j 2

G.2 RBL-SS G.2.1 Hierarchical model specification 

• u i |rest ∼ Inverse-Gaussian(µ u i , λ u i ), where the shape parameter λ u i = 2ν, mean pa-

• ν|rest ∼ Gamma (s ν , r ν ), where the shape parameter s ν = c 1 + 3n 2 and the rate param-

• The posterior of s jl is

• η|rest ∼ Gamma (s η , r η ), where s η = p×L+d 1 and the rate parameter r η = j,l s jl 2 +d 2 .

G.3 RBSG G.3.1 Hierarchical model specification

• u i |rest ∼ Inverse-Gaussian(µ u i , λ u i ), where the shape parameter λ u i = 2ν, mean pa-

• ν|rest ∼ Gamma (s ν , r ν ), where the shape parameter s ν = c 1 + 3n 2 and the rate param-

• r −1 j |rest ∼ Inv-Gaussian(η 1 ,

• η 1 |rest ∼ Gamma (s η 1 , r η 1 ), where s η 1 = p 2 +1 and the rate parameter r η 1 = p j=1 r j 2 +d 1 .

• η 2 |rest ∼ Gamma (s η 2 , r η 2 ), where s η 2 = p × L + 1 and the rate parameter r η 2 = j,l ω jl 2

G.4 RBG G.4.1 Hierarchical model specification

• u i |rest ∼ Inverse-Gaussian(µ u i , λ u i ), where the shape parameter λ u i = 2ν, mean pa-

• ν|rest ∼ Gamma (s ν , r ν ), where the shape parameter s ν = c 1 + 3n 2 and the rate param-

• η|rest ∼ Gamma (s η , r η ), where s η = p+p×L 2 + d 1 and the rate parameter r η = p j=1 s j 2 + d 2 .

G.5 RBL G.5.1 Hierarchical model specification 

• u i |rest ∼ Inverse-Gaussian(µ u i , λ u i ), where the shape parameter λ u i = 2ν, mean pa-

• ν|rest ∼ Gamma (s ν , r ν ), where the shape parameter s ν = c 1 + 3n 2 and the rate param-

G.6 BSG-SS G.6.1 Hierarchical model specification

andỹ ijl is defined asỹ ijl = y i − W i α − E i θ − U (jl) β (jl) .

• s 2 |rest ∼ Inv-Gamma 1 + 1 2 j,l I {ω jl =0} , η + 1 2 j,l ω 2 jl • π 0 |rest ∼ Beta a 0 + p j=1 I {β j =0} , b 0 + p j=1 I {β j =0}

• π 1 |rest ∼ Beta a 1 + j,l I {ω jl =0} , b 1 + j,l I {ω jl =0}

• η is estimated with the EM approach used in the proposed method. For the gth EM update

.

• σ 2 |rest ∼ Inv-Gamma( n 2 ,

G.7 BGL-SS G.7.1 Hierarchical model specification 

• β j |rest ∼ l j N(µ β j , σ 2 Σ β j ) + (1 − l j )δ 0 (β j ) where

• The posterior of s j is

Inverse-Gaussian(η,

• η|rest ∼ Gamma (s η , r η ), where s η = p+p×L 2 + d 1 and the rate parameter r η = p j=1 s j 2 + d 2 .

• σ 2 |rest ∼ Inv-Gamma( 

• β jl |rest ∼ l jl N(µ β jl , σ 2 β jl ) + (1 − l jl )δ 0 (β jl ) where

• The posterior of s jl is

Inverse-Gaussian(η, ησ 2 β 2 jl ) if β jl = 0

• π 1 |rest ∼ Beta a 1 + j,l I {β jl =0} , b 1 + j,l I {β jl =0}

• η|rest ∼ Gamma (s η , r η ), where s η = p×L+d 1 and the rate parameter r η = j,l s jl 2 +d 2 .

• σ 2 |rest ∼ Inv-Gamma( G.9 BSG G.9.1 Hierarchical model specification 

andỹ ij is defined asỹ ij = y i − W i α − E i θ − U (j) β (j) .

• r −1 j |rest ∼ Inv-Gaussian(η 1 ,

• ω −1 jl |rest ∼ Inv-Gaussian(η 2 , η 2 σ 2 β 2 jl )

• η 1 |rest ∼ Gamma (s η 1 , r η 1 ), where s η 1 = p 2 +1 and the rate parameter r η 1 = p j=1 r j 2 +d 1 .

• η 2 |rest ∼ Gamma (s η 2 , r η 2 ), where s η 2 = p × L + 1 and the rate parameter r η 2 = j,l ω j 2 + d 2 .

• σ 2 |rest ∼ Inv-Gamma( n+p×L 2 , n i=1ỹ i

G.10 BGL G.10.1 Hierarchical model specification 

andỹ ij is defined asỹ ij = y i − W i α − E i θ − U (j) β (j) .

• s −1 j |rest ∼ Inverse-Gaussian(η,

• η|rest ∼ Gamma (s η , r η ), where s η = p+p×L 2 + d 1 and the rate parameter r η = p j=1 s j 2 + d 2 .

• σ 2 |rest ∼ Inv-Gamma( n+p×L 2 , n i=1ỹ i

G.11 BL G.11.1 Hierarchical model specification 

• s −1 j |rest ∼ Inverse-Gaussian(η, ησ 2 β 2 jl )

• η|rest ∼ Gamma (s η , r η ), where s η = p×L+d 1 and the rate parameter r η = j,l s jl 2 +d 2 .

• σ 2 |rest ∼ Inv-Gamma( n+p×L

Gene-environment interactions in human diseases

Review of the geneenvironment interaction literature in cancer: What do we know?

A comprehensive review of genetic association studies

Genetic association studies: an information content perspective

Testing Gene-Environment Interaction in Large-Scale Case-Control Association Studies: Possible Choices and Comparisons

roben: Robust Bayesian Variable Selection for Gene-Environment Interactions

Gibbs sampling methods for bayesian quantile regression

A three-parameter asymmetric laplace distribution and its extension

The bayesian lasso

Optimal predictive model selection

Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies

Gene-centric genomewide association study via entropy

Inference from iterative simulation using multiple sequences

General methods for monitoring convergence of iterative simulations

Bayesian Data Analysis

Diet, lifestyle, and the risk of type 2 diabetes mellitus in women

Integrative analysis of gene-environment interactions under a multi-response partially linear varying coefficient model

Linkage and association analysis of obesity traits reveals novel loci and interactions with dietary n-3 fatty acids in an alaska native (yup'ik) population

Newly identified set of obesity-related genotypes and abdominal fat influence the risk of insulin resistance in a korean population

Physical activity and obesity prevention: a review of the current evidence

Novel association approach for variable number tandem repeats (VNTRs) identifies DOCK5 as a susceptibility gene for severe obesity

Genome-wide association study for the interaction between BMR and BMI in obese Korean women including overweight

Glycemic index and obesity

Alcohol consumption, metabolic cardiovascular risk factors and hypertension in women

Prospective study of alcohol consumption and metabolic syndrome

Model selection for cox models with time-varying coefficients

The cbio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data

Breslow thickness and clark level in melanoma

Isotypic neutralizing antibodies against mouse gcp-2/cxcl6 inhibit melanoma growth and metastasis

Maged4 expression in glioma and upregulation in glioma cell lines with 5-aza-2'-deoxycytidine treatment

Melanoma-associated antigen genes -an update

Identification of prognostic genes and pathways in lung adenocarcinoma using a bayesian approach

Identification of genes associated with cancer progression and prognosis in lung adenocarcinoma: Analyses based on microarray from oncomine and the cancer genome atlas databases

Histopathological imaging-environment interactions in cancer modeling

Penalized variable selection for lipid-environment interactions in a longitudinal lipidomics study

A novel method for identifying nonlinear gene-environment interactions in case-control association studies

Additive varying-coefficient model for nonlinear gene-environment interactions

A penalized robust semiparametric approach for gene-environment interactions

Semiparametric bayesian variable selection for gene-environment interactions

Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer

Penalized regression, standard errors, and bayesian lassos