key: cord-0699166-bqjx7yyp
authors: Wang, A.; Liu, Z.
title: A Two-Sample Robust Bayesian Mendelian Randomization Method Accounting for Linkage Disequilibrium and Idiosyncratic Pleiotropy With Applications to the COVID-19 Outcome
date: 2021-03-05
journal: nan
DOI: 10.1101/2021.03.02.21252801
sha: 2fc0bf68d1e18b455cd07c6df322abb9cff4d9e3
doc_id: 699166
cord_uid: bqjx7yyp

Mendelian randomization (MR) is a statistical method exploiting genetic variants as instrumental variables to estimate the causal effect of modifiable risk factors on an outcome of interest. Despite wide uses of various popular two-sample MR methods based on genome-wide association study summary level data, however, those methods could suffer from potential power loss or/and biased inference when the chosen genetic variants are in linkage disequilibrium (LD), and have relatively large direct effects on the outcome whose distribution might be heavy-tailed which is commonly referred to as the idiosyncratic pleiotropy. To resolve those two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized t-distribution to model such direct effects in a probabilistic model framework which can also incorporate the LD structure explicitly. The generalized t-distribution can be represented as a Gaussian scaled mixture so that our model parameters can be estimated by the EM-type algorithms. We compute the standard errors by calibrating the evidence lower bound (ELBO) using the likelihood ratio test. Through extensive simulation studies, we show that our RBMR has robust performance compared to other competing methods. We also apply our RBMR method to two benchmark data sets and find that RBMR has smaller bias and standard errors. Using our proposed RBMR method, we found that coronary artery disease (CAD) is associated with increased risk of coronavirus disease 2019 (COVID-19). We also develop a user-friendly R package RBMR for public use.

Mendelian randomization (MR) is a popular statistical method that uses genetic variants as instrumental variables (IVs) for assessing the causal effect of a modifiable risk factor on a health outcome of interest even in the presence of unmeasured confounding factors (Ebrahim and Smith, 2008; Lawlor et al., 2008; Evans and Davey Smith, 2015) . Because of the inborn nature of genetic variants, the associations between genetic variants and phenotypes after adjusting for possible population stratification will not be confounded by the environmental factors, socio-economic status and life styles after birth. Genome-wide association studies (GWAS) have identified tens of thousands of common genetic variants associated with thousands of complex traits and diseases (MacArthur et al., 2017) . Those GWAS summary level data contain rich information about genotype-phenotype associations (https://www.ebi.ac.uk/gwas/), and thus provide us valuable resources for MR studies. Therefore, we have seen a boost of two-sample MR method developments and applications based on GWAS summary statistics recently due to the increasing availability of candidate genetic variant IVs for thousands of phenotypes. (Burgess et al., 2013; Bowden et al., 2015; Pickrell et al., 2016) . In particular, a genetic variant serving as a valid IV must satisfy the following three core assumptions (Martens et al., 2006; Lawlor et al., 2008) :

1. Relevance: The genetic variant must be associated (not necessarily causally) with the exposure; 2. Effective Random Assignment: The genetic variant must be independent of any (measured or unmeasured) confounders of the exposure-outcome relationship;

3. Exclusion Restriction: The genetic variant must affect the outcome only through the exposure, that is, the genetic variant must have no direct effect on the outcome not mediated by the exposure.

When these three core IV assumptions hold, the inverse variance weighted (IVW) (Ehret et al., 2011) method can be simply used to obtain unbiased causal effect estimate of the exposure on the outcome. However, among those three core assumptions, only the IV relevance assumption can be empirically tested, for example, by checking the empirical association strength between the candidate IV and the exposure using the GWAS catalog (https://www.ebi.ac.uk/gwas/). The association between the IV and the exposure must be strong enough (the IV explains a large amount of the variation of the exposure variable) to ensure unbiased causal effect estimate. The problem of weak IVs has been studied previously in the econometric literature (Bound et al., 1995; Hansen et al., 2008) . In MR settings, the method that uses genetic score by combining multiple weak IVs together to increase the IV-exposure association strength to reduce weak IV bias has also been proposed (Evans et al., 2013) . Unfortunately, the other two IV core assumptions cannot be empirically tested and might be violated in practice. Violation of the effective random assignment assumption can occur in the presence of LD. Violation of the exclusion restriction assumption can occur when the genetic variant indeed has a non-null direct effect on the outcome not mediated by the exposure, referred to as systematic pleiotropy (Solovieff et al., 2013; Verbanck et al., 2018; Zhao et al., 2020b) . However, very often, genetic variants might have relatively large direct effects whose distribution exhibit heavy-tailed pattern, a phenomenon referred to as the idiosyncratic pleiotropy in this paper.

To address those possible violations of the IV core assumptions, many efforts have been made recently. The MR-Egger regression method introduced an intercept term to capture the presence of unbalanced systematic pleiotropy under the Instrument Strength Independent of Direct Effect (InSIDE) assumption (Bowden et al., 2015) . However, MR-Egger would be biased when there exists idiosyncratic pleiotropy. Zhu et al. (2018) proposed the GSMR method that removes suspected genetic variants with relatively large direct effects and also takes the LD structure into account by using the generalized least squares approach. However, removal of a large number of relatively large direct effects might lead to efficiency loss. Zhao et al. (2020b) proposed MR-RAPS to improve statistical power for causal inference and limit the influence of relatively large direct effects by using the adjusted profile likelihood and robust loss functions assuming that those SNP IVs are independent. However, this independent IV assumption might not hold in practice because SNPs within proximity tend to be correlated. Cheng et al. (2020) proposed a two-sample MR method named MR-LDP that built a Bayesian probabilistic model accounting for systematic pleiotropy and LD structures among SNP IVs. One drawback of the MR-LDP method is that it cannot handle relatively large direct effects well.

To overcome the limitations of those aforementioned methods, we propose a more robust method named 'Robust Bayesian Mendelian Randomization (RBMR)' accounting for LD, systematic and idiosyncratic pleiotropy simultaneously in a unified framework. Specif-ically, to account for LD, we first estimate the LD correlation matrix of SNP IVs and then explicitly include it in the model likelihood. To account for idiosyncratic pleiotropy, we propose to model the direct effects using the more robust multivariate generalized t-distribution (Arellano-Valle and Bolfarine, 1995; Frahm, 2004) which will be shown to have improved performance than using the Gaussian distribution when the idiosyncratic pleiotropy is present.

Moreover, this more robust distribution can be represented as a Gaussian scaled mixture to facilitate model parameter estimation using the parameter expanded variational Bayesian expectation maximization algorithm (PX-VBEM) which combines the VB-EM (Beal et al., 2003) and the PX-EM (Liu et al., 1998) together. We further calculate the standard error by calibrating the evidence lower bound (ELBO) according to a nice property of the likelihood ratio test (LRT). Both extensive simulation studies in Section 3 and analysis of two real benchmark data sets in Section 4 show that our proposed RBMR method outperforms competitors. We also find that coronary artery disease (CAD) is associated with increased risk of severe respiratory confirmed COVID-19 outcome.

Suppose that we have J possibly correlated genetic variants (for example, single-nucleotide polymorphisms, or SNPs ) G j , j = 1, 2, . . . , J, the exposure variable X, the outcome variable Y of interest and unknown confounding factors U . Let δ X and δ Y denote the effects of confounders U on exposure X and outcome Y respectively. The coefficients γ j (j = 1, 2, . . . , J) denote the SNP-exposure true effects. Suppose that all the IVs are valid, then the exposure can be represented as a linear structural function of the SNPs, confounders and an independent random noise term e X . The outcome can be represented as a linear structural function of the exposure, confounders and the independent random noise term e Y . The true effect size of the exposure on the outcome is denoted as β 0 . Then, we have the following linear structural equation models (Bowden et al., 2015) :

(2.1) Let Γ j (j = 1, 2, . . . , J) be the true effects of SNPs on the outcome. With valid IVs, we

where the coefficients α j (j = 1, 2, . . . , J) represent the direct effects of the SNPs on the outcome. Then we have

So far, many existing MR methods assign the Gaussian distribution on each direct effect α j , that is α ∼ N (0, σ 2 0 I J ) (Zhao et al., 2020b; Cheng et al., 2020; Zhao et al., 2020a) , where α = [α 1 , . . . , α J ] T is a J-dimensional vector of direct effects. However, real genetic data might contain some relatively large direct effects whose distribution can be heavy-tailed, and thus the Gaussian distribution might not be a good fit. Therefore, we propose to assign the multivariate generalized t-distribution on α (Arellano-Valle and Bolfarine, 1995; Kotz and Nadarajah, 2004) , which is a robust alternative to the Gaussian distribution (Frahm, 2004 ).

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t >

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t >

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t >

The multivariate generalized t < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > t < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > -distribution：

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t >

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > | 2 ⇠ N 0, 2 < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > | 2 ⇠ N 0, 2 < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > Bayesian probabilistic model for GWAS summary statistics :

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t >

< l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > (Ehret et al., 2011; Bowden et al., 2015; Zhao et al., 2020b) , and the uncorrelated SNPs can be chosen by using a tool called LD clumping (Hemani et al., 2016; Purcell et al., 2007) , which might remove many SNP IVs and thus cause efficiency loss. To include more SNP IVs even if they are in LD, we need to account for the LD structure explicitly. To achieve this goal, we first use the LDetect method to partition the whole genome into Q blocks and then estimate the LD matrix Θ using the estimator Θ (k) (k = 1, 2, . . . , Q) first proposed by Rothman (2012) . Then, the distributions of γ and Γ are given by

are both diagonal matrices (Zhu and Stephens, 2017) .

To account for the presence of idiosyncratic pleiotropy, we propose to model the direct effects α using the more robust multivariate generalized t-distribution (Arellano-Valle and Bolfarine, 1995; Kotz and Nadarajah, 2004 ; Ala-Luhtala and Piché, 2016) whose density function is given by

where N (α|0, Σ/w) denotes the J-dimensional Gaussian distribution with mean 0 and covariance Σ/w, Σ = σ 2 0 I J is a J × J diagonal matrix, and G(w|α w , β w ) is the Gamma distribution of a univariate positive variable w referred to as a weight variable

where f denotes the Gamma function. When α w = β w = ν/2 in equation (2.8), the distribution in equation (2.7) reduces to a multivariate t-distribution, where ν is the degree 7 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 5, 2021. ; https://doi.org/10.1101/2021.03.02.21252801 doi: medRxiv preprint of freedom. Gaussian scaled mixtures enable the use of general algorithms for statistical inference and facilitate the construction of our algorithm, PX-VBEM in Section 2.3.

Then we denote the distribution of the latent variable γ as

where σ 2 = σ 2 I J is a J × J diagonal matrix. By assuming that γ, α and w are latent variables, the complete data likelihood can be written as

(2.10)

The standard expectation-maximization (EM) algorithm (Dempster et al., 1977) is generally a common choice for finding the maximum likelihood estimate of the complete data likelihood. However, one major difficulty of optimizing the complete data likelihood is to calculate the marginal likelihood function which involves difficult integrations with respect to the distributions of the latent variables. In addition, EM algorithm suffers from slow convergence to the approximate solutions (Liu et al., 1998) . To address these numerical issues, we utilize an parameter expanded variational Bayesian expectation-maximum algorithm, namely, PX-VBEM , by replacing the EM algorithm in VB-EM (Beal et al., 2003) with PX-EM algorithm (Liu et al., 1998) to accelerate the speed of convergence. To start with, for the purpose of applying PX-EM algorithm, the distribution of γ in equation (2.5) can be extended as follows:

and we rewrite the complete data likelihood in equation (2.10) as:

where the expanded model parameters for RBMR are θ def = {β 0 , σ 2 0 , σ 2 , ζ}. Let q(γ, α, w) be a variational posterior distribution. The logarithm of the marginal likelihood can be 8 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 5, 2021. ; https://doi.org/10.1101/2021.03.02.21252801 doi: medRxiv preprint decomposed into two parts,

(2.14)

Given that the L(q) is an evidence lower bound ( q (α j ) q(w).

(2.15) with the updating equations of parameters:

In the PX-VBM step, by setting the derivate of ELBO to zero, the model parameters θ can be easily obtained as:

where µ γ = (µ γ 1 , . . . , µ γ J ) T , µ α = (µ α 1 , . . . , µ α J ) T , S γ = diag σ 2 γ 1 , . . . , σ 2 γ J and S α = diag σ 2 α 1 , . . . , σ 2 α J . Finally, we use the updated model parameters θ to construct the evidence lower bound to check the convergence. Since we adopt PX-EM algorithm, the reduction step should be used to process the obtained parameters. More details can be found in the Supplementary Materials.

After obtaining an estimate of the causal effect, we further calculate the standard error according to the property of likelihood ratio test (LRT) statistics which asymptotically follows the χ 2 1 under the null hypothesis (Van der Vaart, 2000). We first formulate the statistical tests to examine the association between the risk factor and the outcome. the likelihood ratio test (LRT) statistics for the causal effect is given by: We utilize PX-VBEM algorithm to maximize the ELBO to get the θ and θ 0 instead of maximizing the marginal likelihood to overcome the computational intractability. Although PX-VBEM produces accurate posterior mean estimates (Blei et al., 2017; Dai et al., 2017; Yang et al., 2018) , PX-VBEM would underestimate the marginal variance, we directly use the estimated posterior distribution from ELBO to approximate marginal likelihood in equation

(2.20) (Wang and Titterington, 2005) . Thus, we calibrate ELBO by plugging our estimates ( θ and θ 0 ) from PX-VBEM into the equation (2.20) to construct the test statistics :

Then, we can get the standard error as se( β 0 )= β 0 / √ Λ .

To mimic real data settings, we simulate the individual-level data by the following models:

where X ∈ R n X ×1 is the exposure vector, Y ∈ R n Y ×1 is the outcome vector, G X ∈ R n X ×J and G Y ∈ R n Y ×J are the genotype datasets for the exposure X and the outcome Y, U X ∈ R n X ×N 0 and U Y ∈ R n Y ×N 0 are matrices for confounding variables, n X and n Y are the corresponding sample sizes of exposure X and outcome Y, J is the number of genotyped SNPs. ε X and ε Y are independent noises from N 0, σ 2 ε X I n X and N 0, σ 2 ε Y I n Y , respectively. In model (3.1), β 0 is the true causal effect and α exhibits the direct effect on the outcome. Since the RBMR model is designed to solve the problem of systematic and idiosyncratic pleiotropy accounting for the LD matrix, we simulate three cases of idiosyncratic pleiotropy to meet the actual situations based on systematic pleiotropy:

• case 1: Γ j = γ j β 0 + α j , α j i.i.d ∼ N (0, σ 2 0 ), j = 1, 2, . . . , 300. We randomly select {5%, 10%, 20%, 25%} IVs so that their direct effect α j s have mean 0 and variance 10σ 0 2 .

11 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 5, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 • case 2: Γ j = γ j β 0 + α j , α j i.i.d ∼ N (0, σ 2 0 ), j = 1, 2, . . . , 300. We randomly select {5%, 10%, 20%, 25%} IVs so that their direct effect α j s have variance σ 2 0 and mean 10σ 0 .

• case 3: Γ j = γ j β 0 + α j , α j i.i.d ∼ t(n), j = 1, 2, . . . , 300, the values of freedom n are {10, 15, 20}. σ 2 0 in case 1 and 2 is controlled by the heritability h α due to systematic pleiotropy, where

and there are two options for h 2 α : 0.05 and 0.07. Analogously, the signal magnitude for γ is set by controlling heritability h 2 γ = var(β 0 G X γ)

, which is fixed at 0.1.

What is more, to imitate the real applications, an external reference panel G r ∈ R nr×J is chosen to estimate the LD matrix among SNPs, where n r is the sample size of reference panel. The R package named MR.LDP is available to generate genotyped matrices G X , G Y and G r , we fix n X = n Y = 20000 and n r = 2500. The total number of SNPs is J = 300.

For confounding variables, each column of U X and U Y is sampled from a standard normal distribution while each row of corresponding coefficients η X ∈ R N 0 ×1 and η Y ∈ R N 0 ×1 of confounding variables is obtained from a multivariate normal distribution N (0, S η ) where diagonal elements of S η ∈ R 2×2 are 1 and the remaining elements are 0.8.

The true causal effect β 0 is 0.2. After conducting single-variant analysis, we can obtain the summary-level statistics { γ j , Γ j } j=1,2,...,300 with their standard errors { σ X j , σ Y j } j=1,2,...,300

for three cases, respectively. Repeat 100 times for each case according to the above simulations. Then we use the summary-level data to conduct point estimate analyses based on the simulation results obtained by the RBMR, MR-LDP, MR-Egger, RAPS, GSMR and IVW methods, respectively. As the prerequisite for MR-Egger, RAPS and IVW methods is that the instrumental variables are independent of each other, we adopt a step-wise GSMR method to remove SNPs with LD structure. 

In this section, we analyzed three real data sets to demonstrate the performance of our proposed method. We first analyze two benchmark data sets commonly used for method comparison purpose, which are about coronary artery disease (CAD) and body mass index (BMI), which are referred as CAD-CAD and BMI-BMI (body mass index) pairs coming from independent GWAS studies. Specifically, in these two real data sets, the GWAS summary statistics for each example (CAD or BMI) are derived from three data sets: selection, exposure and outcome (Zhao et al., 2020b) . Both the exposure data set and outcome data set are CAD or BMI which do not have any overlapping samples. For CAD-CAD and BMI-BMI analysis, the true causal inference should technically be β 0 = 1. In addition, 1000 Genome Project Phase 1 (N = 379) defined as 1KGP is used as a source of the reference panel samples to compute the LD matrix (Consortium et al., 2012) .

For CAD-CAD analysis, the selection, exposure and outcome data sets are all from the R package MR.LDP (https://github.com/QingCheng0218/MR.LDP). We use the Myocardial Infarction Genetics in the UK Biobank as the selection data set, the exposure data is Coronary Artery Disease (C4D) Genetics Consortium (Consortium et al., 2011) , and the transatlantic Coronary Artery Disease Genome Wide Replication and Meta-analysis (CAR-DIoGRAM) is chosen as the outcome dataset (Schunkert et al., 2011) . We first filter the genetic variants using the selection data under different association p-value thresholds (pvalue ≤ 1 × 10 −4 , 5 × 10 −4 , 1 × 10 −3 ). At the same time, the reference panel data 1KGP

is also used to calculate the LD matrix. We conduct GSMR method to obtain the nearindependent SNPs because IVW, MR-Egger and RAPS are designed for independent SNPs.

Then we use the remaining SNPs to perform MR analyses including the following methods:

RBMR, MR-LDP, RAPS, GSMR, MR-Egger, IVW. We obtain causal effect point estimates and the corresponding 95% confidence intervals (CI) as shown in Figure 4.1(a) . Obviously, RBMR is superior to other methods with the smallest bias and shorter confidence intervals for a range of p-value thresholds.

To further investigate the performance of our proposed method, we consider the case that both the exposure and outcome come from BMI. Specifically, we refer the BMI in European ancestry as the screening dataset (Locke et al., 2015) . The exposure data is from BMI for physical active men in European ancestry and the outcome data is from BMI for physical active women in European ancestry (https://portals.broadinstitute.org/ collaboration). The results of point estimates with the corresponding 95% confidence interval are shown in Figure 4 .1(b). We found that our proposed RBMR method has smaller bias than other competing methods. More numerical results are provided in the Supplementary Materials. The current coronavirus disease 2019 (COVID-19) pandemic was caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Li et al., 2020; Thomson, 2020) .

We apply our proposed RBMR method together with other competing methods to estimate the causal effects of CAD on the risk of severe COVID-19. Specifically, the selection dataset is the Myocardial Infraction Genetics in the UK Biobank and exposure dataset is from (Consortium et al., 2011) . The outcome is obtained from release 5 (January 2021) of the (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 5, 2021. 16 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 5, 2021. ; https://doi.org/10.1101/2021.03.02.21252801 doi: medRxiv preprint piratory confirmed COVID-19 and 1,155,203 controls from the general populations in our analysis. We use the selection data under threshold (p-value ≤ 1 × 10 −4 ) to filter the genetic variants. As the assumptions of IVW, MR-Egger and RAPS indicate that the SNPs are independent or in weak LD, GSMR method is conducted to do SNP pruning. As shown in the Figure 4.2(a) , we find significant causal effects of CAD on COVID-19 using our RBMR method ( β = 0.2242, p-value = 0.0273, 95% CI = (0.0834, 0.3648)), MR-LDP ( β = 0.2197, p-value = 0.0325, 95% CI = (0.0773, 0.3621)), RAPS ( β = 0.1835, p-value = 0.0412, 95% CI = (0.0074, 0.3596)) and MR-Egger ( β = 0.2787, p-value = 0.012, 95% CI = ( 0.0607, 0.4967)). However, the results of GSMR ( β = 0.1343, p-value = 0.0558, 95% CI = (-0.0034, 0.2719)) and IVW ( β = 0.1343, p-value = 0.0700, 95% CI = (-0.0111, 0.2796)) should be abolished (p-value > 0.05). We can see that our method gives a more robust estimate than other competing ones. In contrast, MR-Egger might overestimate and the IVW, GSMR and RAPS might underestimate the true causal effect. Although MR-LDP and RBMR give similar point estimate, however, our RBMR is more accurate as its confidence interval is slightly shorter and its p-value is more significant.

In this paper, we propose a novel two-sample robust MR method RBMR by accounting for the LD structure, systematic pleiotropy and idiosyncratic pleiotropy simultaneously in a unified framework. Specifically, we propose to use the more robust multivariate generalized t-distribution rather the less robust Gaussian distribution to model the direct effects of the IV on the outcome not mediated by the exposure. Moreover, the multivariate generalized t-distribution can be reformulated as Gaussian scaled mixtures to facilitate the estimation of the model parameters using the parameter expanded variational Bayesian expectationmaximum algorithm (PX-VBEM). Through extensive simulations and analysis of two real benchmark data sets, we found that our method outperforms the other competing methods.

We also found that CAD is associated with increased risk of COVID-19 outcome using our RBMR method.

We make the following two major contributions. First, our method can account for the LD structure explicitly and thus can include more SNPs to reduce bias and increase estimation efficiency. Second, our RBMR method is more robust to the presence of idiosyncratic pleiotropy. One limitation of our proposed method is that it cannot handle correlated pleiotropy where the direct effect of the IV on the outcome might be correlated with the IV 17 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted March 5, 2021. ; https://doi.org/10.1101/2021.03.02.21252801 doi: medRxiv preprint strength. We leave it as our future work.

Gaussian scale mixture models for robust linear multivariate regression with missing data

On some characterizations of the tdistribution

Variational algorithms for approximate Bayesian inference

Approximately independent linkage disequilibrium blocks in human populations

Variational inference: A review for statisticians

Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak

Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression

Mendelian randomization analysis with multiple genetic variants using summarized data

MR-LDP: a two-sample mendelian randomization for gwas summary statistics accounting for linkage disequilibrium and horizontal pleiotropy

An integrated map of genetic variation from 1,092 human genomes

A genome-wide association study in europeans and south asians identifies five new loci for coronary artery disease

IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies

Maximum likelihood from incomplete data via the em algorithm

Mendelian randomization: can genetic epidemiology help redress the failures of observational epidemiology?

Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk

Mining the human phenome using allelic scores that index biological intermediates

Mendelian randomization: new applications in the coming age of hypothesis-free causality

Generalized elliptical distributions: theory and applications

Estimation with many instrumental variables

MR-Base: a platform for systematic causal inference across the phenome using billions of genetic associations

The covid-19 host genetics initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the sars-cov-2 virus pandemic

Multivariate t-distributions and their applications

Mendelian randomization: using genes as instruments for making causal inferences in epidemiology

Early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia

Parameter expansion to accelerate em: the px-em algorithm

Genetic studies of body mass index yield new insights for obesity biology

The new NHGRI-EBI Catalog of published genome-wide association studies (gwas catalog)

Instrumental variables: application and limitations

Detection and interpretation of shared genetic influences on 42 human traits

PLINK: a tool set for wholegenome association and population-based linkage analyses

Positive definite estimators of large covariance matrices

Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease

Pleiotropy in complex traits: challenges and strategies

The covid-19 pandemic: A global natural experiment

Asymptotic Statistics

Detection of widespread horizontal pleiotropy in causal relationships inferred from mendelian randomization between complex traits and diseases

Inadequacy of interval estimates corresponding to variational bayesian approximations

LPG: A fourgroup probabilistic approach to leveraging pleiotropy in genome-wide association studies

CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies

Bayesian weighted mendelian randomization for causal inference based on summary statistics

Statistical inference in two-sample summary-data mendelian randomization using robust adjusted profile score

Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

Causal associations between risk factors and common diseases inferred from gwas summary data