title: Optimal Ensemble Construction for Multi-Study Prediction with Applications to COVID-19 Excess Mortality Estimation
authors: Loewinger, Gabriel; Nunez, Rolando Acosta; Mazumder, Rahul; Parmigiani, Giovanni
date: 2021-09-19

It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets and applying standard statistical learning methods can result in poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multi-study ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multi-study ensembling uses a two-stage stacking strategy which fits study-specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model-fitting stage, potentially resulting in a loss of efficiency. We therefore propose optimal ensemble construction, an all-in-one approach to multi-study stacking whereby we jointly estimate ensemble weights as well as parameters associated with each study-specific model. We prove that limiting cases of our approach yield existing methods such as multi-study stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the proposed loss function. We compare our approach to standard methods by applying it to a multi-country COVID-19 dataset for baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. Importantly, our approach outperforms multi-study stacking and other standard methods in this application. We further characterize the method's performance in simulations. Our method remains competitive with or outperforms multi-study stacking and other earlier methods across a range of between-study heterogeneity levels.

Statistical methods to leverage information from different studies and sources are critical for generating prediction algorithms that are replicable across populations and experimental settings. The increasing availability of biomedical databases in, for example, neuroimaging (OpenNeuro (openneuro.org)), genetics (e.g., GWAS Catalogue (Buniello and others, 2019)), HIV (Stanford HIV database (Ramon and others, 2019)) and cancer genomics (Ganzfried and others, 2013; Edgar and others, 2002) has resulted in increased interest in the utilization of data from multiple sources when training prediction algorithms. This is further motivated by the observation that prediction performance assessed within-study (e.g., with cross validation) is often far more optimistic than the performance evaluated out-of-study (Ma and others, 2014; Chang and Geman, 2015; Bernau and others, 2014). Despite promising initial progress, much remains to be investigated on how to combine information across datasets or "studies" in a manner that promotes model generalizability. A simple, but at times powerful, approach is to merge datasets before model fitting. However, this can fail to capture the heterogeneity across datasets and can therefore perform poorly when studies are heterogeneous.
In this case different models for each study may lead to better fits, and their ensemble to better out-of-study predictions (Patil and Parmigiani, 2018; Guan and others, 2020). To account for the heterogeneity, one approach is to pre-process the studies with data harmonization techniques before model fitting (Da-ano and others, 2020; Zhang and others, 2020). While this is generally helpful, it can be insufficient and can potentially remove genuine signal (Nygaard and others, 2015; Cetin-Karayumak and others, 2020).

Multiple related sub-fields of transfer learning (Kouw and Loog, 2019) contain a rich body of methodologies geared towards the multi-study problem specifically, as well as the "dataset shift" issue more generally. Broadly speaking, these methods aim to deal with variation in the distribution of the training and test dataset(s). "Concept shift" refers to changes in the conditional probability of labels given predictors, while "covariate shift" describes discrepancies in the joint distribution of the covariates between training and test data. "Hybrid shift" describes the case when both concept and covariate shift occur (Kouw and Loog, 2019; Yang and others, 2020). Our motivating application can be described as a multi-source transfer learning problem (Chen, 2021; Kouw and Loog, 2019). These are settings where the modeler has a small sample of labeled training data drawn from the distribution of interest, but seeks to enhance prediction performance by borrowing labeled data from other related studies, or "domains." Typically these auxiliary data sources share common covariates and outcomes, and their distributions are thought to be similar to the distribution of the data of the target study. This is closely related to multi-source domain adaptation, which proposes strategies that use information from multiple datasets but also leverage the covariates of the test set to tailor the model to the target domain (Kouw and Loog, 2019; Sun and others, 2015). The domain generalization literature focuses on leveraging multiple datasets in model training to improve model generalizability, that is, to enhance prediction performance on a new, unknown, but related "domain" (Wang and others, 2021). Our methods are inspired both by this vast literature in transfer learning and by related methods in "multi-study" statistics that aim to draw upon multiple data sources in inference (Guo and others, 2021; Rashid and others, 2020), supervised prediction (Loewinger and others, 2021; Guan and others, 2020; Ramchandran and others, 2020; Ren and others, 2020), and unsupervised learning (De Vito and others, 2019; Roy and others, 2019).

Previous work in this area proposed multi-study ensembling in conjunction with a generalization of stacking, an ensemble weight estimation method (Breiman, 1996), as a flexible strategy to aggregate information from different studies (Patil and Parmigiani, 2018; Ramchandran and others, 2020; Loewinger and others, 2021; Ren and others, 2020; Guan and others, 2020). The approach involves two separate stages: A) training one or more models on each study separately, and B) constructing an ensemble prediction rule that is a weighted average of the predictions from each of the study-specific models. The ensemble weights are estimated in stage B through "multi-study stacking" (MSS), by regressing the outcome of all training studies against the predictions of each of these models, using all available training data sets.
Heuristically, this method rewards cross-study prediction performance by determining the ensemble weight associated with a model fit on one study based upon how well that model predicts across all the training data sets. To formalize, we assume we have K training studies and we denote the outcome vector of training study k as y_k ∈ R^{n_k} and its design matrix as X_k ∈ R^{n_k × (p+1)}. In addition to p predictors, X_k includes a column of ones for an intercept. We let N = Σ_{k=1}^{K} n_k. Multi-study stacking as originally proposed consists of two stages. In stage A) we separately fit models $\hat{Y}_k(\cdot)$ on data from the k-th study, for every k. In stage B) we compute the predictions made by these models at the observed covariates of every study, $\hat{Y}_k(X_{k'})$ for all pairs k, k'. Define y ∈ R^N as the stacked outcome vector $y = (y_1^T, \ldots, y_K^T)^T$, and $\tilde{X} \in R^{N \times K}$ as the matrix whose k-th column stacks the predictions of model k on all training studies, $(\hat{Y}_k(X_1)^T, \ldots, \hat{Y}_k(X_K)^T)^T$. We then fit a stacking regression to predict y given $\tilde{X}$ using, for example, non-negative least squares (NNLS) or Ridge regression, typically with an intercept term. The resulting coefficients serve as the weights of the ensemble prediction and are denoted by the (K+1)-dimensional vector $\hat{w}$. At data point x the prediction rule is $\hat{Y}^{Stack}(x) = \hat{w}_0 + \sum_{k=1}^{K} \hat{w}_k \hat{Y}_k(x)$.

Theoretical work has described settings in which this type of ensembling is expected to outperform merging data sets before model fitting, here labeled "Trained-on-the-Merged dataset" (ToM). For example, Guan and others (2020) compare the ToM approach to ensembling when the learner is either linear or Ridge regression, under a linear mixed effects data generating model in which the true regression coefficients vary with k. They show that when between-study heterogeneity, as measured by the variance of the random regression coefficients, is sufficiently high, ensembling is expected to outperform ToM.

Subsequent work generalized multi-study ensembling and formalized two approaches: generalist stacking (described above) for domain generalization settings, in which we aim to train models that generalize to an unseen study, and specialist stacking for multi-source transfer learning settings, in which we aim to tailor the ensemble to predict well on a target study for which labeled data is available at the time of model training (Ren and others, 2020). For example, say we wish to engineer the ensemble to predict well on future data that follows the same distribution as data from training study k*, where k* ∈ [K] and [K] denotes the set {1, 2, ..., K}. This can be accomplished by regressing $y_{k^*}$ onto $\tilde{X}_{k^*} \in R^{n_{k^*} \times K}$, the matrix whose k-th column contains the predictions $\hat{Y}_k(X_{k^*})$. By setting up the stacking regression in this manner, specialist stacking estimates ensemble weights appropriate for data similar to training set k*. A similar architecture for stacking in transfer learning settings has been proposed previously (Yuankai, 2019). Ren and others (2020) further proposed a "no data reuse" (NDR) specialist stacking procedure that excludes data used in stage A from the stage B stacking. Briefly, this is accomplished by dividing the training data for the target study into L folds. Taking L = 2 for ease of explanation, let the data in the two folds be $\{y^1_{k^*}, X^1_{k^*}\}$ and $\{y^2_{k^*}, X^2_{k^*}\}$. Then train model $\hat{Y}^1_{k^*}(\cdot)$ on $\{y^1_{k^*}, X^1_{k^*}\}$ and a separate model, $\hat{Y}^2_{k^*}(\cdot)$, on $\{y^2_{k^*}, X^2_{k^*}\}$. Then when fitting the stacking regression, data reuse is avoided by using $\hat{Y}^2_{k^*}(\cdot)$ to make predictions on $\{y^1_{k^*}, X^1_{k^*}\}$ and vice versa. Specifically, the stacking procedure involves regressing $y_{k^*}$ onto a version of $\tilde{X}_{k^*}$ in which the rows from fold 1 contain predictions from $\hat{Y}^2_{k^*}(\cdot)$ and the rows from fold 2 contain predictions from $\hat{Y}^1_{k^*}(\cdot)$. In practice this is implemented with more than two folds.
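To make the two-stage procedure concrete, the following is a minimal sketch, not the authors' released code, of generalist multi-study stacking with Ridge single-study learners and NNLS stacking weights; the specialist variants described above differ only in which rows and which fitted models enter the stage B regression.

```python
# Minimal sketch of two-stage (generalist) multi-study stacking.
# Assumptions: Ridge SSLs and NNLS stacking with an intercept column;
# the penalty strength and intercept handling are illustrative choices.
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import Ridge

def fit_ssls(Xs, ys, ridge_penalty=1.0):
    """Stage A: fit one single-study learner per training study."""
    return [Ridge(alpha=ridge_penalty).fit(X, y) for X, y in zip(Xs, ys)]

def stack_generalist(ssls, Xs, ys):
    """Stage B: regress the pooled outcome on each SSL's predictions
    over all training studies, with non-negative weights (NNLS)."""
    y_all = np.concatenate(ys)
    # Column k stacks model k's predictions on every training observation.
    preds = np.column_stack(
        [np.concatenate([m.predict(X) for X in Xs]) for m in ssls])
    # Prepend a column of ones so the stacking regression has an intercept.
    # Simplification: NNLS also constrains this intercept to be non-negative.
    design = np.column_stack([np.ones(len(y_all)), preds])
    w, _ = nnls(design, y_all)
    return w  # w[0] is the intercept, w[1:] are the K ensemble weights

def predict_ensemble(ssls, w, X_new):
    preds = np.column_stack([m.predict(X_new) for m in ssls])
    return w[0] + preds @ w[1:]
```

For specialist stacking, stage B would instead use only the rows of the target study k* (and, in the no-data-reuse versions, predictions from models not trained on those rows).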
After the estimation of ensemble weights, a third stage is implemented in which a final model for study k*, $\hat{Y}_{k^*}(\cdot)$, is fit using $\{y_{k^*}, X_{k^*}\}$ (i.e., all available data for that study). This model is then used in the final prediction rule, $\hat{Y}^{Stack}(x) = \hat{w}_0 + \sum_{k=1}^{K} \hat{w}_k \hat{Y}_k(x)$.

In this paper we propose a simpler version of no data reuse specialist stacking, which does not require splitting the data into folds and is amenable to a simple optimization formulation. In stage A), this algorithm trains single-study learners on all studies except the target study k*. In stage B), it only uses the target training data $\{y_{k^*}, X_{k^*}\}$, regressing $y_{k^*}$ onto the matrix of predictions $\hat{Y}_k(X_{k^*})$ for k ≠ k*. Stage A) builds an ensemble that captures the heterogeneity of data distributions. Stage B) specializes the weights of the ensemble to perform well on the target study. We refer to this variant as "no-data-reuse specialist stacking" throughout the manuscript. The multi-study ensembling procedures described above are flexible and easy to implement, but one potential shortcoming is that ensemble weights and study-specific model parameters are estimated separately in each stage. These approaches may therefore lose efficiency because the ensemble properties are ignored in stage A). The main thrust of this work is to integrate both stages into a single and explicit optimization framework. We propose a method based on multi-study ensembling and stacking that seeks to improve prediction performance by jointly optimizing model parameters and ensemble weights. We term this "all-in-one" approach optimal ensemble construction (OEC). We formalize it below as a convex combination of the sum of study-specific model losses (e.g., study-specific mean squared errors) and the stacking regression loss function. As an initial foray into this area, our work investigates linear models. We show that in this case the OEC yields standard stacking and the ToM as special cases of our framework. While we focus on linear models in this work, we note that our framework is general and may be applied to other predictive models as well.

Our work was initially motivated by the estimation of COVID-19 related mortality. International comparisons of COVID-19 specific death counts suffer from important limitations. For example, due to heterogeneity in health systems, the criteria for what constitutes a COVID-19 death differ across countries (Karanikolos and others, 2020). In part, these criteria depend on the testing capacity of the jurisdiction, which can be sub-optimal in underdeveloped countries (Max Roser and Hasell, 2020). Hence, excess mortality is regarded as the gold-standard method of estimating the direct and indirect effects of the COVID-19 pandemic (Beaney and others, 2020); it is calculated by subtracting the number of deaths that would have been expected to occur in a locality (e.g., based on historical trends) from the number of observed deaths. In many countries, infrastructure to monitor mortality before and during a disaster is not yet fully established (Karlinsky and Kobak, 2021; Fottrell, 2009). Such countries may have limited data to estimate baseline mortality before the onset of a disaster, such as a pandemic, and hence the resulting estimates of excess mortality may be unreliable. For instance, Karlinsky and Kobak (2021) contacted officials in a host of countries to collect international mortality data and produce a database for the comparison and assessment of public health strategies.
Many officials indicated that no data or very limited data were available prior to the onset of COVID-19. Although only a minority of nations worldwide provided any public data, Karlinsky and Kobak (2021) quoted responses from health officials in countries such as Argentina, China, India, Liberia and Vietnam to emphasize that many countries do not have the resources to collect sufficient mortality data. The following quote from a Liberian health official speaks volumes about this ongoing challenge (Karlinsky and Kobak, 2021): "Unfortunately, we do not have a mechanism in place at the moment to capture routine mortality data in-country nation-wide...As you may also be aware, death or mortality registration or reporting is yet a huge challenge in developing countries..." Even among countries that collected pre-pandemic data, samples were often limited to as little as one year of pre-pandemic data (e.g., South Africa) (Karlinsky and Kobak, 2021; Bernard and others). Further, some countries provided only quarterly or monthly data (e.g., Iran, Taiwan), which may not provide the temporal resolution needed for precise assessment of public health responses.

This motivates the implementation of multi-study methods to borrow information across countries to create better baseline mortality estimates. Leveraging information from other countries' data may improve baseline mortality estimates for the target country because shared latent factors (e.g., geographic, health systems, weather patterns) may have similar impacts on mortality trends across countries. While these factors may not be comparable across all observed countries, there may exist clusters of countries across which borrowing information is useful. Identifying a priori which countries to borrow information from is difficult, as even natural heuristics such as geographic proximity may be unreliable. We therefore sought to develop methods that identify, in a data-driven fashion, which countries provide auxiliary data that improves the estimation of baseline mortality for a target country. To this end, we compare the prediction capabilities of the proposed methods against conventional approaches by using a multi-country dataset of weekly death counts for 38 countries (i.e., K = 38 studies) (Jdanov and others), where we aim to predict baseline mortality in a country of interest.

We next build on the notation introduced earlier and specify distributional assumptions. We assume we have K observed studies available for model training. The pair of outcomes and covariates for study k is {y_k, X_k}. The merged dataset has outcome variable y ∈ R^N and design matrix X ∈ R^{N × (p+1)}, where N = Σ_k n_k. We assume that the outcome y_{k,i} for observation i in study k is conditionally distributed as $y_{k,i} \mid x_{k,i} \sim f_{y_k \mid x_k}$, and that marginally the covariates are distributed as $x_{k,i} \sim f_{x_k}$. We assume that the conditional distributions of the outcome given the covariates may differ across studies. We explore both settings where the marginal distributions of the covariates, $f_{x_k}$, differ across studies, and settings where they are approximately the same. In the two-stage multi-study stacking method discussed in Section 1.2, we distinguish between the model parameters (e.g., the coefficients in a linear model) and the ensemble weights used to aggregate predictions from different models.
We refer to statistical learning methods (e.g., linear models, which are our focus here) fit on a single study as single-study learners (SSLs). We denote the model parameters associated with a fit to the k-th study as γ_k ∈ R^{p+1}. For example, in the squared ℓ2-penalized linear regression setting, $\hat{\gamma}_k$ is the solution to $\min_{\gamma} \|y_k - X_k\gamma\|_2^2 + \lambda_k \|D_r\gamma\|_2^2$, where D_r ∈ R^{r×r} is a diagonal matrix with the (1,1) entry equal to zero and all other diagonal entries equal to one, to prevent regularization of the intercept. We denote the estimates of model parameters obtained from a fit to the merged dataset as $\hat{\gamma}^{ToM}$. For example, in the squared ℓ2-penalized linear regression setting, $\hat{\gamma}^{ToM}$ is a solution to $\min_{\gamma} \|y - X\gamma\|_2^2 + \lambda \|D_r\gamma\|_2^2$.

We refer to variations of two-stage multi-study methods as "generalist" or "specialist," depending on the goal. For specialist methods, we further indicate whether we used a "no data reuse" variation. We abbreviate the multi-study stacking generalist as MSS_G, the specialist as MSS_S and the no data reuse specialist as MSS_SN. We denote the corresponding ensemble (stacking) weights associated with a model fit to study k as w^G_k, w^S_k or w^SN_k. We do not need to distinguish between model parameters for different MSS procedures because they are equal for all MSS methods; only the ensemble weights, estimated in stage B), differ. We refer to the two-stage multi-study stacking methods collectively as MSS methods. Similarly, we collectively refer to the optimal ensemble approaches as OEC methods. We now introduce the OEC counterparts of MSS_G, MSS_S and MSS_SN, which we refer to as OEC_G, OEC_S and OEC_SN, respectively. We consider a single SSL per study. To avoid confusion with the two-stage multi-study stacking of Section 1.2, we denote the model parameters associated with the SSL fit to data from study k as β^G_k, β^S_k, β^SN_k ∈ R^{p+1} and the ensemble weights by α^G_k, α^S_k and α^SN_k, respectively. We also include ensemble intercepts, which we do not constrain to be non-negative and which we denote α^G_0, α^S_0, α^SN_0. We denote the corresponding vectors, which include both the study-specific ensemble weights and the intercept, by α^G, α^S ∈ R^{K+1} and α^SN ∈ R^K. We denote by B^G, B^S ∈ R^{(p+1)×K} and B^SN ∈ R^{(p+1)×(K-1)} the matrices of study-specific model coefficients. For example, column k of B^G is equal to β^G_k. The SN parameters α^SN and B^SN have one fewer entry and one fewer column, respectively, because they do not contain components corresponding to study k*, as we will further clarify later. Here we focus on the linear-linear OEC, where the optimal coefficients for the SSLs and the ensemble weights are given by solutions to the following joint optimization formulations.

Generalist OEC (OEC_G): We begin with the generalist OEC loss function (1), defined in the linear-linear case with dimensions as specified above and with µ and the λ_k's as regularization parameters for the corresponding regression models. The top term in (1) is the loss for the (penalized) non-negative least squares regression used to calculate ensemble weights, and the bottom term in (1) is the sum of study-specific Ridge-regularized least squares loss functions. The overall objective function is a convex combination of these two terms: the weight η controls whether the optimization procedure will seek better study-specific SSL fits (low η) or fits of the SSLs that better contribute to the ensemble as a whole (high η).
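As a concrete sketch, one form of the generalist objective consistent with the description above (the exact display, including how η, µ and the λ_k's enter, is an assumption here) is

$$
\min_{\substack{\alpha^G_0 \in \mathbb{R},\; \alpha^G_1, \ldots, \alpha^G_K \ge 0,\\ \beta^G_1, \ldots, \beta^G_K \in \mathbb{R}^{p+1}}}
\; \eta \left[ \Big\| y - \alpha^G_0 \mathbf{1}_N - \sum_{k=1}^{K} \alpha^G_k X \beta^G_k \Big\|_2^2 + \mu \sum_{k=1}^{K} (\alpha^G_k)^2 \right]
+ (1 - \eta) \sum_{k=1}^{K} \left[ \big\| y_k - X_k \beta^G_k \big\|_2^2 + \lambda_k \big\| D_r \beta^G_k \big\|_2^2 \right],
$$

with η ∈ (0, 1). The first bracket is the penalized non-negative least squares stacking loss over all N training observations (each $X\beta^G_k$ stacks the predictions of SSL k on every study), and the second bracket is the sum of study-specific Ridge losses; the specialist formulations restrict the first bracket to the rows belonging to the target study k*.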
Specialist OEC (OEC_S): The OEC_S formulation is similar, except that only data from the target study, k*, enters the portion of the objective function involving ensemble weights; we refer to the resulting optimization formulation as (2). No Data Reuse Specialist OEC (OEC_SN): Formulation (2) can be modified to yield the OEC_SN, which does not include a study-specific learner for the target study k* and determines parameters and weights by solving the corresponding joint optimization problem. The parameters β^SN_{k*} and α^SN_{k*} are not included in the objective function or in the decision variables. Throughout, we denote by $\hat{\alpha}$ and $\hat{\beta}$, with appropriate subscripts and superscripts, solutions of these optimization problems. We obtain these estimates by applying the block coordinate descent method described in Supplemental Figure 7 to the corresponding optimization problem.

The loss function associated with the OEC_SN differs from the no data reuse specialist stacking objective proposed in Ren and others (2020). We implemented this modification because the algorithm proposed by Ren and others (2020) does not lend itself to an OEC form that rules out data reuse. While that approach properly excludes data reuse when implemented as a two-stage procedure, a direct generalization as a joint optimization approach would not prevent reuse, because the different folds of the target study data, and the corresponding model coefficients, would be tied through the shared estimation of the ensemble weight w_{k*}.

Throughout the paper, we distinguish between generalist and specialist algorithms. For this reason, when fitting a model or ensemble to make predictions on some new, unseen dataset (denoted here as study K + 1), we compare the performances of MSS_G, OEC_G, and ToM. When we aim to make predictions on a new observation in a target study for which we have limited training data (study k*), we compare the MSS_S with the OEC_S and the MSS_SN with the OEC_SN. We also compare these last four methods to a study-specific model (SSM), fit only to the training data of the target study, to examine whether borrowing information is beneficial for prediction performance. The main methodological question we examine here is whether jointly optimizing ensemble weights and linear model parameters improves prediction performance over two-stage multi-study stacking. We examine this for generalist and specialist multi-study stacking. The possible ensembling methods can be described by a 2 × 2 table: OEC/MSS × specialist/generalist. For specialist stacking, we also examine performance with and without data reuse.

The following observations provide insight into the properties of the OEC framework, as they show that, if both the SSL and the ensemble weight component of the loss are linear models with no regularization on model parameters or ensemble weights, then the ToM, SSM and two-stage multi-study stacking methods are special cases. We present the results for the OEC_G, OEC_S and OEC_SN cases specifically. Proofs can be found in Supplemental Section 12. The optimization problems corresponding to the OEC_G, OEC_S and OEC_SN with no regularization on model parameters or ensemble weights are denoted (4), (5) and (6), respectively; they are obtained from the regularized formulations by setting the regularization parameters to zero. Proposition 1 provides insights into the optimal solutions of the OEC by characterizing how these solutions relate to parameter estimates from earlier methods.

Proposition 1. Case 1.1 (OEC_G). Let $(\hat{\alpha}^G(\eta), \hat{B}^G(\eta))$ be optimal solutions to optimization problem (4). Then as η → 1−, $(\hat{\alpha}^G(\eta), \hat{B}^G(\eta))$ will converge to solutions of a limiting problem equivalent to fitting the ToM linear model. Case 1.2 (OEC_S). Let $(\hat{\alpha}^S(\eta), \hat{B}^S(\eta))$ be optimal solutions to optimization problem (5).
Then as η → 1−, $(\hat{\alpha}^S(\eta), \hat{B}^S(\eta))$ will converge to solutions of a limiting problem equivalent to fitting a linear model only to the target study k*. Case 1.3 (OEC_SN). Let $(\hat{\alpha}^{SN}(\eta), \hat{B}^{SN}(\eta))$ be optimal solutions to optimization problem (6). Then as η → 1−, $(\hat{\alpha}^{SN}(\eta), \hat{B}^{SN}(\eta))$ will likewise converge to solutions of a limiting problem that reproduces the fit of a linear model to the target study only.

Case 1.1 implies that in the generalist setting, when η → 1−, optimal OEC_G estimates yield fitted values equal to those produced by the ToM linear model (without regularization). When X is of full rank, the equivalence can be stated in terms of the parameter estimates directly (and thus in terms of predictions made using these solutions): at any point x, $x^T\hat{\gamma}^{ToM} = \hat{\alpha}^G_0 + \sum_k \hat{\alpha}^G_k\, x^T\hat{\beta}^G_k$. Case 1.2 states that in specialist settings, when η → 1−, optimal OEC_S solutions yield fitted values equal to those produced by a linear model fit only to the target study (i.e., study k*). We next characterize the optimal OEC solutions when η is taken to the opposite extreme.

Proposition 2. Case 2.1 (OEC_G). Let $(\hat{\alpha}^G(\eta), \hat{B}^G(\eta))$ be optimal solutions to optimization problem (4). Then as η → 0+, $(\hat{\alpha}^G(\eta), \hat{B}^G(\eta))$ will converge to a solution of a limiting problem equivalent to two-stage MSS_G. Case 2.2 (OEC_S). Let $(\hat{\alpha}^S(\eta), \hat{B}^S(\eta))$ be optimal solutions to optimization problem (5). Then as η → 0+, $(\hat{\alpha}^S(\eta), \hat{B}^S(\eta))$ will converge to a solution of a limiting problem equivalent to two-stage MSS_S. Case 2.3 (OEC_SN). Let $(\hat{\alpha}^{SN}(\eta), \hat{B}^{SN}(\eta))$ be optimal solutions to optimization problem (6). Then as η → 0+, $(\hat{\alpha}^{SN}(\eta), \hat{B}^{SN}(\eta))$ will converge to a solution of a limiting problem equivalent to two-stage MSS_SN.

Proposition 2 shows that as η → 0+, the optimal OEC solution yields fitted values equal to those of its MSS counterpart; that is, MSS_G, MSS_S and MSS_SN are special cases of their corresponding OEC counterparts, arising when η → 0+. For example, in the generalist case, if each of the X_k is of full rank, then as η → 0+ the OEC_G parameters converge to the corresponding two-stage estimates, $\hat{\beta}^G_k(\eta) \to \hat{\gamma}_k$ and $\hat{\alpha}^G(\eta) \to \hat{w}^G$. Together, these propositions demonstrate that in the linear-linear setting, without regularization, the OEC encompasses important special cases and earlier multi-study methods: MSS_G is a special case of OEC_G, arising when η → 0+, while pooling datasets with the ToM approach is a special case arising when η → 1−. Similarly, MSS_S is a special case of the OEC_S, arising when η → 0+, while a study-specific model is obtained when η → 1−. These propositions examine behavior at the extremes, and suggest that, for intermediate values of η, the OEC can be viewed in the generalist case as a trade-off between MSS_G and pooling datasets together before model fitting (i.e., the ToM). In the specialist case, the propositions suggest that the OEC can be framed as a trade-off between MSS_S and fitting a model only on the target study. In our motivating application, we visualize this trade-off in the no data reuse specialist case by showing how the predicted values of the OEC vary smoothly between the predictions of the study-specific model and those of the MSS_SN as a function of η ∈ (0, 1). Across the range of η, the OEC allows for a spectrum of solutions that differ in both the manner and the degree to which information is borrowed from other sources.

We consider a data source recently developed by the Human Mortality Database, called the Short-term Mortality Fluctuations (STMF) data series (Jdanov and others). The STMF data series provides weekly death counts and death rates for 38 countries, as outlined in Supplementary Table 7. We denote the reported death rate for country k at week t by $\tilde{Y}_{t,k}$. STMF calculates it as the observed weekly death count, C_{t,k}, divided by an annual measure of population size, N_{l,k}; that is, $\tilde{Y}_{t,k} = C_{t,k}/N_{l,k}$ when week t occurs during year l.
As a result, these death rates exhibit artificial discontinuities at the start of each calendar year that are an artifact of changing the annual population size from N_{l,k} to N_{l+1,k}. To avoid this, we followed the strategy outlined in US Census Bureau (2012). We calculated the mid-year population size, N_{l,k} (i.e., the population count on July 1 of year l), from the death rates and death counts provided by STMF. We then linearly interpolated population size by regressing the annual population estimates N_{l,k} against year l, and used the coefficient estimates from this regression, evaluated at the date of week t, to calculate weekly estimates of population size, $\hat{N}_{t,k}$. Finally, we calculated annualized death rates (per 1,000 individuals) as $Y_{k,t} = 1{,}000 \times 52 \times C_{t,k} / \hat{N}_{t,k}$.

We only used test countries in the Northern hemisphere, since the model was based on seasonal effects modeled by calendar date. There are only three countries from the Southern hemisphere in STMF (Australia, New Zealand and Chile), a set too small to fit a model that borrows information across countries without poor performance. We did, however, allow the OEC methods to leverage information from countries in the Southern hemisphere when training models tailored to countries in the Northern hemisphere. For any given test year, we required countries to have at least 2 years of data at that point to serve as auxiliary sources (i.e., countries from which we can borrow information for a given test country/year). We started testing in 2003 because prior to 2003 there was little data and excessive year-to-year variability in the number of training countries.

It is common to model expected death counts as a function of seasonal and secular trends in mortality. We modeled annual death rates, Y_{k,t}, with a linear model (3) in which γ_{k,1} represents a linear effect of time, to capture secular trends in mortality, and γ_{k,2:p} represent the effects of Fourier basis functions of time, to capture seasonal changes. Model (3) can be represented in the familiar linear form, where the elements of x_{k,t} simply contain basis expansions of time. One could adopt, for example, a Poisson likelihood for the conditional distribution of the death counts given the covariates (i.e., basis expansions of time). However, weekly death counts were in the hundreds for all countries except Luxembourg and Iceland, and most countries recorded weekly counts in the thousands. Since the counts were high, we expect a Gaussian distribution to provide a good approximation to a Poisson. Since we sought to borrow information across countries, we modeled rates instead of counts to make the outcome more comparable across the countries in the database. We opted for model (3), assuming that y_{k,t} | x_{k,t} follows a Gaussian distribution instead of a Poisson, for computational convenience. In Supplementary Figures 10-12 we show the results of sensitivity analyses in which we omit the linear effect of time, γ_{k,1} t, from model (3).

Excess mortality in country k at week t can be defined as the observed mortality minus the mortality expected in the counterfactual world in which COVID-19 never existed. Estimating excess deaths therefore requires a prediction of mortality in the counterfactual world where a disaster did not occur. In practice, this prediction is the expected mortality in a particular week given historical data (i.e., all data before the onset of the disaster). Unfortunately, prediction performance in such settings is unverifiable.
Hence, to assess whether our methods are useful for counterfactual prediction, we trained models and tested their performance on historical data where no pandemic occurred, so that a ground truth was known. Heuristically, the rationale is that if the baseline model has high prediction accuracy on historical data, we expect it would also predict counterfactual mortality well. We tested predictors generated using the OEC_S and the OEC_SN against their MSS_S and MSS_SN counterparts. We also implemented an SSM (which we refer to interchangeably as a "country-specific model" here), in which we did not borrow information from other countries. In all approaches, we used model (3) as the single-study learner.

We sought to evaluate prediction performance for all countries meeting the inclusion criteria in the database across a range of training/testing years. We assessed our question first in the STMF dataset using a time-series hold-one-country-out testing procedure. That is, we trained models on a subset of historical data (i.e., a subset of years), made predictions on held-out historical data, and computed performance metrics (e.g., RMSE) to quantify the quality of the predictions. We tailored our approach to assess whether our method improved performance when there exists little pre-COVID-19 training data for the target country. Specifically, for each target country and year (from 2003-2019), we artificially assumed having only one year of training data from that target country in the database. We then made predictions of mortality in that country for the following year and compared those predictions to what was observed. Our approach is illustrated in Figure 1. To fix ideas, say we are interested in making predictions for Country 1 in 2014. We would use data from Country 1 in 2014 as a test set and train a country-specific model on Country 1 data from 2013 only (i.e., simulating having only one year's worth of data). To analyze the performance when we borrowed information, we allowed the MSS_S, MSS_SN, OEC_S and OEC_SN to use data from any other country that had at least 100 weeks of training data prior to 2014 (e.g., 2011, 2012 and 2013). All data after the start of the test period was discarded, and any country that did not have enough training data was not used as an auxiliary source. All approaches were compared based upon their prediction performance on Country 1 data from 2014. We repeated this procedure for every feasible year between 2003 and 2019 and for all countries (provided the above inclusion criteria were met). The test period was set to an entire year to assess the method's estimation of both secular and seasonal trends, as well as to mimic the long-term predictions required for COVID-19 excess mortality estimation.

We ran analyses both with and without a Ridge (squared ℓ2) penalty term on the country-specific model parameters and the ensemble weight parameters. The inclusion of a penalty was kept consistent across all methods that we implemented, to ensure that any gains in performance achieved through the OEC could not have been attained through simple regularization. We centered and scaled covariates prior to model fitting. We tuned η in expression (1) using cross-validation procedures specific to the generalist and specialist settings. For the OEC_S and OEC_SN, we tuned η specifically for the target study.
We divided the training set for study k* into K folds (i.e., we set the number of folds to K, the number of training studies) and used each fold as a validation set. For the OEC_G, we tuned η with a K-fold cross-validation procedure in which the folds were constructed to be study-balanced, in order to encourage generalizability to an unseen study. The ℓ2 regularization parameter associated with the ensemble weights, µ, was tuned separately using a K-fold, study-balanced cross-validation scheme with the OEC_G algorithm, and was kept fixed across all OEC algorithms. Similarly, we tuned the ℓ2 regularization parameter associated with the stacking regression using K-fold cross-validation. We tuned the study-specific model regularization parameters associated with the Ridge penalty using a within-study (i.e., within-country) K-fold cross-validation scheme, and kept these fixed across all methods implemented (i.e., ToM, SSM, and all OEC and MSS approaches). The ℓ2 regularization parameter associated with the ToM algorithm was tuned using hold-one-study-out cross-validation (e.g., the K-th fold merged studies 1, ..., K − 1, fit a single model, and validated on study K).

Figure 1: Year 2013 from country 1 is utilized for training and year 2014 is taken as a test set; other data from country 1 is discarded for training or testing purposes. Country 2 and country 4 are used in multi-study versions. Data after the test year is discarded. Country 5 and Country 3 are not utilized since they do not have data before the test year 2014. This prediction exercise is repeated for each year and country that meet the criteria for inclusion.

We present results by test year for both the OEC_S and OEC_SN in Figure 2 and Supplemental Figure 9. We present results from models fit both with study-specific Ridge penalties (Supplemental Table 6) and without them (Table 1). Although we fit the generalist methods ToM, OEC_G and MSS_G, their performance was uniformly poor. This is not surprising given that this application involves transfer learning, and as a result we do not present those results here. The figures and tables demonstrate that borrowing information through two-stage stacking procedures (MSS_S or MSS_SN) or their OEC counterparts (OEC_S and OEC_SN) has strong benefits with or without regularization. Importantly, the OEC_S substantially outperformed both a country-specific model and the MSS_S. The OEC_SN outperformed the MSS_SN, but the improvements were more modest than those of the OEC_S over the MSS_S. The OEC_SN and MSS_SN substantially outperformed the OEC_S and the MSS_S, respectively. We also present the results relative to a country-specific model, to show the relative performance of the OEC and MSS methods, in Supplemental Figure 9.

When training data is limited, including a linear term for time to capture secular trends in a country-specific model may introduce identifiability issues: the model may struggle to distinguish between seasonal and secular trends when trained on a single year of data, even with regularization. While we showed that borrowing information alleviates this issue, we wanted to ensure that the multi-study approach was still beneficial compared to a simpler model with no linear effect of time. In the supplemental section we present results from analyses using a model with no linear term (model (4)).
We are not aware of past analyses that model mortality with seasonal trends only, but we explored these results as a sensitivity analysis to further characterize how the OEC methods improve performance over MSS methods. First, we compared an OEC with a linear trend to a country-specific model with no linear trend (Supplemental Figure 10). The OEC_SN consistently outperformed the simpler country-specific model, but the OEC_S and two-stage specialist stacking did not. This may suggest that the country-specific model struggled with identifiability problems and that the OEC_S and MSS_S may place too large an ensemble weight on the target country's model. The MSS_SN and OEC_SN avoid this problem by using only the target data to calculate ensemble weights. This is consistent with previous reports about limitations of the data-reuse approach MSS_S (Ren and others, 2020). We next compared the OEC_S and OEC_SN models fit with no linear trend to their MSS counterparts. In both, the design matrix only included Fourier basis terms. Results from this analysis can be found in Supplemental Figure 12. In this simpler modeling scheme as well, the OEC_S and OEC_SN approaches outperformed their MSS counterparts, suggesting that the OEC approach can improve the estimation of both seasonal and secular trends. Importantly, the OEC_SN and OEC_S performed comparably. This may suggest that the OEC_SN yields the greatest benefit over the OEC_S when the modeling scheme is susceptible to problems related to identifiability or over-fitting. Finally, we compared OEC_SN and OEC_S models fit with no linear term for time to a country-specific model without a linear term for time (Supplemental Figure 11). These results demonstrate that, in the simpler modeling case as well, there are strong benefits to borrowing information with the OEC approach.

The present work seeks to develop methods that improve prediction performance when training data for a target study (country) is limited. We conducted a sensitivity analysis to characterize how prediction performance varies as a function of the quantity of training data available for the target study. Supplemental Figure 8 shows that the OEC_S was associated with superior prediction performance compared to both an SSM and the MSS_S when the target study had as many as three years of training data, although the benefit decreased monotonically as the number of training observations increased. Importantly, borrowing information was never detrimental, on average, in any of the settings explored.

We conclude the application with an example demonstrating that the country-specific model and the MSS_SN are special cases of the OEC_SN that arise from limiting values of η ∈ (0, 1] (Figure 3). We selected the country and test year presented in the figure to show an example that yielded country-specific estimates and OEC_SN estimates different enough to visually depict the effect of η, but that did not overstate the utility of borrowing information with either the MSS_SN or the OEC_SN. This test year exhibited a spike in mortality counts for three consecutive outlying weeks early in the year. This type of pattern occurred commonly across countries and years and is difficult for any modeling approach to capture. As η → 0, the OEC_SN predictions approach those of the MSS_SN. As η → 1, the OEC_SN predictions approach those of the country-specific model. We next conducted data-driven simulations to explore the COVID-19 application in a setting where we could compare predictions against a ground truth.
We simulated data from a linear mixed effects model of the form $y_k = X_k\theta_k + \epsilon_k$, where the study-specific coefficient vectors θ_k are drawn from a random effects distribution with mean µ_θ and covariance σ²_θ Σ_θ, and where X_k includes a column of ones for an intercept. The covariate vector x_{k,t} includes only basis expansions of time and was constructed as described above in the COVID-19 application. We chose the parameters of the random effects distribution based on estimates from the COVID-19 data. Denoting by $\hat{\gamma}_k$ the coefficient estimates from an OLS fit to all observed data (i.e., all years prior to 2020) from country k, we define $\mu_\theta = \bar{\hat{\gamma}} = \frac{1}{K}\sum_k \hat{\gamma}_k$, the sample mean of each coefficient averaged across country-specific estimates. To explore the impact of heterogeneity in model coefficients on prediction performance, we chose multiple scenarios for σ²_θ. For Σ_θ, we used the empirical covariance matrix from the application by setting $\Sigma_\theta = \hat{\Sigma}_\gamma$, where $\hat{\Sigma}_\gamma = \frac{1}{K-1}\left(\sum_k \hat{\gamma}_k\hat{\gamma}_k^T - K\bar{\hat{\gamma}}\bar{\hat{\gamma}}^T\right)$. We assumed that the residuals of study k are distributed as $\epsilon_k \sim N_{n_k}(0, \sigma^2_k I)$. We selected the variance of the residuals, σ²_k, by sampling with replacement from the vector of estimated residual variances from each of the country-specific models, $[\hat{\sigma}^2_1, \ldots, \hat{\sigma}^2_K]^T$. This allowed us to realistically emulate the variability in noise associated with the observed COVID-19 data from each of the countries. The design matrix for study k, X_k, is constructed as in the COVID-19 application from time (in weeks): it includes a linear effect for time and 4 terms that are Fourier basis expansions of time for seasonal trends. Including the column of ones for the intercept, X_k ∈ R^{n_k × 6}. The test country was set to have 52 weeks of (weekly) training data (n_{k*} = 52) and 52 weeks of test data. We drew n_k, the number of weeks of training data for k ∈ [K] \ k*, from a discrete uniform distribution: n_k ∼ Unif(104, 517). The parameters of the uniform were selected based upon the observed sample sizes in the real dataset and to ensure that each study, for k ∈ [K] \ k*, had at least twice as many weekly observations as the target country, k*. Although the errors ε_k are i.i.d., the basis functions used in the simulation scheme induce autocorrelation in the outcome y_k, which we explored by inspection of the empirical autocorrelation function (ACF). We included a Ridge penalty in the stacking regression for the MSS methods, and likewise in the ensemble weighting component of the loss for the OEC methods. We assessed performance of the MSS and OEC methods with and without study-specific Ridge penalties. We present results with study-specific Ridge penalties in the supplement (Supplemental Figure 13; Supplemental Table 8). Results were comparable with and without the study-specific penalties.

In almost all settings explored, the OEC and MSS approaches substantially outperformed simply using a country-specific model. The OEC_S appeared to perform better than the MSS_S when the magnitude of between-study heterogeneity in the true model coefficients (σ²_θ) was lower. Conversely, the OEC_SN performed better relative to the MSS_SN when σ²_θ was higher. Both the OEC_S and OEC_SN performed better relative to their MSS counterparts for lower K, although the effect of K on performance varied as a function of the magnitude of σ²_θ. Importantly, the OEC_S and OEC_SN almost never yielded worse average performance than their MSS counterparts, and they often conferred substantial benefit. In the rare instances where the OEC exhibited worse average performance than its MSS counterpart, it was only by roughly 1%.
These results demonstrate that 1) borrowing information is almost uniformly beneficial, and 2) the OEC approaches almost always outperformed the corresponding MSS method.

Table 3: Average data-driven simulation performance without regularization at different K and σ²_θ. Performance of each method is RMSE_OEC_S / RMSE_SSM, where SSM denotes a study-specific model fit only on the target study. Monte Carlo error was at most 0.0073.

We next report on a second set of simulations whose goal is to characterize the performance of both the "generalist" and "specialist" algorithms outside of a time series setting. We simulated datasets to characterize the effect of three features of multi-study settings on the performance of our proposed methods: 1) covariate shift (heterogeneity in $f_{x_k}$ across studies), 2) concept shift (heterogeneity in $f_{y_k \mid x_k}$ across studies), and 3) study clusters across which (1) and (2) vary. We generated clusters as in Loewinger and others (2021): groups of studies that share similar distributions $f_{x_k}$ and $f_{y_k \mid x_k}$. To control both within- and between-cluster heterogeneity in $f_{y_k \mid x_k}$, we simulated the outcomes from a linear mixed effects model that included both cluster-specific and study-specific random effects, $y_{k,i} = x_{k,i}^T(\delta_c + \theta_k) + \epsilon_{k,i}$, where study k is in cluster c, δ_c ∈ R^{p+1} is a cluster-specific random effect and θ_k ∈ R^{p+1} is a study-specific random effect. We independently drew $\delta_c \mid \mu_\delta \sim N_{p+1}(\mu_\delta, \sigma^2_\delta I)$ and $\theta_{k,j} \sim \mathrm{Unif}(-\sigma^2_\delta/20, \sigma^2_\delta/20)$, so that the random effects varied across studies within a cluster by a degree proportional to the between-cluster heterogeneity. The vector of random effects δ_c is centered at the fixed effects, µ_δ ∈ R^{p+1}, where we independently drew $\mu_{\delta,j} \sim \mathrm{Unif}(-2, 2)$. We simulated the random effects to be independent of the error term: θ_k ⊥ ε_k and δ_c ⊥ ε_k. As above, X_k includes a column of ones so that the model includes an intercept. We conditionally drew $\epsilon_k \mid \sigma^2_k \sim N_{n_k}(0, \sigma^2_k I_{n_k})$, where σ²_k ∼ Unif(1, 2), inducing heterogeneity between studies in the variance of the residuals. We drew n_k ∼ Unif(150, 300) (discrete uniform), set n_{k*} = 50 (the size of the training set for the target study), and set n_test = 100 (the size of the test set for each iteration). We selected these values to ensure that n_k > n_{k*}; the sample size values were motivated by the ratio n_k / n_{k*} observed in the application. We selected K = 5, since many multi-study settings encountered in practice have few training studies available. We selected p = 20 and set 10 coefficients to 0, to model sparsity in the true model coefficients, as is relevant in many prediction settings. We simulated the covariates so that the marginal distribution of the covariates, $f_{x_k}$, differed across studies and clusters. We drew a vector of covariates for observation i of study k as $x_{k,i} \mid \mu_{X_k} \sim N_p(\mu_{X_k}, \Sigma_X)$, where Σ_X is a covariance matrix that was randomly generated (i.e., varied across simulation iterations) but held fixed across studies within an iteration, and µ_{X_k} is a study-specific vector of covariate means. To induce covariate shift, we modeled the means of the covariates in study k and cluster c as $\mu_{X_k} = \zeta_c + \tau_k$, where ζ_c, τ_k ∈ R^p. We drew cluster-specific covariate means $\zeta_c \mid \bar{\mu} \sim N_p(\bar{\mu}, \sigma^2_X I)$, and we independently drew $\tau_{k,j} \sim \mathrm{Unif}(-0.05, 0.05)$. Therefore σ²_X controls the magnitude of the heterogeneity across clusters in the means of the covariates. We independently drew $\bar{\mu}_j \sim N(5, 10)$.
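The following sketch illustrates the generative scheme just described; it is not the authors' simulation code, and choices the text does not pin down (the form of Σ_X, which coefficients are zeroed, whether N(5, 10) is parameterized by its variance, and how studies are assigned to clusters) are assumptions noted in the comments.

```python
# Illustrative generator for cluster-structured multi-study data following
# the description above; unspecified details are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulate_cluster_studies(K=6, C=3, p=20, sigma2_delta=1.0, sigma2_X=1.0,
                             n_range=(150, 300)):
    cluster_of = rng.integers(0, C, size=K)                 # assumption: random assignment
    mu_delta = rng.uniform(-2, 2, size=p + 1)               # fixed effects
    mu_delta[rng.choice(p + 1, 10, replace=False)] = 0.0    # sparsity (assumed which entries)
    # Cluster-specific coefficient random effects, centered at the fixed effects.
    delta = rng.normal(mu_delta, np.sqrt(sigma2_delta), size=(C, p + 1))
    mu_bar = rng.normal(5, np.sqrt(10), size=p)             # assumption: N(5, 10) gives mean/variance
    zeta = rng.normal(mu_bar, np.sqrt(sigma2_X), size=(C, p))  # cluster covariate means
    Sigma_X = np.eye(p)                                     # assumption: identity covariance
    studies = []
    for k in range(K):
        c = cluster_of[k]
        n_k = int(rng.integers(n_range[0], n_range[1] + 1))
        theta_k = rng.uniform(-sigma2_delta / 20, sigma2_delta / 20, size=p + 1)
        tau_k = rng.uniform(-0.05, 0.05, size=p)
        Xk = rng.multivariate_normal(zeta[c] + tau_k, Sigma_X, size=n_k)
        Xk = np.column_stack([np.ones(n_k), Xk])            # intercept column
        sigma2_k = rng.uniform(1, 2)                        # study-specific residual variance
        yk = Xk @ (delta[c] + theta_k) + rng.normal(0, np.sqrt(sigma2_k), size=n_k)
        studies.append((Xk, yk))
    return studies
```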
Thus studies differed within and across clusters in their covariate distributions. We simulated sets of studies both with and without study clusters. In the "no cluster" case, we drew each training and test study to be in a separate cluster (C = 6). In the "cluster" case, we generated three clusters with two studies per cluster (C = 3); the test study was also generated as belonging to one of the same three clusters. We simulated 100 iterations for each set of simulation parameters, each iteration consisting of a set of training studies and a test set. Each iteration therefore produced an RMSE for each method, and we show the distribution of these RMSEs across the 100 iterations in the figures below. We used both Ridge regression and OLS as the single-study learners. We tuned model parameters as in the COVID-19-driven simulations. We present figures and results without study-specific Ridge penalties in Supplemental Section 11. We present results comparing each OEC approach to the corresponding two-stage stacking approach, to investigate whether jointly training an ensemble in this framework is superior to training the ensemble with MSS. We include a subset of the results in Table 4 and the full version in Supplemental Table 14. We also present the results relative to a common baseline in Supplemental Table 13.

Table 4: Average simulation performance with (C = 3) and without (C = 6) clustering at varying degrees of covariate shift (σ²_X) and variance of the random effects (σ²_δ). Each section reports the performance of a method (e.g., OEC_G) relative to a baseline, either the ToM algorithm or the corresponding MSS method (e.g., RMSE_OEC_G / RMSE_ToM, RMSE_OEC_S / RMSE_MSS_S). Monte Carlo error was at most 0.003.

The OEC_G outperformed the MSS_G across all values of both σ²_X and σ²_δ when there was clustering in the studies. The OEC_G outperformed the ToM across all values of σ²_X and σ²_δ except for very high levels of σ²_δ, where the two algorithms were comparable. When there was no clustering, the OEC_G outperformed both the ToM and the MSS_G at lower values of σ²_δ and was comparable to both algorithms at higher values of σ²_δ. The OEC_S exhibited superior performance compared to the MSS_S in most settings explored; however, at high levels of σ²_δ and when the data exhibited no clustering, the algorithms performed comparably. The OEC_SN tended to perform better when there was no clustering and the variance of the random effects was higher. When the data exhibited clustering, the MSS_SN performed comparably to or slightly better than its OEC counterpart in all settings explored. While covariate shift appeared to influence the relative performance of the methods in the generalist setting, it did not appear to explain much of the variability in performance of the OEC in the specialist case (i.e., OEC_S and OEC_SN). At a fixed value of σ²_δ, the performance of the OEC_G relative to the MSS_G varied as a function of σ²_X, and the relative performance of the OEC_G appeared to improve as a function of σ²_X. The OEC_S, however, appeared to perform better relative to its MSS counterpart when there were lower levels of covariate shift. These results demonstrate the benefits of an all-in-one approach. The OEC_G often outperformed both the MSS_G and the ToM. While the MSS_G often performed well compared to the ToM, it suffered from cases in which it was vastly outperformed by the ToM.
For example, when σ²_δ and σ²_X were high and there was clustering in the studies (C = 3), the generalist stacking approach (MSS_G) exhibited an average RMSE about 80% higher than the ToM. This echoes frequent empirical observations as well as insights from analytical work comparing the ToM and ensembling (Ren and others, 2020; Guan and others, 2020). The OEC_G approach, however, did not suffer from cases in which it was vastly outperformed by the standard benchmarks; indeed, the OEC_G was comparable to the ToM in that simulation setting. The ToM modestly outperformed the OEC_G on average when σ²_δ was high and the studies did not exhibit clustering, but this appeared to be driven by outliers. The performance of the OEC_S and OEC_SN further demonstrates the utility of jointly estimating model parameters and ensemble weights. The OEC_S strongly outperformed the study-specific model and the MSS_S (by as much as about 15%). While the OEC_SN performed comparably to or slightly worse than the MSS_SN in some settings, importantly it never performed worse than a study-specific model. When studies did not cluster and σ²_X and σ²_δ were high, the MSS_SN exhibited an RMSE about 75% higher than a simple study-specific model; in these cases, the OEC_SN remained superior to or competitive with both the study-specific model and the MSS_SN. Taken together, these simulations demonstrate that the performance of the OEC was often comparable or superior to MSS in both generalist and specialist settings. Additionally, while MSS methods sometimes exhibited very poor performance compared to benchmark approaches, the OEC methods consistently outperformed them in these cases, highlighting the robustness of the OEC approaches.

We propose and evaluate Optimal Ensemble Construction (OEC), a flexible approach that can be used to construct ensemble learners in domain generalization and transfer learning. The OEC generalizes two-stage multi-study stacking in cases where the individual learners are finitely parameterized, by specifying an explicit loss function. The OEC improved the prediction performance of both specialist multi-study stacking (for transfer learning) and generalist multi-study stacking (for domain generalization); we observed the most consistent gains in the transfer learning setting. In our application, we showed how leveraging external sources of data through the OEC can improve the accuracy of excess mortality estimation when countries do not have sufficient pre-disaster data. Even for countries that have sufficient training data, our method may prove useful when analyzing mortality (or any other vital statistics outcome) stratified by demographic indicators or subregions of a country (Islam and others, 2021), since it is common to encounter small counts for the outcome when conducting analyses within substrata. In such cases, it may prove beneficial to borrow information from neighboring regions or other demographic groups. More broadly, using multi-study methods in counterfactual estimation may be a promising area of future research in settings where data sources for counterfactual estimation are limited. The work also adds to the growing body of literature exploring the application of ensembling methods to COVID-19-related prediction problems. For example, ensembling techniques have proven useful for predicting cases (Ray and others, 2020; Raj and others, 2020).
Recent work has proposed gradient boosted trees to directly estimate excess mortality during the COVID-19 pandemic based upon a large set of covariates, such as COVID-19 case counts and demographic variables (The Economist, 2021). The authors proposed using data from many countries during model training and motivated their methods by the need for excess mortality predictions in countries that have not made estimates available. They propose to pool data from different countries before fitting one global model to predict excess mortality given a set of country-specific covariates. This differs from our method, which generates an ensemble of country-specific models to predict baseline mortality based upon historical mortality trends.

Our method is not without its limitations. A disadvantage of combining the study-specific and ensemble weighting loss functions is that non-convexity arises from the resulting bilinear terms. As a result, the optimization approach implemented here is not guaranteed to converge to the global minimum and requires careful initialization. We found that initializing at the MSS estimates yields consistently high prediction performance in practice and appeared to outperform other heuristics such as random restarts. Since this initialization requires first computing the MSS solution, the optimization procedure will always be more computationally expensive than multi-study stacking. However, in all the settings explored, the optimization procedure converged within a couple of seconds and allowed for tuning over large grids of hyperparameter values. It is possible to obtain global solutions to the problems by using modern techniques in mixed integer programming (Bertsimas and others, 2017). Such techniques can deliver optimality certificates for the associated optimization problems, but they would likely be much slower than the methods we present here.

The OEC methods that we propose here lend themselves to future extensions. For example, we have introduced the OEC in the regularized linear-linear setting, but one could easily replace the linear models with other specifications (e.g., support vector machines) at the SSL stage, the ensembling stage, or both. Also, the loss functions could be replaced by penalized negative log-likelihoods reflecting distributions other than the Gaussian. Similarly, while the propositions focus on the linear model case, analogous results can be shown for the broader class of generalized linear models. However, we keep the analysis focused on the linear case, which is the one empirically explored in the present work. In addition, some of the results above can be recast to account for models with regularization, but these generalizations require additional regularity conditions and may not provide further insight into the connections between the OEC and earlier multi-study methods. The methods presented here can complement many methods proposed in the rich literature on transfer learning, domain generalization and multi-task learning (Zhang and Yang, 2017; Zhuang and others, 2020; Farahani and others, 2020). A number of methods proposed in these bodies of work seek to jointly train ensembles using matrix decompositions of model parameters and matrix regularization schemes. Indeed, such methods can be used in conjunction with our framework, since they focus on improving prediction performance through the model parameter estimation procedure rather than the ensemble weighting scheme. In this sense, our work complements many cutting-edge methods for multi-source data integration.
In summary, we propose a flexible generalization of multi-study stacking that yields improvements in prediction performance in both "generalist" and "specialist" implementations. We hope the present work will be beneficial both methodologically, by complementing existing methods, and in improving excess mortality estimation in low-count settings.

Code and instructions to reproduce the analyses, figures, and tables are available at: https://github.com/gloewing/OEC. The repository contains code to tune and fit penalized linear regression models with all the methods assessed here.

Denote by $f(B, w_G)$ the objective function for the OEC_G method. We describe the optimization algorithm for this case, with the understanding that the other cases are similar.

Figure 7: Block coordinate descent algorithm for optimization of the OEC loss. We terminate the algorithm when the relative difference in objective values across two successive updates of $(\beta_1, \ldots, \beta_K, w_G)$ is smaller than a threshold, or when the number of iterations exceeds 1000, whichever comes first.

Table 9: Simulation performance without regularization at different $K$ and varying degrees of $\sigma^2_\theta$. The performance of each method is relative to a country-specific model (i.e., a model fit only on the target study): $RMSE_{OEC_S}/RMSE_{SSM}$. Monte Carlo error was at most 0.007.

Table 10: Simulation performance with regularization at different $K$ and varying degrees of $\sigma^2_\theta$. The performance of each method is relative to a country-specific model (i.e., a model fit only on the target study): $RMSE_{OEC_S}/RMSE_{SSM}$. Monte Carlo error was at most 0.007.

Table 12: Simulation performance with (C = 3) and without (C = 6) clustering at varying degrees of covariate shift ($\sigma^2_X$) and variance of random effects ($\sigma^2_\delta$). No ridge penalty was included. Each section indicates the performance of a method (e.g., OEC_G) relative to a baseline of the ToM algorithm or a study-specific model (e.g., $RMSE_{OEC_G}/RMSE_{ToM}$). Monte Carlo error was at most 0.0033.

References
Excess mortality: the gold standard in measuring the impact of COVID-19 worldwide
Coronavirus tracker: the latest figures as countries fight the COVID-19 resurgence
Cross-study validation for the assessment of prediction algorithms
Certifiably optimal low rank factor analysis
Stacked regressions
The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019
Dying to count: mortality surveillance in resource-poor settings
curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome
Merging versus ensembling in multi-study machine learning
Multi-source causal inference using control variates
Excess deaths associated with COVID-19 pandemic in 2020: age and sex disaggregated time series analysis in 29 high income countries
Human Mortality Database: short-term mortality fluctuations data series (STMF)
How comparable is COVID-19 mortality across countries?
Tracking excess mortality across countries during the COVID-19 pandemic with the World Mortality Dataset
An introduction to domain adaptation and transfer learning
Hierarchical resampling for bagging in multi-study prediction with applications to human neurochemical sensing
Measuring the effect of inter-study variability on estimating prediction error
Coronavirus pandemic (COVID-19). Our World in Data
Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses
A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data
Training replicable predictors in multiple studies
Ensemble regression models for short-term prediction of confirmed COVID-19 cases
Tree-weighting for multi-study ensemble learners
HIV drug resistance prediction with weighted categorical kernel functions
Modeling between-study heterogeneity for improved replicability in gene signature selection and clinical prediction
Cross-study learning for generalist and specialist predictions
Perturbed factor analysis: improving generalizability across studies
A survey of multi-source domain adaptation
How we estimated the true death toll of the pandemic
Methodology for the intercensal population and housing unit estimates
Generalizing to unseen domains: a survey on domain generalization
A novel concept drift detection method for incremental learning in nonstationary environments
Stacking for transfer learning
Robustifying genomic classifiers to batch effects via ensemble learning
A survey on multi-task learning
A comprehensive survey on transfer learning

Acknowledgments
GCL was supported by the NIH, F31DA052153; T32 AI 007358. RA was supported by the NIH, T32ES007142. GP was supported by NSF-DMS grants 1810829 and 2113707. RM acknowledges partial research support from NSF-IIS-1718258. Conflict of Interest: None declared. No data reuse.

Figure 10: $RMSE_{OEC}/RMSE_{SSM}$. Performance of the OEC method fit with a linear term relative to a study-specific (country-specific) model with no linear term (Specialist setting; x-axis: 2010-2019).

Table 13: Simulation performance with (C = 3) and without (C = 6) clustering at varying degrees of covariate shift ($\sigma^2_X$) and variance of random effects ($\sigma^2_\delta$). A ridge penalty was included. Each column indicates the performance of a method relative to the corresponding multi-study stacking method (e.g., OEC_G vs. MSS_G indicates $RMSE_{OEC_G}/RMSE_{MSS_G}$). Monte Carlo error was at most 0.003.
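As a small aid to reading Tables 9-13 and Figure 10, the relative-RMSE metric they report can be computed as below. This is a minimal sketch of the metric's definition, not code from the analysis pipeline; a ratio below 1 favors the method in the numerator, and, for example, a ratio of 1.8 corresponds to an RMSE 80% higher than the baseline's.

```python
# Relative RMSE as reported in the tables: RMSE(method) / RMSE(baseline).
# Values < 1 mean the method outperforms the baseline on the test study.
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def relative_rmse(y_true, method_pred, baseline_pred):
    return rmse(y_true, method_pred) / rmse(y_true, baseline_pred)
```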