key: cord-0306535-ayuofz81
authors: Zang, C.; Zhang, H.; Xu, J.; Fouladvand, S.; Havaldar, S.; Cheng, F.; Chen, K.; Chen, Y.; Glicksberg, B. S.; Chen, J.; Bian, J.; Wang, F.
title: High-Throughput Clinical Trial Emulation with Real World Data and Machine Learning: A Case Study of Drug Repurposing for Alzheimer's Disease
date: 2022-02-01
journal: nan
DOI: 10.1101/2022.01.31.22270132
sha: bd3cc771ba79352f87d37c9f62888d9fcc1bfc31
doc_id: 306535
cord_uid: ayuofz81

Clinical trial emulation, the process of mimicking targeted randomized controlled trials (RCTs) with real-world data (RWD), has attracted growing attention and interest from the pharmaceutical industry in recent years. Unlike RCTs, which have stringent eligibility criteria for recruiting participants, RWD are more representative of the real-world patients to whom the drugs will be prescribed. One technical challenge for trial emulation is how to conduct effective confounding control with complex RWD so that treatment effects can be objectively derived. Many approaches, including deep learning algorithms, have recently been proposed for this goal, but there is still no systematic evaluation of them or practical guidance on their use. In this paper, we emulate 430,000 trials from two large-scale RWD warehouses, covering both electronic health records (EHR) and general claims, over 170 million patients spanning more than 10 years, aiming to identify new indications of approved drugs for Alzheimer's disease (AD). We investigate the behaviors of multiple approaches, including logistic regression and deep learning models, and propose a new model selection strategy that significantly improves the confounding balance between the arms of emulated trials. We demonstrate that a regularized logistic regression-based propensity score (PS) model outperforms deep learning-based PS models and others, which contradicts our intuition to a certain extent. Finally, we identified eight drugs whose original indications are not AD (pantoprazole, gabapentin, acetaminophen, atorvastatin, albuterol, fluticasone, amoxicillin and omeprazole) that hold great potential of being beneficial to AD patients.

Figure 1. Overview of our high-throughput clinical trial emulation system for Alzheimer's Disease drug repurposing driven by real-world data and machine learning. (a) High-throughput trial emulations of thousands of drug candidates were conducted on two large-scale and longitudinal real-world healthcare databases: OneFlorida and MarketScan. Target trial protocols (eligibility criteria, treatment strategies and assignment, follow-up, outcomes, etc.) were illustrated as a flow chart (details in the Methods section). For each drug candidate, the treated group consisted of patients who were prescribed the trial drug, and the control group was constructed either by random selection of alternative drug groups or by using drug groups under the same second-level Anatomical Therapeutic Chemical classification codes (ATC-L2) as the trial drug group. Hundreds of trials were emulated for each drug by constructing different control groups. *The number of patients in different groups and the outcomes varied across emulated trials. MCI, mild cognitive impairment; AD, Alzheimer's Disease. (b) Causal effect estimation for each emulated drug trial and high-throughput screening of drugs.
State-of-the-art AI-based propensity score (AI-PS) models were used and compared. A novel cross-validation framework for AI-PS models was proposed for training, selecting, and evaluating AI-PS models in terms of goodness-of-balance and goodness-of-fit performance. The optimally trained and selected AI-PS model was used for inverse probability of treatment re-weighting (IPTW) of high-dimensional patient baseline covariates, including age, gender, disease comorbidities, medications, etc., for confounding control. AD events or censoring events were tracked within the two-year follow-up period, and estimated treatment effects were quantified by the adjusted two-year survival difference and the adjusted hazard ratio (HR). Potential repurposable drug candidates were selected if their estimated treatment effects were significantly beneficial and consistent over emulated trials on different databases.

… AUC score on the validation set, maximum SMD after IPTW on the validation set, and our model selection strategy based on both the number of unbalanced covariates after IPTW on the combined training and validation set and the AUC score on the validation set. We reported drugs with ≥ 10% balanced trials. The error bars indicate 95% confidence intervals obtained by 1,000-times bootstrapping. The (two-sided) independent two-sample T-test was used to compare the means of each pair of bars: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. LR-PS, regularized logistic regression-based propensity score models; LSTM-PS, long short-term memory network with attention mechanisms-based propensity score models 7; AUC, area under the receiver operating characteristic curve; SMD, standardized mean difference; IPTW, inverse probability of treatment re-weighting. … 50 emulations by constructing ATC-L2 control groups.

Taking the OneFlorida database (see the Data section) as our discovery set, we included 73,927 patients with an MCI diagnosis from 2012 to 2020 (Fig. 1a). We found 1,825 unique drug ingredients and emulated 182,500 trials. We finally targeted 66 drugs, with 6,600 emulated trials, for which each treatment group had ≥ 500 patients. For each emulated trial, we randomly partitioned the data into mutually exclusive training, validation and testing subsets as standard practice. All PS calculation models were trained on the same training set, and the best-estimated model … showed consistent beneficial effects on both data sets (Table 1, marked in bold). We highlight these eight identified repurposable drug candidates in Fig. 3. … Fluticasone is used to treat nasal symptoms, skin diseases, and also asthma. We observed that fluticasone was associated with a consistent 10% reduced risk of AD (HR 0.90, 95% CI 0.84-0.94) in OneFlorida and a 14% reduced risk of AD (HR …).

Figure 3. Eight repurposable drug candidates for AD with adjusted hazard ratios and 95% confidence intervals. Trial emulations of these eight drugs (a-h) were performed using OneFlorida (FL) and MarketScan (MS) data separately.
For each drug, treated groups consisted of patients who were prescribed the trial drug (see eligibility criteria in the Methods section), and control groups were built by either: (1) randomly selecting alternative drug groups, or (2) using drug groups under the same second-level Anatomical Therapeutic Chemical classification codes (ATC-L2) as the trial drug. The primary analysis emulated 100 trials consisting of 50 random control groups and 50 ATC-L2 control groups (FL-All and MS-All), and two sensitivity analyses used only random controls (FL-Rand and MS-Rand) or only ATC-L2 controls (FL-ATC and MS-ATC). The best regularized logistic regression-based propensity score (LR-PS) model selected by our proposed model selection strategy was used to adjust for 267-dimensional baseline covariates for each emulation. The mean hazard ratio (HR) over balanced emulated trials, with 1,000-bootstrapped 95% confidence intervals, was reported.

… EHRs and claims, in the context of identifying repurposable drug candidates for AD. There are several aspects of our investigation that we would like to highlight.

• First, we emulated hundreds of trials for each drug based on two different ways of constructing control groups, which allowed for potentially more robust estimation of treatment effects. In our investigation, indeed, we observed a large variability (e.g., a wide 95% confidence interval) of estimated treatment effects within emulated trials for certain drugs (e.g., Fig. 3e, albuterol FL-Rand, HR 0.78, 95% CI 0.70-0.86), and sometimes a large discrepancy between emulated trials when building control groups in different ways (Fig. 3f, Table 3). … Potential explanations were rooted in intrinsic heterogeneity across the two datasets: OneFlorida is a regional database mainly covering patients' EHRs in the Florida area, while MarketScan is a nation-wide claims database. … OneFlorida and MarketScan were 767 and 5,041, respectively. Such inconsistency highlights the necessity of leveraging at least two (different types of) data sets to derive robust and consistent evidence.

• Third, we conducted multiple sensitivity analyses to guarantee the robustness of our findings. We investigated the impact of different ways of building control groups on balance performance (Supplementary Fig. 6). Our proposed model selection strategy greatly improved the performance of different PS models over conventional approaches. We also examined the influence of the balance diagnostics on the generated repurposing hypotheses. For example, if we adopted a more stringent balance criterion requiring zero tolerance of unbalanced covariates (compared with the 2% used in our primary analyses) in each emulated trial after re-weighting, we still recovered the top four drugs among our reported eight. …

• Last, compared with existing AD repurposing studies, which typically focused on validating one or two hypotheses with a single type of RWD 30,36,37, our study offered a high-throughput way of generating and validating AD repurposing hypotheses using both EHRs and claims 38, which could further catalyze innovation in AD drug discovery at scale and can be broadly applied to other diseases.
Many recent research efforts have been devoted to developing complex deep learning-based propensity score models 7,39-42. In this paper, after emulating hundreds of thousands of trials from two large-scale RWD warehouses, we found that LSTM-PS 7, a representative deep learning-based PS method, did not outperform LR-PS. Our study also highlighted the importance of model selection; under our proposed strategy, LR-PS also outperformed gradient boosting tree-based PS models and deep multi-layer perceptron-based PS models in terms of balancing performance and the number of generated repurposing hypotheses. In addition, we evaluated another model selection strategy widely used in the literature 4,43-46, which does not follow an out-of-sample validation strategy of partitioning data into complementary subsets but instead estimates and evaluates the PS model on the entire data set. We observed that with this …

This study has several limitations. First, we identified MCI patients and AD onsets using ICD codes (Supplementary Table 2), which were provided by physicians and validated in 47,48, yet there might be a certain level of inaccuracy due to mis- and under-diagnosis or the lack of clinical details in EHRs or claims 38,49. Information contained in clinical notes will be explored in the future through natural language processing to complement the structured codes. Second, although we balanced high-dimensional covariates collected during the baseline period, measurement error, residual confounding, and selection bias in the follow-up period were still possible. Therefore, adapting negative controls 50 for detecting residual confounding and selection bias to high-throughput trial emulation settings would be another promising direction.

In this work, we proposed a high-throughput clinical trial emulation system for AD drug repurposing driven by RWD and machine learning. …

High-throughput trial emulation for Alzheimer's disease (AD)
Instead of emulating one single target trial, here we aimed to explicitly emulate hundreds of thousands of target trials (referred to as high throughput) to identify potential new indications of non-AD drugs present in the two RWD warehouses we utilized. We describe the protocols of the high-throughput trial emulations in detail below and summarize their protocol components and their corresponding (high-throughput) target trials in Extended Data Table 1. An illustration of the cohort selection process is shown in Fig. 1a.

Eligibility criteria. We included patients with at least one mild cognitive impairment (MCI) diagnosis between January 2012 and April 2020 from the OneFlorida database (January 2009 to June 2020 for the MarketScan data). Other inclusion criteria were age at MCI diagnosis ≥ 50, no history of AD or AD-related dementia diagnoses before the baseline, a first MCI diagnosis date prior to the baseline, and ≥ 1 year of records before baseline. Note that we define the baseline as the first prescription date of the trial drug, and all of the above criteria should have been met at baseline.
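To make the eligibility criteria above concrete, the following is a minimal sketch of how such a baseline filter could be implemented. It assumes hypothetical pandas tables dx (diagnoses: patient_id, date, icd), rx (prescriptions: patient_id, date, ingredient) and a birth_year lookup indexed by patient_id, with the MCI and AD/ADRD code prefixes taken from Supplementary Table 2; all names are illustrative and this is not the authors' cohort-extraction code. Python code (illustrative sketch):

import pandas as pd

MCI_CODES = ('331.83', '294.9', 'G31.84', 'F09')                 # Supplementary Table 2
AD_ADRD_CODES = ('331.0', 'G30', 'F01', 'F02', 'F03', '290', '294.1', '294.2')

def eligible_patients(dx, rx, birth_year, trial_drug):
    """Return patient_ids meeting the baseline eligibility criteria for one trial drug.
    dx: diagnoses (patient_id, date, icd); rx: prescriptions (patient_id, date, ingredient);
    birth_year: pandas Series indexed by patient_id. All column names are hypothetical."""
    # baseline = first prescription date of the trial drug, per patient
    baseline = (rx.loc[rx['ingredient'] == trial_drug]
                  .groupby('patient_id')['date'].min().rename('baseline'))
    # first MCI diagnosis date and earliest record of any kind, per patient
    first_mci = (dx.loc[dx['icd'].str.startswith(MCI_CODES, na=False)]
                   .groupby('patient_id')['date'].min().rename('first_mci'))
    first_rec = dx.groupby('patient_id')['date'].min().rename('first_rec')
    tbl = pd.concat([baseline, first_mci, first_rec,
                     birth_year.rename('birth_year')], axis=1, join='inner')
    # patients with AD or AD-related dementia recorded before their baseline
    ad = dx.loc[dx['icd'].str.startswith(AD_ADRD_CODES, na=False), ['patient_id', 'date']]
    ad = ad.merge(baseline.reset_index(), on='patient_id')
    prior_ad = set(ad.loc[ad['date'] < ad['baseline'], 'patient_id'])
    keep = ((tbl['first_mci'] < tbl['baseline'])                              # first MCI before baseline
            & (tbl['first_mci'].dt.year - tbl['birth_year'] >= 50)            # age at MCI diagnosis >= 50
            & (tbl['baseline'] - tbl['first_rec'] >= pd.Timedelta(days=365))  # >= 1 year of prior records
            & ~tbl.index.isin(prior_ad))                                      # no AD/ADRD before baseline
    return tbl.index[keep]

A filter along these lines would be applied once per candidate drug, since the baseline (first prescription of the trial drug) differs from drug to drug.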
Treatment strategies. We compared two strategies for each drug trial: (0) no initiation of the trial drug before or after baseline (control group), and (1) initiation of the trial drug at baseline (treated group). We defined the treatment initiation date with the drug of interest as the first prescription date of the drug, and we required at least two consecutive drug prescriptions over 30 days from the first prescription date in our database as a valid drug initiation.

Treatment assignment procedures. We classified patients into different drug groups according to their baseline eligibility criteria and their treatment strategies. We assumed the treated group and control group were exchangeable at baseline conditional on the high-dimensional baseline covariates. … We selected commonly prescribed drug ingredients for the co-prescribed medication covariates for each drug trial, and thus the medication covariates varied across drug trials. We used 2 covariates (age and sex) for demographics and 1 covariate for the time from the MCI diagnosis date to the drug initiation date. In total, there were 267 covariates to adjust for. In addition to the 267 baseline covariates, we also considered the temporal sequences of diagnoses and medications for the deep long short-term memory network with attention mechanisms-based PS calculation 7.

Follow-up. We followed each patient from his/her baseline until the day of the first AD diagnosis, loss to follow-up (censoring), 2 years after baseline, or the end date of our databases, whichever came first.

Outcomes. The outcome of interest is a diagnosis of AD recorded in the database within the patient's follow-up period, which was denoted as a positive event. If there was no AD diagnosis recorded in a patient's follow-up period and the last prescription date or the last diagnosis date recorded in the database came after the end of the follow-up, we marked it as a negative event. A censoring event is a case where there was no AD diagnosis recorded in a patient's follow-up period and both the last prescription date and the last diagnosis date recorded in the database came before the end of the follow-up. The time to positive event is defined as the days between the baseline date and the first diagnosis of AD. The time to negative event is the time of follow-up. The time to censoring is defined as the days between the baseline date and the last prescription date or the last diagnosis date, whichever comes last. Clinical phenotypes were identified by diagnosis codes selected by experts (Supplementary Table 2).

Causal contrasts of interest. The observational analog of the intention-to-treat effect of being assigned to trial drug initiation versus no initiation at baseline.

High-throughput emulation. We emulated trials for all drugs that appeared in our databases with at least 500 eligible patients in their treated groups. For each emulated trial, its treated group consists of eligible patients who initiated the trial drug, and its control group consists of eligible patients who had no initiation of the trial drug.
We constructed the no-initiation patient group in two ways: a) randomly selecting eligible patients from other drug initiation groups 48, or b) selecting patients from similar drug groups under the same second-level Anatomical Therapeutic Chemical classification category 54 (ATC-L2) as the target trial drug 6. We further excluded any patients who were also in the trial drug group or who were prescribed the trial drug before baseline. To obtain statistically robust results with varying control groups, we emulated 100 trials for each target drug, among which 50 emulated trials used random controls and the other 50 used ATC-L2 controls as described above.

Causal effect estimation and screening of repurposing drugs.
We used propensity score (PS) methods 55 for confounding adjustment, estimated the treatment effects of a large number of emulated drug trials, and proposed two criteria to screen and prioritize non-AD drugs for repurposing (summarized in Fig. 1b).

Propensity score and IPTW. For each emulated trial, we used the propensity score (PS) framework 55 to learn empirical treatment assignment given baseline covariates, and used inverse probability of treatment weighting (IPTW) 56 to balance the treated and control groups. We use the tuple (X, Z, Y, T) to represent the data of both treated and control groups, where X, Z, Y, T represent the baseline covariates, treatment assignment, outcome indicator, and time to event, respectively. The PS is defined as P(Z = 1|X) 55, where Z is the treatment assignment (Z = 1 and Z = 0 for treated and control, respectively) and X denotes patients' observed baseline covariates. The inverse probability of treatment weight (IPTW) is defined as $\frac{Z}{P(Z=1|X)} + \frac{1-Z}{1-P(Z=1|X)}$ 56,57, which turns the original trial into a more balanced pseudo-trial by re-weighting each data sample. We use an updated version, the stabilized IPTW, defined as

$$w = \frac{Z\,P(Z=1)}{P(Z=1|X)} + \frac{(1-Z)\,\bigl(1-P(Z=1)\bigr)}{1-P(Z=1|X)}, \qquad (1)$$

to deal with extreme re-weighting weights and thus potentially inflated sample sizes 7,58,59. A machine learning (ML) or deep learning (DL)-based propensity score (ML/DL-PS) model is a binary classification model $f_\theta \in F_\Theta : X \to Z$ that approximates P(Z = 1|X) by $f_\theta(X)$ with learnable parameters $\theta$. Here, we use $F_\Theta$ to denote a set of ML/DL models (e.g., a set of models with varying hyper-parameters) and $f_\theta$ to denote one specific model instance in this set.

Performance evaluation criteria. We evaluated the performance of estimated PS models in terms of two aspects: a) the goodness-of-balance, and b) the goodness-of-fit.

The goodness-of-balance is measured by the standardized mean difference (SMD) 17,43,60 on the whole dataset, defined as

$$\mathrm{SMD} = \frac{\lvert \mu_{\mathrm{treat}} - \mu_{\mathrm{control}} \rvert}{\sqrt{\bigl(s^2_{\mathrm{treat}} + s^2_{\mathrm{control}}\bigr)/2}}, \qquad (2)$$

where $x_{\mathrm{treat}}, x_{\mathrm{control}} \in \mathbb{R}^D$ represent the vector representations of the D covariates of the treated group and control group respectively, $\mu_{\mathrm{treat}}, \mu_{\mathrm{control}} \in \mathbb{R}^D$ are their sample means over the treated group and control group respectively, and similarly $s^2_{\mathrm{treat}}, s^2_{\mathrm{control}} \in \mathbb{R}^D$ are their sample variances.
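As a concrete illustration of Eq. 1 and Eq. 2, the following minimal sketch estimates the PS with scikit-learn's L2-regularized logistic regression (standing in for LR-PS), computes stabilized IPTW weights, and evaluates covariate balance via the SMD on synthetic data. The weighted-variance convention used for the weighted SMD is one common choice and is an assumption here, not necessarily the authors' exact formula. Python code (illustrative sketch):

import numpy as np
from sklearn.linear_model import LogisticRegression

def stabilized_iptw(ps, z):
    """Stabilized inverse probability of treatment weights (Eq. 1)."""
    p_treat = z.mean()                                   # marginal P(Z = 1)
    return np.where(z == 1, p_treat / ps, (1.0 - p_treat) / (1.0 - ps))

def smd(x_treat, x_control, w_treat=None, w_control=None):
    """Element-wise (optionally IPTW-weighted) standardized mean difference (Eq. 2)."""
    def moments(x, w):
        if w is None:
            return x.mean(axis=0), x.var(axis=0, ddof=1)
        w = w / w.sum()                                  # normalize weights
        mu = (w[:, None] * x).sum(axis=0)
        # weighted variance: one common convention, assumed here (cf. Eq. 3)
        var = (w[:, None] * (x - mu) ** 2).sum(axis=0) / (1.0 - (w ** 2).sum())
        return mu, var
    mu_t, var_t = moments(x_treat, w_treat)
    mu_c, var_c = moments(x_control, w_control)
    return np.abs(mu_t - mu_c) / np.sqrt((var_t + var_c) / 2.0)

# toy data standing in for one emulated trial (267 baseline covariates)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 267))
Z = rng.integers(0, 2, size=2000)

# L2-regularized logistic regression as the LR-PS model, fit on a training split
lr_ps = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(X[:1200], Z[:1200])
ps = lr_ps.predict_proba(X)[:, 1]                        # estimated P(Z = 1 | X)
w = stabilized_iptw(ps, Z)
smd_after = smd(X[Z == 1], X[Z == 0], w[Z == 1], w[Z == 0])
n_unbalanced = int((smd_after > 0.1).sum())              # covariates still unbalanced (SMD > 0.1)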
Suppose that we have learned a sample weight $w_i$ for each patient i by IPTW; the weighted sample mean and variance are then

$$\mu_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad s_w^2 = \frac{\sum_i w_i}{\bigl(\sum_i w_i\bigr)^2 - \sum_i w_i^2} \sum_i w_i \bigl(x_i - \mu_w\bigr)^2. \qquad (3)$$

The weighted versions of the sample mean and variance hold for both treated and control groups, and thus we omit their group subscripts for brevity. The weighted SMD, $\mathrm{SMD}_{\mathrm{weight}}$, can be calculated by applying the above weighted mean and variance to Eq. 2. All operations in Eq. 2 and Eq. 3 are conducted element-wise for each covariate. For each dimension d of either the original SMD or the weighted SMD, the covariate is considered balanced if its d-th SMD value satisfies SMD(d) ≤ 0.1 17, and the treated and control groups are balanced if the total number of unbalanced covariates is ≤ 2% × D 7. Taking the IPTW re-weighted case as an example, we can calculate the number of unbalanced covariates after IPTW by

$$n_{\mathrm{weight}} = \sum_{d=1}^{D} \mathbb{1}\bigl[\mathrm{SMD}_{\mathrm{weight}}(d) > 0.1\bigr]. \qquad (4)$$

The smaller $n_{\mathrm{weight}}$ is, the better the balance performance of IPTW is, and the less biased the estimated causal effect is. As shown in 60, SMD is one of the top predictors of the bias of the estimated causal effect. To quantify the balance performance of the high-throughput emulation of one drug's trials, we further define the probability of successfully balancing one specific drug M's trials by a set of PS models $F_\Theta$ as $P_M(n_{\mathrm{weight}} \le 2\%\times D \mid F_\Theta)$, which can be estimated by the fraction of successfully balanced trials over all emulations as follows:

$$\hat{P}_M\bigl(n_{\mathrm{weight}} \le 2\%\times D \mid F_\Theta\bigr) = \frac{1}{n_e} \sum_{i=1}^{n_e} \mathbb{1}\bigl[n_{\mathrm{weight}}^{(i)} \le 2\%\times D\bigr], \qquad (5)$$

where $n_e$ is the total number of emulated trials $(X, Z, Y, T)_i$, i = 1, 2, ..., $n_e$ for drug M, $f_{\mathrm{best}}$ is the best PS model among $F_\Theta$ learned from the i-th emulated trial, and the IPTW and $n_{\mathrm{weight}}^{(i)}$ are calculated by applying $f_{\mathrm{best}}$ to the i-th emulated trial. We discuss how to learn and select $f_{\mathrm{best}} \in F_\Theta$ in the next section. In general, the larger the balancing success rate $P_M(n_{\mathrm{weight}} \le 2\%\times D \mid F_\Theta)$ is, the better the $F_\Theta$ model set balances the drug M trials.

The goodness-of-fit is the generalized prediction performance of the PS model on unseen data. We use the area under the receiver operating characteristic curve (AUC) measured on the (unseen) testing set to quantify it 61,62. The larger the AUC on the testing set is, the better the generalization performance of the classification model is. … was observed in our high-throughput study.

Here, we introduce our model training and selection algorithm tailored for ML/DL-based PS models in Algorithm 1, aiming for the best goodness-of-balance performance as well as the best possible goodness-of-fit performance. We also describe the evaluation (testing) algorithm for ML/DL-based PS models in Algorithm 2, to evaluate and benchmark different learned and selected models. We use the binary cross-entropy loss L as the objective function for learning empirical binary propensity scores.

Algorithm 1 ML/DL-PS model training and selection algorithm
Input: (X_train, T_train), (X_val, T_val): training and validation sets of patient data (T here denotes the treatment assignment label, i.e., Z above); F_Θ: a set of ML/DL-PS models;
Output: f_best: the best PS model learned from (X_train, T_train)
1: for every f_θ in F_Θ do
2:   train f_θ on the training set (X_train, T_train) by optimizing the binary cross-entropy loss L(T_train, f_θ(X_train))
3:   compute the stabilized IPTW w by applying f_θ and Eq. 1 on (X_train ∪ X_val, T_train ∪ T_val)
4:   compute the re-weighted SMD_weight on (X_train ∪ X_val, T_train ∪ T_val) using w, Eq. 2 and Eq. 3
5:   compute the number of unbalanced covariates n_weight after IPTW by Eq. 4
6:   compute the AUC of f_θ on the validation set (X_val, T_val)
7:   update the best selected model f_best ← f_θ if n_weight is smaller than the current minimum n_weight, or if n_weight equals the current minimum n_weight and the AUC is smaller than the current minimum AUC
8: return f_best

Algorithm 2 ML/DL-PS model evaluation (testing) algorithm
Input: (X_train, T_train), (X_val, T_val), (X_test, T_test): training, validation and test sets of patient data; f_best: a PS model to be evaluated;
Output: the goodness-of-balance and goodness-of-fit performance of f_best
1: compute the stabilized IPTW w by applying f_best and Eq. 1 on the whole dataset (X_train ∪ X_val ∪ X_test, T_train ∪ T_val ∪ T_test)
2: compute the re-weighted SMD_weight on (X_train ∪ X_val ∪ X_test, T_train ∪ T_val ∪ T_test) using w, Eq. 2 and Eq. 3
3: compute the number of unbalanced covariates n_weight after IPTW by Eq. 4
4: compute the AUC of f_best on the test set (X_test, T_test)
5: return n_weight, AUC

Statistical analysis. We emulated trials for each drug at high throughput, following the protocols discussed in the section above, and estimated intention-to-treat effects for each of the emulated trials. We applied different ML/DL-based PS models … and reported the sample means of the different outcome estimators with 95% confidence intervals 67 over all the balanced trials. We used two ways of building different control groups (random controls or ATC-L2 controls) and different balance criteria (different thresholds for SMD) to evaluate the robustness of our estimated effects in sensitivity analyses.

Screening and prioritization. To generate reliable and robust repurposing hypotheses for AD, we require that the estimated beneficial effects of repurposing drug candidates be significant and consistent. For significant (beneficial) effects, we require that the fraction of successfully balanced trials of a drug candidate after IPTW be ≥ 10%, and that the adjusted 2-yr survival differences of these balanced trials be significant. We use bootstrapping hypothesis testing 67 to test whether the sample mean of the adjusted 2-yr survival difference over all the balanced trials is > 0 (< 1 for HRs), and we consider a p-value < 0.05 as significant. For consistency of effects, we require that the estimated effects be significantly beneficial across the different databases. We then ranked the drug candidates according to their estimated effects.

Comparison with existing works. We replicated the analytic approach of Liu et al. 7 and found that their methods led to biased SMD estimation and worse balance performance, as shown in Table 1.
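The selection loop of Algorithm 1 can be sketched as follows, reusing the hypothetical stabilized_iptw() and smd() helpers from the earlier sketch. The candidate set F_Θ is illustrated with a grid of regularization strengths (an assumption, not the authors' search space), and the update rule mirrors the text of Algorithm 1: fewest unbalanced covariates on the combined training and validation set, with ties broken by the validation-set AUC. Python code (illustrative sketch):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def select_ps_model(X_train, Z_train, X_val, Z_val, cs=(0.01, 0.1, 1.0, 10.0)):
    """Pick the PS model with the fewest unbalanced covariates on train+val,
    breaking ties by the validation AUC, mirroring the update rule of Algorithm 1."""
    X_tv = np.vstack([X_train, X_val])
    Z_tv = np.concatenate([Z_train, Z_val])
    best_model, best_key = None, None
    for c in cs:                                              # candidate set F_Theta (assumed grid)
        f = LogisticRegression(penalty='l2', C=c, max_iter=1000).fit(X_train, Z_train)
        ps = f.predict_proba(X_tv)[:, 1]
        w = stabilized_iptw(ps, Z_tv)                         # Eq. 1, helper from earlier sketch
        smd_w = smd(X_tv[Z_tv == 1], X_tv[Z_tv == 0],         # Eqs. 2-3, helper from earlier sketch
                    w[Z_tv == 1], w[Z_tv == 0])
        n_unbalanced = int((smd_w > 0.1).sum())               # Eq. 4 on train+val
        auc_val = roc_auc_score(Z_val, f.predict_proba(X_val)[:, 1])
        key = (n_unbalanced, auc_val)                         # lexicographic: smaller wins on both
        if best_key is None or key < best_key:
            best_model, best_key = f, key
    return best_model, best_key

Applying the returned model to the held-out test split, as in Algorithm 2, then gives the goodness-of-balance and goodness-of-fit figures reported for each emulated trial.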
Extended Data Table 1. A summary of the protocol of the target trials and the high-throughput emulations to estimate the effect of different drugs on AD risk using real-world healthcare data.
Data (emulation): OneFlorida (2012-2020) and MarketScan (2009-2020).
Eligibility criteria (emulation): same as for the target trial.
Treatment strategies (emulation): we define the medication initiation date to be the first date of a prescription of the trial drug and require ≥ 2 prescriptions within ≥ 30 days from the initiation date as a valid initiation.
Assignment procedures (target trial): patients are randomly assigned to either treatment strategy at baseline and are aware of the strategy they are assigned to.
Assignment procedures (emulation): we classify patients into different groups according to their baseline eligibility criteria and treatment strategy. We assumed that the treated group and control group were exchangeable by adjusting for high-dimensional confounders collected before the baseline, including demographics, diagnoses, medications, the time lag between MCI initiation and the index date, etc.
Outcomes (target trial): diagnosis of AD. (Emulation): same as for the target trial; we define AD diagnosis according to selected ICD-9/10 codes within follow-up.
Follow-up: we follow each patient from his/her baseline date until the date of his/her first AD diagnosis, loss to follow-up, or 2 years (730 days) after the baseline, whichever happens first.
Causal contrast (target trial): intention-to-treat effect. (Emulation): observational analog of the intention-to-treat effect.
High-throughput emulation: we have a large number of drug candidates, and for each drug we conduct a target trial following the above protocol to estimate its effect. We emulate trials for all the drugs in our database with ≥ 500 patients in the trial drug group, and for each drug we emulate 100 trials by constructing different control groups as follows. For each emulated trial, the treated group consists of patients who were eligible and adopted the trial drug strategy according to the above protocol, and its control groups consist of eligible patients either from randomly chosen drug groups other than the trial drug group or from similar drug groups within the same second-level ATC category as the trial drug; we further exclude anyone who was also in the trial drug group or who was prescribed the trial drug before baseline.
Analysis plan (target trial): intention-to-treat analysis of the time-to-first event, applying IPTW to adjust for baseline confounders, with non-parametric bootstrapping for 95% CIs.
Analysis plan (emulation): the same intention-to-treat analyses, applying different ML/DL-based PS models to adjust for high-dimensional baseline confounders by IPTW; different PS model selection strategies are investigated. Adjusted 2-yr survival differences by the adjusted KM method and adjusted HRs by adjusted CoxPH; we report sample means with 95% bootstrapped CIs for balanced trials from the high-throughput emulations. Sensitivity: estimated effects by building different control groups (random controls or ATC-L2 controls) and by different balance criteria.
MCI, mild cognitive impairment; AD, Alzheimer's disease; KM, Kaplan-Meier; HR, hazard ratio; CoxPH, Cox proportional hazards; CIs, confidence intervals; ML/DL, machine learning or deep learning; IPTW, inverse-probability treatment weights; PS, propensity score.

Propensity score models selected by our model selection strategy balanced significantly more trials than other model selection methods for all target drugs.
We reported drug trials with at least 10% balanced trials based on 100 emulated trials for each drug. The error bars indicate 95% confidence intervals obtained by 1,000-times bootstrapping. The (two-sided) independent two-sample T-test was used to compare the means of each pair of bars: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. MLP-PS, multi-layer perceptron-based propensity score models; GBT-PS, gradient boosted tree-based propensity score models.

We reported drugs with at least 10% balanced trials based on 100 emulated trials for each drug. Box plots show the 25th percentile (Q1, lower quartile), the median (central vertical line), and the 75th percentile (Q3, upper quartile), with whiskers extending to ±1.5× the interquartile range (IQR = Q3 - Q1). Triangle marks represent sample means. The (two-sided) independent two-sample T-test was used to compare the means of each pair of groups: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. AUC, the area under the receiver operating characteristic curve; LR-PS, regularized logistic regression-based propensity score models; LSTM-PS, long short-term memory network with attention mechanisms-based propensity score models.

… Triangle marks represent sample means. The (two-sided) independent two-sample T-test was used to compare the means of each pair of groups: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. AUC, the area under the receiver operating characteristic curve; MLP-PS, multi-layer perceptron-based propensity score models; GBT-PS, gradient boosted tree-based propensity score models.

… for all target drugs. We applied LR-PS to all drug trials with ≥ 500 treated patients in MarketScan and reported drug trials among which 50% of trials were balanced after re-weighting, based on 100 emulated trials. We applied LSTM-PS to the drug candidates selected by LR-PS and reported drugs with 10% balanced trials, because LSTM-PS is not scalable to all existing drugs in MarketScan as LR-PS is. We required a re-weighted trial to be balanced only if all high-dimensional covariates were balanced after IPTW. The error bars indicate 95% confidence intervals obtained by 1,000-times bootstrapping. The (two-sided) independent two-sample T-test was used to compare the means of each pair of bars: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. IPTW, inverse-probability treatment weights; PS, propensity score.
(1) val_auc, by the AUC on the validation set; (2) val_maxsmd, by the maximum SMD after IPTW on the validation set; (3) val_nsmd, by the number of unbalanced covariates after IPTW on the validation set; (4) train_maxsmd, by the maximum SMD after IPTW on the training set; (5) train_nsmd, by the number of unbalanced covariates after IPTW on the training set; (6) trainval_maxsmd, by the maximum SMD after IPTW on the combined training and validation set; (7) trainval_nsmd, by the number of unbalanced covariates after IPTW on the combined training and validation set; (8) trainval_final, our model selection strategy based on both the number of unbalanced covariates after IPTW on the combined training and validation set and the AUC score on the validation set. We reported drug trials with at least 10% balanced trials based on 100 emulated trials for each drug. The error bars indicate 95% confidence intervals obtained by 1,000-times bootstrapping. The (two-sided) independent two-sample T-test was used to test the difference between each method and our final strategy: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. LR-PS, regularized logistic regression-based propensity score models; LSTM-PS, long short-term memory network with attention mechanisms-based propensity score models 7; IPTW, inverse-probability treatment weights; PS, propensity score.

Propensity score models selected by our model selection strategy balanced significantly more trials than other model selection methods for all target drugs. We reported drug trials with at least 10% balanced trials based on 100 emulated trials for each drug. The error bars indicate 95% confidence intervals obtained by 1,000-times bootstrapping. The (two-sided) independent two-sample T-test was used to compare the means of each pair of bars: *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05.

Supplemental Materials for: Drug repurposing driven by emulating trials on real-world data and AI: a case of Alzheimer's disease

Supplementary Table 2. Selected ICD-9/10 diagnosis codes for mild cognitive impairment (MCI) and Alzheimer's Disease (AD).
Usage: the definition of MCI in real-world healthcare data for selection of the targeted population.
ICD-9 codes:
331.83 Mild cognitive impairment, so stated
294.9 Unspecified persistent mental disorders due to conditions classified elsewhere
ICD-10 codes:
G31.84 Mild cognitive impairment, so stated
F09 Unspecified mental disorder due to known physiological condition
To select patients with any of the above codes in the database, Python code: str.startswith(('331.83', '294.9', 'G31.84', 'F09', '33183', '2949', 'G3184'), na=False)

AD. Usage: the definition of AD in real-world healthcare data for selection of eligible individuals before baseline and identification of the outcome during follow-up.
ICD-9 codes:
331.0 Alzheimer's disease
ICD-10 codes:
G30 Alzheimer's disease
G30.0 Alzheimer's disease with early onset
G30.1 Alzheimer's disease with late onset
G30.8 Other Alzheimer's disease
G30.9 Alzheimer's disease, unspecified
To select patients with any of the above codes in the database, Python code: str.startswith(('331.0', '3310', 'G30'), na=False)

Usage: the definition of AD-related dementias in real-world healthcare data for selection of eligible individuals before baseline.
ICD-9 codes:
294.10 Dementia in conditions classified elsewhere without behavioral disturbance
294.11 Dementia in conditions classified elsewhere with behavioral disturbance
294.20 Dementia, unspecified, without behavioral disturbance
294.21 Dementia, unspecified, with behavioral disturbance
290.* Dementias
ICD-10 codes:
F01.* Vascular dementia
F02.* Dementia in other diseases classified elsewhere
F03.* Unspecified dementia
To select patients with any of the above codes in the database, Python code: str.startswith(('F01', 'F02', 'F03', '290', '294.10', '294.11', '294.20', '294.21', '2941', '29411', '2942', '29421'), na=False)
MCI, mild cognitive impairment; AD, Alzheimer's disease; ICD-9/10, the International Classification of Diseases, 9th or 10th Revision.

[Baseline characteristics table garbled in extraction; recoverable figures include 424,961 MCI patients in one database, of whom 67,973 (16.00%) had an AD diagnosis and 356,988 (84.00%) did not, with comparisons of MCI age, sex, antidiabetic and antihypertensive medication use, and tobacco use between the AD and non-AD groups.]
a T-test for the null hypothesis that two independent samples (population with an AD diagnosis vs. population without any AD diagnosis) have identical average values, except for sex. b Chi-square test of independence of the observed male and female frequencies. c MCI age is the sample median with inter-quartile range.
a 2-year standardized AD-free survival differences and hazard ratios after inverse probability of treatment re-weighting (IPTW) by the regularized logistic regression-based PS model (LR-PS) using our proposed model selection strategy, adjusted for 267 covariates in total: age, sex, diagnosis codes, medications, and the time from the MCI initiation date to the trial drug initiation date. Covariates were collected during the baseline period. Drugs were ranked by the estimated 2-yr survival differences after IPTW. b We selected drugs for which at least 50% of emulated trials were balanced and, for each balanced trial, all the unbalanced features were balanced after IPTW. c Control groups were constructed randomly, either from alternative drug cohorts or from similar drug cohorts under ATC-L2. We set the number of patients in the control group to at most 3-fold that of the treated group and report the mean number over all balanced trials here. All statistics were sample means over balanced trials. Bootstrapped p-values for the one-sample T-test and 1,000-bootstrapped 95% confidence intervals are reported. *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. AD, Alzheimer's disease; MCI, mild cognitive impairment; IPTW, inverse probability of treatment re-weighting; CI, confidence interval.

a 2-year standardized AD-free survival differences and hazard ratios after inverse probability of treatment re-weighting (IPTW) by the regularized logistic regression-based PS model (LR-PS) using our proposed model selection strategy, adjusted for 267 covariates in total: age, sex, diagnosis codes, medications, and the time from the MCI initiation date to the trial drug initiation date. Covariates were collected during the baseline period. Drugs were ranked by the estimated 2-yr survival differences after IPTW. b We selected drugs for which at least 10% of emulated trials were balanced, and we required all covariates of a balanced trial to be balanced (compared with a tolerance of 2% unbalanced covariates in our primary analyses) after IPTW. c Control groups were constructed randomly, either from alternative drug cohorts or from similar drug cohorts under ATC-L2. We set the number of patients in the control group to at most 3-fold that of the treated group and report the mean number over all balanced trials here. All statistics were sample means over balanced trials. Bootstrapped p-values for the one-sample T-test and 1,000-bootstrapped 95% confidence intervals are reported. *, p < 0.05; **, p < 0.01; ***, p < 0.001; not significant (ns), p ≥ 0.05. AD, Alzheimer's disease; MCI, mild cognitive impairment; IPTW, inverse probability of treatment re-weighting; CI, confidence interval.
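For illustration, the adjusted quantities referenced in these footnotes (a weighted Kaplan-Meier 2-year AD-free survival difference and an IPTW-weighted Cox hazard ratio) could be computed along the following lines. The use of the lifelines library and the column names are assumptions for this sketch and do not reflect the authors' implementation. Python code (illustrative sketch):

import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

def adjusted_effects(trial: pd.DataFrame):
    """trial columns (illustrative): t (days to event or censoring), event (1 = AD
    diagnosis), treat (1 = treated arm), w (stabilized IPTW weight)."""
    # adjusted 2-year (730-day) AD-free survival per arm via weighted Kaplan-Meier
    surv = {}
    for arm, g in trial.groupby('treat'):
        km = KaplanMeierFitter()
        km.fit(g['t'], event_observed=g['event'], weights=g['w'])
        surv[arm] = float(km.predict(730))
    surv_diff = surv[1] - surv[0]                 # adjusted 2-yr survival difference

    # adjusted hazard ratio via an IPTW-weighted Cox model with treatment as the
    # only regressor; robust variance is advisable with non-integer weights
    cph = CoxPHFitter()
    cph.fit(trial[['t', 'event', 'treat', 'w']], duration_col='t', event_col='event',
            weights_col='w', robust=True)
    hr = float(cph.hazard_ratios_['treat'])
    return surv_diff, hr

The bootstrapped p-values and 95% confidence intervals described in the footnotes would then correspond to repeating such an estimation over 1,000 bootstrap resamples of the balanced emulated trials and summarizing the resulting means.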