Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET) Using Electronic Health Record Data Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET) Using Electronic Health Record Data Yuri Ahuja, Liang Liang, Selena Huang, Tianxi Cai January 9, 2021 Abstract Leveraging large-scaleelectronichealthrecord (EHR)data toestimatesurvival curves forclinical events canenablemorepowerfulriskestimationandcomparativeeffectivenessresearch.However,useofEHRdata is hindered by a lack of direct event times observations. Occurrence times of relevant diagnostic codes or target disease mentions in clinical notes are at best a good approximation of the true disease onset time. On the other hand, extracting precise information on the exact event time requires laborious manual chart reviewand is sometimesaltogether infeasibleduetoa lackofdetaileddocumentation.Currentstatus labels – binary indicators of phenotype status during follow up – are significantly more efficient and feasible to compile, enablingmoreprecise survival curveestimationgiven limitedresources.Existingsurvivalanalysis methodsusingcurrentstatus labels focusalmostentirelyonsupervisedestimation,andnaiveincorporation of unlabeled data into these methods may lead to biased results. In this paper we propose Semi-supervised CalibrationofRiskwithNoisyEventTimes(SCORNET),whichyieldsaconsistentandefficientsurvivalcurve estimator by leveraging a small size of current status labels and a large size of imperfect surrogate features. In addition to providing theoretical justification of SCORNET, we demonstrate in both simulation and real- worldEHRsettingsthatSCORNETachievesefficiencyakintotheparametricWeibullregressionmodel,while alsoexhibitingnon-parametricflexibilityandrelativelylowempiricalbiasinavarietyofgenerativesettings. 1 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ 1 Introduction The Electronic Health Record (EHR) has in recent years become an increasingly available source of data for clinicalandtranslational research(Kohaneandothers, 2012;HripcsakandAlbers, 2012;Miottoandothers, 2016). Comprisingheterogeneousclinicalencounters includingdiagnosticandproceduralbillingcodes, labtests,pre- scriptions, and free text clinical notes for millions of patients, these rich data offer abundant opportunities for insilicoepidemiologicalanalysis.Oneapplicationthathasgarneredrecent interest isestimationofpopulation disease risk within EHR patient cohorts, which can enable more powerful and precise estimation of real-world disease risks as well as comparative effectiveness analysis of alternative treatment strategies (Hodgkins and others, 2017; Dean and others, 2003; Liu and others, 2018; Panahiazar and others, 2015; Steele and others, 2018). Several studies have had success estimating time to death within rule-defined disease cohorts (Panahiazar and others, 2015; Steele and others, 2018). However, estimating the temporal risk of developing a disease is a more challenging task due to EHR’s lack of direct observations of either disease status or the timing of disease on- set.Convenientproxiesofdiseasestatusoronset timebasedonreadilyavailable featuressuchas International Classification of Disease (ICD) codes often exhibit low specificity and systematic temporal biases, potentially yielding highly biased disease risk estimators if used as event time labels (Cipparone and others, 2015; Uno and others, 2018). On the other hand, extracting precise information on disease outcomes requires labor-intensive manualchartreview,whichisparticularlychallengingforeventtimessincetheeventmayoccuroutsideof the hospital system and only be mentioned during follow-up visits. It is thus only practically feasible to annotate the current status Δ = �() ≤ �) of the event time), where � is the follow up time. In this paper, we consider the problem of estimating the disease risk �(C) = %() ≤ C) when only a small numberoflabelsonΔ andalargequantityofunlabeledEHRfeaturesW, includingproxiesof),areavailable.Su- pervisedsurvival curveestimationwithcurrentstatusdataon {Δ,�} iswell established inthestatistical liter- ature with several available parametric, semi-parametric and non-parametric procedures (Vardi, 1982; Huang, 2 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ 1996; van der Laan and Robins, 1998; van der Laan and Jewell, 2003; Lin and others, 2019, e.g.). For example, van der Laan and Robins (1998) proposed a non-parametric, locally efficient estimator via inverse probability of censoring weighting (IPCW), assuming that (1)) and � are conditionally independent given some informa- tive baseline covariates Z0 ⊂ W (e.g. age, sex, etc.) and (2) a consistent estimator for the conditional density of � | `0 is available.However, theseexistingestimatorsdonot leverageunlabeledEHRfeature informationsuch as time to first surrogate ICD code, which may greatly improve risk estimation precision. SinceWmaybehighlypredictiveof), theestimationof((C)canpotentiallybeimprovedviasemi-supervised learning (SSL) leveraging both the small set of Δ observations in the labeled set and the EHR features W in the unlabeledset. SSLhasbeenshowntosignificantlymitigatebiasand/or improveefficiency forvariousriskpre- diction applications (Chai and others, 2017; Liang and others, 2016; Bair and Tibshirani, 2004; Golub and others, 1999). For example, several studies employ semi-parametric models to impute event times in the unlabeled set for subsequent input into an outcome survival model alongside labeled data (Chai and others, 2017; Liang and others, 2016; Zhao and others, 2014; Uno and others, 2018; Hassett and others, 2017; Chubak and others, 2012; Choi and others, 2015; Kaji and others, 2019; Ruan and others, 2019; Ahuja and others, 2020a). While such imputa- tion approaches may improve efficiency under correct specification of the imputation model, they are subject to significant bias if the imputation model is misspecified. In addition, these existing methods do not allow for useofcurrentstatus labels fortraining.Othergeneralaugmentedinverseprobabilityweightingmethodsinthe missing data literature (Seaman and White, 2013; Rotnitzky and Robins, 2014, e.g.) are not directly applicable here since the probabilities of labels being observed tend to zero in the SSL setting. We address this shortcoming by proposing Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET) for estimation of ((C). SCORNET utilizes current status labels while also employing a robust semi- supervised imputation approach on the extensive unlabeled set to maximize survival estimation efficiency. To mitigateimputationbiasandmaximizeefficiencygainfromtheunlabeleddata,SCORNETutilizesahighlyflexi- 3 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ blesemi-non-parametrickernelregressionmodelwithEHRfeaturesascovariates,whichensuresthevalidityof the resulting risk estimator without requiring the imputation model to hold. In addition to providing theoret- ical justifications for the SCORNET estimator, we illustrate via simulation studies that SCORNET substantially outperforms existing methods with regards to the bias-variance tradeoff. The rest of the paper is organized as follows. In Section 2, we detail the SCORNET procedure along with its associated inference procedures. In Section3, wereport riskestimation performancerelative toexisting methods in diversesimulation studies.To further illustrate the utility of SCORNET in clinical applications, we apply it to a real-world EHR study estimat- ing the risk of heart failure among rheumatoid arthritis patients in Section 4. Finally, in Section 5 we briefly discuss the strengths, weaknesses, and potential applications of SCORNET. 2 Methods 2.1 Setup Let) denotetheeventtimeforwhichweareinterestedinestimatingacumulativedistributionfunction �(C) = %() ≤ C) and survival function ((C) = 1−�(C). In the EHR study we do not observe) but ratherΔ = �() ≤ �) for a small labeled subset, where � is the follow up time with finite support [0, g2]. For all subjects, we also observe a set of baseline covariates `0 and longitudinal EHR features `. Since codes used in the EHR are often highlysensitivebutnotspecific, thereoftenexistssomefiltervariableF ∈ {0, 1} suchthatΔ8 | (F8 = 0,W8) = 0 almostsurely,whereW8 = (`T0,8, ` T 8 )T.Moreover,weassumethat (), `)� | `0.Weassumethatdataforanalysis consistofasmall setof= current-status-labeledobservationsrandomlyselectedamongthosewithF = 1 along with a larger set of # unlabeled observations:D = {D8 = (�8,+8Δ8,W8,+8,F8)T, 8 = 1, ..., #} = L∪U, where L = {(�8,Δ8,W8, 1, 1)T : F8 = 1,+8 = 1, 8 = 1, ...,=} andU = {(�8, 0,W8, 0,F8)T : +8 = 0, 8 = = + 1, ..., #} with log(#)/log(=) → a0 > 3/2 as = →∞. 4 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Sincethecensoring� maydependon `0,we followthe IPCWstrategyofvanderLaanandRobins (1998) to weight observations by lC,1(� | `0) = 1(� − C) 52(C | `0) where 1(B) = (B/1)/1, 52(C | `0) = 3�2(C | `0)/3C, �2(C | `0) = %(� ≤ C | `0), (·) is some symmetric density function, and 0 < 1 = $(=−a) is the bandwidth thereof with a ∈ (1/5, 1/3]. IPCW enables consistent estimation of functionals of) ≤ C and W since for any reasonable choice of function @(·) and 0, 3 ∈ {0, 1}, � { Δ 3 8 @(W8)F 0 8 lC,1(�8 | `0,8) } = � { �()8 ≤ C)3@(W8)F08 } +$(12). (1) The IPCW estimator for c(C) = %()8 ≤ C | F8 = 1) proposed by van der Laan and Robins (1998) essentially corresponds to (C) = ∑= 8=1 F8Δ8lC,1(�8 | `0,8)∑= 8=1 F8lC,1(�8 | `0,8) with 52(C | `0) in lC,1(�8 | `0,8) replaced by a consistent estimator that converges faster than =−1/4, which is not difficult to achieve under reasonable modeling assumptions since �8 | `0,8 can be estimated using the full data D. To this end, we propose to derive an estimator for the conditional density 52(C | `0) = _2(C | `0,8)(2(C | `0,8)byimposingasemi-parametricmodel for� | `0.Althoughmanycommonlyemployedmodels can be used since once again � is fully observed for all patients, we illustrate our proposal by focusing on the Cox proportional hazards model (Cox, 1972) under which _2(C | `0,8) = _02(C)4$ T`0,8 and (2(C | `0,8) ≡ 1 −�2(C | `0,8) = exp { −Λ02(C)4$ T`0,8 } , (2) where _2(C | `0,8) is the conditional hazard function of �8 | `0,8, _02(C) is the unknown baseline hazard func- tion, Λ02(C) = ∫ C 0 _02(B)3B, and $ is the vector of unknown covariate effects. 2.2 SCORNET Estimation As outlined in Figure 1, SCORNET consists of three steps: (1) estimating the conditional censoring distribution ℎ (C | `0)usingD; (2)fittingan imputationworkingmodel for c(C | W) ≡ %() ≤ C | W,F = 1)usingL, denoting 5 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ theestimateof c(C | W) as (C | W); and (3)estimating ((C) bymarginalizing (C | W)F+Δ (1 −F) = (C | W)Fvia IPCW. Figure 1: Schematic of the SCORNET algorithm. 2.2.1 Step 1: Estimate 52 (C | `0) Under the Cox Model for � | `0 Toestimate 52(C | `0),wefittheCoxmodel (2) tothefulldataD toobtainthepartial likelihoodestimator $̂ for $. We subsequently estimate Λ0(C) and _02(C) respectively as the standard Breslow estimator Λ̂02(C) and the kernel-smoothed Breslow estimator _̂02(C) (Basha and Hoxha, 2019), where Λ̂02(C) = #∑ 9=1 �(�9 ≤ C)∑# 8=1 �(�8 ≥ �9)exp ( $̂T`0,8 ) , _̂02(C) = #∑ 9=1 0# ( �9 − C )∑# 8=1 �(�8 ≥ �9)exp ( $̂T`0,8 ) , and 0# = $(#−a2) with a2 ∈ (1/5, 1/3]. We then obtain _̂2(C | `0) = _̂02(C)4$̂ T`0,8 and (̂2(C | `0) = exp { −Λ̂02(C)4$̂ T`0,8 } , and we estimate 52(C | `0,8) as 5̂2(C | `0,8) = _̂2(C | `0,8)(̂2(C | `0,8). Following standard asymptotic results for non-parametric kernel regression (Pagan and Ullah, 1999), it is not difficult to show that sup`0,C | 5̂2(C | 6 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ `0) − 52(C | `0)| = $?{log(#)1/2(#0#)−1/2} = >?(). We denote the resulting estimate for the censoring weight as l̂1=(C | `0,8) = 1=(�8 − C)/ 5̂2(C | `0,8). 2.2.2 Step 2: Estimate an Imputation Model c(C | W8) ≡ % ()8 ≤ C | W8,F8 = 1) To leverage the unlabeled data, we fit a flexible imputation working model c(C | W8) = 6 { "(C) + #0(C) T ®̀0,8 + #(C)T`8 } = 6 { )(C)T ®W8 } (3) where `8 denotes theEHRsurrogate features, )(C) = ( U(C), #0(C)T, #(C)T )T, and ®W8 = (1, ®̀T0,8, `T8 )T.Under (3), %()8 ≤ C | W8,F8 = 1) = 6 { )(C)T ®W8 } , and hence we may estimate )(C) as )̂(C) = ( Û(C), #̂0(C)T, #̂(C)T )T , the solution to the IPCW estimating equation evaluated withL, =∑ 8=1 l̂C,1=(�8 | `0,8)F8 ®W8 { Δ8 −6 ( )T ®W8 )} = 0, where 1= = $(=−a) with a ∈ (1/5, 1/2). In practice, 1= can be chosen via either standard cross-validation or heuristicplug-invalues.ForafutureobservationwithfilterstatusF8 = 1 andcovariatesW8,weimpute �()8 ≤ C) as the conditional risk ĉ(C | W8) = 6 { )̂(C)T ®W8 } . It is not difficult to show that )̂(C) converges in probability to )̄(C), the solution to the limiting estimating equation � [ ®W8 { �()8 ≤ C) −6()T ®W8) } | F8 = 1 ] = 0, which ensures that � {c̄(C | W8) | F8 = 1} = %()8 ≤ C | F8 = 1), where c̄(C | W8) = 6{)̄(C)T ®W8}, (4) regardless of the adequacy of the imputation model (3). 2.2.3 Step 3: Estimate �(C) by Marginalizing Imputed Risks Finally, we marginalize the imputed values ĉ8C = ĉ(C | W8) ∀F8 = 1 and Δ8 = 0 ∀F8 = 0 to estimate �(C). Since F8 depends on �8, we again employ IPCW to marginalize ĉ8CF8 +Δ8(1 −F8) = ĉ8CF8 and thereby construct our 7 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ final estimator for �(C): �̂(C) = ∑# 8=1{ĉ8CF8 +Δ8(1 −F8)}l̂C,0# (�8 | `0,8)∑# 8=1 l̂C,0# (�8 | `0,8) = ∑# 8=1 ĉ8CF8l̂C,0# (�8 | `0,8)∑# 8=1 l̂C,0# (�8 | `0,8) . 2.3 Inference for �̂(C) Following standard theory for non-parametric kernel regression (Pagan and Ullah, 1999), we show in the Sup- plementary Materials that �̂(C) → %()8 ≤ C,F8 = 1) + %()8 ≤ C,F8 = 0) = %()8 ≤ C) = �(C) in probability under mild regularity conditions and correct specification of the censoring model regardless of the adequacy of the imputation model. Here, we note that for any C ∈ [0, g2], 0 = %(Δ8 = 0 | F8 = 0,�8 = C,W8) implies that %()8 ≤ C | F8 = 0) = 0. Furthermore, (=1=)1/2 { �̂(C) −�(C) } = ( 1= = )1/2 =∑ 8=1 l̂C,1=(�8 | `0,8)F8 {Δ8 − c̄(C | W8)} +>?(1) = ( 1= = )1/2 =∑ 8=1 lC,1=(�8 | `0,8)F8 {Δ8 − c̄(C | W8)} +>?(1) since supC | 5̂2(C | `0) − 52(C | `0)| = >?(). It follows that (=1=)1/2{�̂(C) −�(C)} is asymptotically normal with mean 0 and variance f 2(C) = '( )�{V(C | `0,8)/ 52(C | `0,8)}, (5) where V(C | `0,8) = �[F8{�()8 ≤ C) − c̄(C | W8)}2 | `0,8] and '( ) = ∫ (G)23G. Our derivation for the asymptotic distribution of �̂(C) can effectively ignore the vari- ability associated with the estimation of censoring weights, which simplifies the asymptotic variance f2(C). Importantly, f2(C) decreases as the imputation model approximates c(C | W8) better since V(C | `0,8) = �[F8{�()8 ≤ C) − c(C | W8)}2 | `0,8] + �[F8{c(C | W8) − c̄(C | W8)}2 | `0,8] decreases. To estimate f2(C) in practice, one may construct a plug-in estimator, f̂ 2(C) = 1= = =∑ 8=1 l̂C,1=(�8 | `0,8) 2F8 { Δ8 −6 ( )̂(C)T ®W8 )}2 . 8 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ By contrast, the supervised IPCW estimator that incorporates filter negative patients takes the form �̂(C) = ∑# 8=1{ĉ(C)F8 +Δ8(1 −F8)}l̂C,0# (�8 | `0,8)∑# 8=1 l̂C,0# (�8 | `0,8) = ∑# 8=1 ĉ(C)F8l̂C,0# (�8 | `0,8)∑# 8=1 l̂C,0# (�8 | `0,8) . The asymptotic variance of (=1=)1/2{�̂(C) −�(C)} is then f2(C) = '( )�{V(C | `0,8)/ 52(C | `0,8)}, where V(C | `0,8) = � [ F8{�()8 ≤ C) −c(C)}2 | `0,8 ] . The variance f2(C) is equivalent to that of SCORNET if and only if the feature set W is uninformative for) (i.e. )W). Supervised IPCWisotherwise lessefficient,withrelativeefficiencycontrolledbytherelativemagnitudes of the marginal error �[{�()8 ≤ C) − �(C)}2 | F8 = 1, `0,8] and the conditional error �[{�()8 ≤ C) − c̄(C | W8)}2 | F8 = 1, `0,8]. 3 Simulation Study We conduct extensive simulation experimentation to evaluate the finite sample performance of the proposed SCORNET estimator in realistic settings with = ∈ {100, 200} observed labels within the set of filter-positive patients, defining the filter to have 99% sensitivity and 88% specificity for Δ. We compare SCORNET to three existing survival function estimators with current status data: 1) parametric Weibull Accelerated Failure Time (AFT) regression with interval event times (Lin andothers, 2019), 2) semi-parametric Cox Proportional Hazards regression with interval event times and Breslow baseline hazard estimation (Huang, 1996; Cox, 1972; Breslow, 1972), and3)non-parametric IPCWestimation(vanderLaanandRobins,1998).Weincorporate thefilter inthe Weibull and Cox models by setting Δ8 | (F8 = 0) = 0 and weighting the = labeled filter-positive patients by 1 = ∑# 8=1 F8. Weibull and Cox are implemented using the icenReg package in R, while IPCW is implemented per the algorithm detailed in van der Laan and Robins (1998), estimating � | `0 usingDunder the Cox model. We note that estimating the censoring distribution usingLyields similar asymptotic performance to usingD, but in finite sample settings the latter offers higher efficiency. 9 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Setting `0 ∼ � | `0 ∼ ) | `0 ∼ ` ∼ 1 Unif(−1, 1) Weibull ( 104−0.5`0, 32 ) Weibull ( 154−0.3`0, 52 ) Normal{) + 1,f(�)/2} 2 Unif(−1, 1) Weibull ( 104−0.5`0, 32 ) Weibull ( 15, 52 ) Normal{) + 1,f(�)/2} 3 Unif(−1, 1) Weibull ( 104−0.5`0, 32 ) Weibull ( 154−0.3` 2 0, 52 ) Normal{) + 1,f(�)/2} 4 Unif(−1, 1) Weibull ( 104−0.5`0, 32 ) Logistic (15 − 4`0, 3) Normal{) + 1,f(�)/2} 5 Unif(−1, 1) Weibull ( 10, 32 ) Weibull ( 154−0.3`0, 52 ) Normal{) + 1,f(�)/2} 6 Unif(−1, 1) Weibull ( 104−0.5` 2 0, 32 ) Weibull ( 154−0.3`0, 52 ) Normal{) + 1,f(�)/2} Table 1: Generative parameters employed in our simulation study. Weconsider6diversegenerativemechanismsasdetailedinTable3,includingcaseswhereWeibull-distributed accelerated failure time of ) | `0, proportional hazards of ) | `0, and proportional hazards of � | `0 are re- spectively violated, as well as cases where SCORNET’s imputation model is and is not misspecified. In settings 1, 2, and 5, we consider various cases where SCORNET and all comparator methods are correctly specified. In setting 1 we consider the specific case where � and ) both depend on `0, and both � | `0 and ) | `0 are Weibull-distributed satisfying accelerated failure time and proportional hazards. In setting 2, by contrast, we consider a case where )`0 to assess robustness to over-parametrization of this relationship, and in setting 5 we consider a case where �`0 to evaluate robustness to over-parametrization thereof. In settings 3 and 4 we assess the benefit of SCORNET and IPCW’s robustness to the distribution of ) | `0 when this distribution sat- isfies neither Weibull accelerated failure time nor proportional hazards. We evaluate SCORNET’s sensitivity to misspecification of the imputation model in settings 1, 3, and 5, as compared to correct specification thereof in settings 2 and 4. Finally, in setting 6 we assess the sensitivity of SCORNET and IPCW to misspecification of theconditionalcensoringmodel� | `0. Foreachgivenconfiguration,wecomputetheempiricalbias, standard error, and root mean squared error (RMSE) of all estimators for �(C) based on their average performance on 10 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ 500 simulated datasets evaluated at 100 equally-spaced time points C ∈ [&�(0.1) + 1=,&�(0.9) − 1=], where & denotes the quantile function of � under the configuration. We used plug-in bandwidths of 1= = B̂(�)=−1/4 and 0# = B̂(�)#−1/4 for the imputation (Step 2) and marginalization (Step 3) steps of SCORNET respectively, where B̂ is the empirical standard deviation of observed �. We present the performance of the estimators av- eraged over the selected time points using = = 200 labels in Figure 2. The performance at each time point can be found in Supplementary Figure 1, and time-averaged performance using = = 100 labels can be found in Supplementary Table 1 of the Supplementary Materials. Figure 2: Time-averaged empirical absolute biases (left), standard errors (second from left), root relative efficiencies (second from right), and relative RMSEs (right) of the Weibull Accelerated Failure Time (red), Cox Proportional Hazards w/ Breslow baseline (blue), supervised IPCW (green), and SCORNET estimators using weakly informative (purple) and strongly informative (orange) surrogates, in various simulated settings with = = 200 observed current status labels. 11 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ As Figure 2 demonstrates, imputing using a strongly informative feature ` (SCORNET-Strong) results in consistently higher efficiency than just using the weakly informative baseline `0 (SCORNET-Weak), which in turn is markedly more efficient than not leveraging the unlabeled set at all (IPCW). SCORNET makes minimal assumptions regarding the distribution of ) | `0, settling for non-parametric efficiency in exchange for en- hanced flexibility. By contrast, the Weibull regression model fully parametrizes ) | `0, and the Cox model assumes proportional hazards thereof, increasing efficiency at the expense of bias in the case of misspecifica- tion. As expected, Weibull consistently achieves higher empirical efficiency than Cox, which in turn is more efficient than IPCW across settings. Notably, SCORNET consistently achieves empirical efficiency comparable to Weibull and significantly higher than Cox despite being far more flexible than both, again highlighting the efficiencygainedbyleveragingauxiliaryinformationtoimputeunobservedrisks.Atthesametime,SCORNETis muchlesssusceptibletomodelmisspecificationbias thanWeibull, asdemonstratedbythe latter’s significantly higherbiasandRMSEinSetting4. Indeed,SCORNETachievesrelatively lowmeanabsolutebiasacrosssettings, with MSE apparently dominated by variance rather than bias in the setting of 100-200 labels. Consistent with the theory, SCORNET is robust to misspecification of the imputation model in settings 1, 3, and 5, achieving equivalently insignificantbiasas insetting2andmarginallybutnotmeaningfullyhigherbias thaninsetting4. That said, correctnessof the imputationmodel insettings2and4doesnotyieldanymeaningful change inrel- ative efficiency, likely because inherent variability functionally dominates imputation model bias given so few labels. Reassuringly, SCORNET (and IPCW) appear insensitive to misspecification of � | `0 in setting 6, achiev- ing functionally equivalent bias to the correctly-specified Weibull and Cox models. Altogether, these results corroborate the assertion that SCORNET’s semi-supervised utilization of informative feature data to impute risks in the unlabeled set improves estimation efficiency without introducing bias regardless of the validity of the imputationmodel.Moreover, theysuggest thatSCORNETisparticularlyvaluable insettingswhere (1)flex- 12 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ ibility is desired with regard to the distribution of) | `0, and (2) there exists a large set of unlabeled patients with associated EHR data – both commonplace in retrospective observational studies. Figure3:Empiricalcoverageprobabilitiesaveragedovertime(left)andplottedovertime(right)ofSCORNET-Strong’s95% confidence intervals constructed with the bootstrap (red) and plug-in (blue) standard error estimators in various simu- lated settings with = = 200 observed current status labels. See Table 1 for details of the generative mechanism employed in each setting. Toassessthefinitesampleperformanceoftheproposedintervalestimationprocedures,weobtainstandard errorestimatesbothusingtheproposedplug-inestimator f̂(C) andviabootstrapwith500replicates. InFigure 3 we demonstrate empirical coverage probabilities of SCORNET’s 95% Wald confidence intervals constructed using each standard error estimator, both averaged over the selected timepoints (left) and at each timepoint (right). Reassuringly, we find that the 95% confidence intervals using both plug-in and bootstrap estimators achievenearly 95% meancoverageacrosssettings.Coverageonlydropsbelow 90% at thetailsof theevent time 13 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ support due to moderately increased bias from kernel smoothing thereabout. The plug-in estimator achieves marginally lower coverage than the bootstrap estimator at the right tail due to underestimation of the true standard error, likely because of overfitting of the imputation model given low local censoring density (and thus low effective #). Notably, we do not observe this trend in setting 4, wherein correct specification of the imputation model obviates overfitting. Thus, we posit that the plug-in estimator can be reliably used for fi- nite sample problems with = ∈ [100, 200] labels as long as the conditional censoring density 52(� | `0) is sufficiently high and the timepoints evaluated are sufficiently far from the tails of the event time support. 4 Application to Assessing Heart Failure Risk Among Rheumatoid Arthritis Patients Rheumatoidarthritis (RA),achronicinflammatorydiseasethataffectsapproximately 1% ofthegeneralpopula- tion, is associated with dramatically increased risk of heart failure (HF) morbidity and mortality (Kaplan, 2010; Nicola andothers, 2005, 2006; Ahlers andothers, 2020). One study estimated that RA patients have a 1.9-fold life- time risk of developing HF compared to matched RA-negative controls (Nicola andothers, 2005), while another estimated that HF accounts for 13% of excess mortality among RA patients (Nicola and others, 2006). Ongoing interest lies in estimating the risk of developing HF subtypes in RA cohorts and quantifying the risk modifying effect of various RA treatments (Ahlers and others, 2020). Due to the increased availability of electronic health record (EHR) data, it is now possible to assess HF risk for a broader RA population using these data. For ex- ample, at Mass General Brigham we previously established an EHR cohort of #0 = 16, 358 RA patients (Huang and others, 2020). This large RA cohort can potentially be used to study the longitudinal risk of HF among RA patients. However, such analysis is not straightforward as HF status is not readily available within the RA cohort. We propose to estimate HF risk among RA patients by leveraging (1) = current status labels on HF status obtained 14 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ via manual chart review, and (2) informative yet unlabeled EHR data, including time to first ICD code for HF, as surrogate variables `. We estimate both the age-specific HF risk, �age(·), and the risk of developing HF after the patient’s incident ICD code for RA (714), �RA+(·), among patients with at least 6 months of follow up whose incident RA codes occur after the age of 16 to select for adult-onset as opposed to juvenile RA. Among filter- positive patients, defined as having at least 1 ICD code for HF, we have = = 300 labels on censoring time HF status Δ for age-specific HF risk, and we have = = 126 for post-RA HF risk. We let the baseline covariates `0 includesexanddecadeoffirstEHReventfor �age(·), andsex,decadeoffirstRAcode,andageatfirstRAcodefor �RA+(·).WeobtainHFriskestimatorsbasedonSCORNETaswellas theaforementionedcomparatorestimators. For the imputation model in Step 2 of SCORNET, we consider three EHR-derived surrogate risk predictors for `: (1) thepredictedΔ basedontheunsupervisedMultimodalAutomatedPhenotyping(MAP)algorithm,which uses the total counts of HF ICD codes and mentions of HF in clinical notes, as well as the total count of all ICD codesasahealthcareutilizationmeasure (Liaoandothers, 2019), (2) thepredictedΔ basedontheunsupervised Surrogate-guidedEnsembleLatentDirichletAllocation(sureLDA)algorithm,which leverages thefeaturesused in MAP as well as 121 additional manually-selected EHR features including counts of relevant medications, ICD codes, and concept unique identifiers (CUIs) in clinical notes (Ahuja and others, 2020b); and (3) the time to first HF ICD code. As in our simulation, we select plug-in bandwidths of 1= = B̂(�)=−1/4 and 0# = B̂(�)#−1/4 for the imputation and marginalization steps of SCORNET respectively, and we evaluate risk at 100 timepoints C ∈ [&�(0.1) +1=,&�(0.9)−1=]. We again compare the performance of SCORNET to that of Weibull, Cox, and IPCW,incorporatingthefilter intheWeibullandCoxmodelsbypropensityweightingaswedointhesimulation study. In Figure 4, we show the estimated HF risk curves along with their standard errors. Reassuringly, all meth- ods appear to agree rather closely for estimation of both age-specific HF risk and HF risk after RA diagnosis. For the latter quantity, however, Weibull and Cox appear to underfit while IPCW appears to overfit relative to 15 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Figure4:Estimatedage-specificandpost-RAcumulativerisksofheart failure(top)andbootstrapstandarderrorsthereof (bottom)overtimeoftheWeibullAcceleratedFailureTime(red,short-long-dashed),CoxProportionalHazardsw/Breslow baseline (blue, dot-dashed), supervised IPCW (green, dashed), and SCORNET (purple, solid) estimators. theSCORNETestimator,whichappears toachieveareasonablemiddleground.Moreover,SCORNETonceagain attainsstandarderrorscomparable to thoseof theWeibull estimatorandmeaningfully lowerthanthoseof the Cox and IPCW estimators. This suggests that while the Weibull and Cox models potentially fail to capture the complexity of the post-RA HF risk function, and IPCW is too unstable for a limited labeled set of size = = 126, SCORNET offers an attractive balance of efficiency and flexibility and is thus well conditioned for such a sce- nario. As shown in Figure 5, averaged over the selected timepoints, the root relative efficiency of SCORNET is 1.11, 2.55, and 3.31 compared to the Weibull, Cox, and IPCW estimators respectively for estimation of age- specific risk, and 1.34, 2.32, and 3.85 respectively for estimation of HF risk after RA diagnosis. Once again, the fact that SCORNET achieves efficiency moderately higher than the relatively inflexible Weibull model and sig- 16 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ nificantly higher than the Coxand IPCW estimators reflects the value of leveraging available information from the EHR to bolster risk estimation efficiency. Figure 5: Time-averaged bootstrap standard errors (left) and empirical root relative efficiencies (right) of the Weibull AcceleratedFailureTime(red),CoxProportionalHazardsw/Breslowbaseline(blue), IPCW(green),andSCORNET(purple) estimators for estimation of (1) age-specific HF risk (left), and (2) HF risk after RA diagnosis (right), among RA patients in the Partners EHR database. 5 Discussion By leveragingasizeableunlabeleddatasetcontaining imperfect surrogatesof the trueevent timesandasmall set with observed current status labels, the SCORNET estimator serves as a robust and efficient alternative to existing model-free survival estimators with current status data. The semi-supervised nature of SCORNET makes it well-conditioned to EHR-based survival estimation in settings where only a limited number of labels are available or readily obtainable. Moreover, by only requiring current status labels rather than the precise 17 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ timing of event onset, SCORNET greatly reduces the burden of chart review and increases the feasibility of studying disease risk using EHR data. To allow for covariate-dependent censoring, which is frequently present in observational settings, SCOR- NET requires additional assumptions on the distribution of � | `0. Although we choose the proportional haz- ards model for illustration, any standard semi-parametric model will yield similar properties for the resulting estimator. Since {�, `0} are observed for all subjects, one can potentially allow for more flexible (i.e. non- parametric) censoring models. That said, our simulation results suggest that SCORNET is relatively insensitive to misspecification of the model for � | `0. Even under mild misspecification, it achieves consistently lower mean squared errors than existing estimators. When interest lies in assessing how risk differs across different patient sub-populations, it is straightfor- wardtoextendSCORNETtoestimatesubgroup-specificrisks forasmallnumberof subgroups.However, future research is warranted to estimate covariate-specific risks for a general set of covariates. 6 Software An R package, including a sample use case and complete documentation, is available at https://cran.r-project.org/web/packages/SCORNET/index.html. Source code can be found at https://github.com/celehs/SCORNET. Funding ThisworkwassupportedbytheU.S.National InstitutesofHealthGrantsT32-AR05588512,T32-GM7489714,and R21-CA242940. 18 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Acknowledgements The authors declare no conflicts of interest. References Ahlers, Michael J., Lowery, Brandon D., Farber-Eger, Eric, Wang, Thomas J., Bradham, William, Orm- seth, Michelle J., Chung, Cecilia P., Stein, C. Michael and Gupta, Deepak K. (2020). Heart failure risk associated with rheumatoid arthritis-related chronic inflammation. Journal of the American Heart Association, 9. Ahuja, Yuri, Hong, Chuan, Xia, Zongqi and Cai, Tianxi. (2020a). Samgep: A novel method for prediction of phenotype event times using the electronic health record. Preprint. Ahuja, Yuri, Zhou, Doudou, He, Zeling, Sun, Jiehuan, Castro, Victor M, Gainer, Vivian, Murphy, Shawn N, Hong, Chuan and Cai, Tianxi. (2020b). surelda: A multidisease automated phenotyping method for the electronic health record. Journal of theAmericanMedical InformaticsAssociation 27(8), 1235–1243. Bair, Eric and Tibshirani, Robert. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLoSBiology 2(4), E108. Basha,LuleandHoxha,Fatmir. (2019). Kernelestimationofthebaselinefunctioninthecoxmodel. European Scientific Journal 15(6), 105–118. Breslow,NormanE. (1972). Discussionofprofessorcox’spaper. Journalof theRoyalStatisticalSociety,SeriesB34, 216–217. Chai,Hua,Li,Zi-na,Meng,De-yu,Xia,Liang-yongandLiang,Yong. (2017). Anewsemi-supervisedlearning model combined with cox and sp-aft models in cancer survival analysis. ScientificReports 7(13053). 19 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Choi, Edward, Du, Nan, Chen, Robert, Song, Le and Sun, Jimeng. (2015). Constructing disease network and temporal progression model via context-sensitive hawkes process. IEEE Computer Society. pp. 101–108. Chubak, Jessica, Yu, Onchee, Pocobelli, Gaia, Lamerato, Lois, Webster, Joe, Prout, Marianna N, Yood, Marianne Ulcickas, Barlow, William E and Buist, Dianna SM. (2012). Administrative data algorithms to identify second breast cancer events following early-stage invasive breast cancer. Journal of the National Cancer Institute 104(12), 931–940. Cipparone, Charlotte W, Withiam-Leitch, Matthew, Kimminau, Kim S, Fox, Chet H, Singh, Ranjit and Kahn, Linda. (2015). Inaccuracy of icd-9 codes for chronic kidney disease: A study from two practice-based research networks (pbrns). The Journal of theAmericanBoardof FamilyMedicine 28(5), 26094. Cox,DavidR. (1972). Regressionmodelsandlife-tables. Journalof theRoyalStatisticalSociety.SeriesB34, 187–220. Dean, Bonnie B, Lam, Jessica, Natoli, Jaime L, Butler, Qiana, Aguilar, Daniel and Nordyke, Robert J. (2003). Use of electronic medical records for health outcomes research: A literature review. Medical Care ResearchandReview 31(6), 611–638. Golub,T.R., Slonim,D.K.,Tamayo,P.,Huard,C.,Gaasenbeek,M.,Mesirov, J.P.,Coller,H.,Loh,M.L.,Down- ing, J.R., Caligiuri, M.A., Bloomfield, C.D. and others. (1999). Molecular classification of cancer: Class dis- covery and class prediction by gene expression monitoring. Science 286(5439), 531–537. Hassett, Michael J, Uno, Hajime, Cronin, Angel M, Carroll, Nikki M, Hornbrook, Mark C and Ritzwoller, Debra. (2017). Detecting lung and colorectal cancer recurrence using structured clini- cal/administrativedatatoenableoutcomesresearchandpopulationhealthmanagement.MedicalCare55(12), e88–e98. 20 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Hodgkins, Adam J, Bonney, Andrew, Mullan, Judy, Mayne, Darren John and Barnett, Stephen. (2017). Survival analysis using primary care electronic health record data: A systematic review of the literature. Health InformationManagement Journal 47(1), 6–16. Hripcsak, George and Albers, David J. (2012). Next-generation phenotyping of electronic health records. Journal of theAmericanMedical InformaticsAssociation 20(1), 117–121. Huang, Jian. (1996). Efficient estimation for the proportional hazards model with interval censoring. The Annals of Statistics 24(2), 540–568. Huang, Sicong, Huang, Jie, Cai, Tianrun, Dahal, Kumar P, Cagan, Andrew, He, Zeling, Stratton, Jack- lyn, Gorelik, Isaac, Hong, Chuan, Cai, Tianxi and others. (2020). Impact of icd10 and secular changes on electronic medical record rheumatoid arthritis algorithms. Rheumatology. Kaji, Deepak A, Zech, John R, Kim, Jun S, Cho, Samuel K, Dangayach, Neha S, Costa, Anthony B and Oer- mann, Eric K. (2019). An attention based deep learning model of clinical events in the intensive care unit. PLoSOne 14(2), e0211057. Kaplan, Mariana J. (2010). Cardiovascular complications of rheumatoid arthritis - assessment, prevention. and treatment. RheumaticDiseaseClinics ofNorthAmerica 36(2), 405–426. Kohane, Isaac S, Churchill, Susanne E and Murphy, Shawn N. (2012). A translational engine at the na- tional scale: informatics for integrating biology and the bedside. Journal of the American Medical Informatics Association 19(2), 181–185. Liang, Yong, Chai, Hua, Liu, Xiao-Ying, Xu, Zong-Ben, Zhang, Hai and Leung, Kwong-Sak. (2016). Cancer survival analysis using semi-supervised learning method based on cox and aft models with l1/2 regulariza- tion. BMCMedicalGenomics 9(11), 11. 21 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Liao, Katherine P, Sun, Jiehuan, Cai, Tianrun A, Link, Nicholas, Hong, Chuan, Huang, Jie, Huffman, Jennifer E, Gronsbell, Jessica, Zhang, Yichi, Ho, Yuk-Lam, Castro, Victor, Gainer, Vivian, Murphy, ShawnN,O’Donnell,ChristopherJ,Caziano,JMichael,Cho,Kelly,Szolovits,Peter,Kohane,IsaacS, Yu, Sheng and others. (2019). High-throughput multimodal automated phenotyping (map) with application of phewas. Journal of theAmericanMedical InformaticsAssociation 26(11), 1255–1262. Lin,Hung-Mo,Williamson,JohnMandKim,Hae-Young. (2019). Firthadjustmentforweibullcurrent-status survival analysis. Communications inStatistics -TheoryandMethods 49(18), 4587–4602. Liu, Bin, Li, Ying, Sun, Zhaonan, Ghosh, Soumya and Ng, Kenney. (2018). Early prediction of diabetes com- plicationsfromelectronichealthrecords:Amulti-tasksurvivalanalysisapproach. In:The32ndAAAIConference onArtificial Intelligence. Association for the Advancement of Artificial Intelligence. pp. 101–108. Miotto, Riccardo, Li, Li, Kidd, Brian A and Dudley, Joel T. (2016). Deep patient: an unsupervised represen- tation to predict the future of patients from the electronic health records. ScientificReports 6(6), 26094. Nicola,PauloJ.,Crowson,CynthiaS.,Maradit-Kremers,Hilal,Ballman,KarlaV.,Roger,VeroniqueL., Jacobsen, Steven J. and Gabriel, Sherine E. (2006). Contribution of congestive heart failure and ischemic heart disease to excell mortality in rheumatoid arthritis. Arthritis Rheumatology 54(1), 60–67. Nicola, Paulo J., Maradit-Kremers, Hilal, Roger, Veronique L., Jacobsen, Steven J., Crowson, Cyn- thia S., Ballman, Karla V. and Gabriel, Sherine E. (2005). The risk of congestive heart failure in rheuma- toid arthritis: a population-based study over 46 years. Arthritis Rheumatology 52(2), 412–420. Pagan, Adrian and Ullah, Aman. (1999). Nonparametric econometrics. Cambridge university press. Panahiazar, Maryam, Taslimitehrani, Vahid, Pereira, Naveen and Pathak, Jyotishman. (2015). Using ehrs and machine learning for heart failure survival analysis. Studies inHealthTechnologyand Informatics 216, 40–44. 22 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/ Rotnitzky, Andrea and Robins, James M. (2014). Inverse probability weighting in survival analysis. Wiley StatsRef: StatisticsReferenceOnline. Ruan, Tong, Lei, Liqi, Zhou, Yangming, Zhai, Jie, Zhang, Le, He, Ping and Gao, Ju. (2019). Representation learning for clinical time series prediction tasks in electronic health records. BMC Medical Informatics and DecisionMaking 19(259). Seaman, Shaun R and White, Ian R. (2013). Review of inverse probability weighting for dealing with missing data. Statisticalmethods inmedical research 22(3), 278–295. Steele, Andrew J, Denaxas, Spiros C, Shah, Anoop D, Hemingway, Harry and Luscombe, Nicholas M. (2018). Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease. PLoSOne 13(8), e0202344. Uno,Hajime,Ritzwoller,DebraP,Cronin,AngelM,Carroll,NikkiM,Hornbrook,MarkCandHassett, MichaelJ. (2018). Determiningthetimeofcancerrecurrenceusingclaimsorelectronicmedicalrecorddata. JCOClinicalCancer Informatics 2, 1–10. van der Laan, Mark J and Jewell, Nicholas P. (2003). Current status and right-censored data structures when observing a marker at the censoring time. TheAnnals of Statistics 31(2), 512–535. van der Laan, Mark J and Robins, James M. (1998). Locally efficient estimation with current status data and time-dependent covariates. Journal of theAmericanStatisticalAssociation 93(442), 693–701. Vardi, Y. (1982). Nonparametric estimation in the presence of length bias. Annals of Statistics 10, 178–203. Zhao, Yue, Herring, Amy H, Zhou, Haibo, Ali, Mirza W and Koch, Gary G. (2014). A multiple imputation methodforsensitivityanalysesof time-to-eventdatawithpossibly informativecensoring. JournalofBiophar- maceutical Statistics 24(2), 229–253. 23 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425976 http://creativecommons.org/licenses/by-nc/4.0/