Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET) Using Electronic Health Record Data


Semi-supervised Calibration of Risk with Noisy Event Times

(SCORNET) Using Electronic Health Record Data

Yuri Ahuja, Liang Liang, Selena Huang, Tianxi Cai

January 9, 2021

Abstract

Leveraging large-scaleelectronichealthrecord (EHR)data toestimatesurvival curves forclinical events

canenablemorepowerfulriskestimationandcomparativeeffectivenessresearch.However,useofEHRdata

is hindered by a lack of direct event times observations. Occurrence times of relevant diagnostic codes or

target disease mentions in clinical notes are at best a good approximation of the true disease onset time.

On the other hand, extracting precise information on the exact event time requires laborious manual chart

reviewand is sometimesaltogether infeasibleduetoa lackofdetaileddocumentation.Currentstatus labels

– binary indicators of phenotype status during follow up – are significantly more efficient and feasible to

compile, enablingmoreprecise survival curveestimationgiven limitedresources.Existingsurvivalanalysis

methodsusingcurrentstatus labels focusalmostentirelyonsupervisedestimation,andnaiveincorporation

of unlabeled data into these methods may lead to biased results. In this paper we propose Semi-supervised

CalibrationofRiskwithNoisyEventTimes(SCORNET),whichyieldsaconsistentandefficientsurvivalcurve

estimator by leveraging a small size of current status labels and a large size of imperfect surrogate features.

In addition to providing theoretical justification of SCORNET, we demonstrate in both simulation and real-

worldEHRsettingsthatSCORNETachievesefficiencyakintotheparametricWeibullregressionmodel,while

alsoexhibitingnon-parametricflexibilityandrelativelylowempiricalbiasinavarietyofgenerativesettings.

1

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


1 Introduction

The Electronic Health Record (EHR) has in recent years become an increasingly available source of data for

clinicalandtranslational research(Kohaneandothers, 2012;HripcsakandAlbers, 2012;Miottoandothers, 2016).

Comprisingheterogeneousclinicalencounters includingdiagnosticandproceduralbillingcodes, labtests,pre-

scriptions, and free text clinical notes for millions of patients, these rich data offer abundant opportunities for

insilicoepidemiologicalanalysis.Oneapplicationthathasgarneredrecent interest isestimationofpopulation

disease risk within EHR patient cohorts, which can enable more powerful and precise estimation of real-world

disease risks as well as comparative effectiveness analysis of alternative treatment strategies (Hodgkins and

others, 2017; Dean and others, 2003; Liu and others, 2018; Panahiazar and others, 2015; Steele and others, 2018).

Several studies have had success estimating time to death within rule-defined disease cohorts (Panahiazar and

others, 2015; Steele and others, 2018). However, estimating the temporal risk of developing a disease is a more

challenging task due to EHR’s lack of direct observations of either disease status or the timing of disease on-

set.Convenientproxiesofdiseasestatusoronset timebasedonreadilyavailable featuressuchas International

Classification of Disease (ICD) codes often exhibit low specificity and systematic temporal biases, potentially

yielding highly biased disease risk estimators if used as event time labels (Cipparone and others, 2015; Uno and

others, 2018). On the other hand, extracting precise information on disease outcomes requires labor-intensive

manualchartreview,whichisparticularlychallengingforeventtimessincetheeventmayoccuroutsideof the

hospital system and only be mentioned during follow-up visits. It is thus only practically feasible to annotate

the current status Δ = �() ≤ �) of the event time), where � is the follow up time.

In this paper, we consider the problem of estimating the disease risk �(C) = %() ≤ C) when only a small

numberoflabelsonΔ andalargequantityofunlabeledEHRfeaturesW, includingproxiesof),areavailable.Su-

pervisedsurvival curveestimationwithcurrentstatusdataon {Δ,�} iswell established inthestatistical liter-

ature with several available parametric, semi-parametric and non-parametric procedures (Vardi, 1982; Huang,

2

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


1996; van der Laan and Robins, 1998; van der Laan and Jewell, 2003; Lin and others, 2019, e.g.). For example,

van der Laan and Robins (1998) proposed a non-parametric, locally efficient estimator via inverse probability

of censoring weighting (IPCW), assuming that (1)) and � are conditionally independent given some informa-

tive baseline covariates Z0 ⊂ W (e.g. age, sex, etc.) and (2) a consistent estimator for the conditional density of

� | `0 is available.However, theseexistingestimatorsdonot leverageunlabeledEHRfeature informationsuch

as time to first surrogate ICD code, which may greatly improve risk estimation precision.

SinceWmaybehighlypredictiveof), theestimationof((C)canpotentiallybeimprovedviasemi-supervised

learning (SSL) leveraging both the small set of Δ observations in the labeled set and the EHR features W in the

unlabeledset. SSLhasbeenshowntosignificantlymitigatebiasand/or improveefficiency forvariousriskpre-

diction applications (Chai and others, 2017; Liang and others, 2016; Bair and Tibshirani, 2004; Golub and others,

1999). For example, several studies employ semi-parametric models to impute event times in the unlabeled

set for subsequent input into an outcome survival model alongside labeled data (Chai and others, 2017; Liang

and others, 2016; Zhao and others, 2014; Uno and others, 2018; Hassett and others, 2017; Chubak and others, 2012;

Choi and others, 2015; Kaji and others, 2019; Ruan and others, 2019; Ahuja and others, 2020a). While such imputa-

tion approaches may improve efficiency under correct specification of the imputation model, they are subject

to significant bias if the imputation model is misspecified. In addition, these existing methods do not allow for

useofcurrentstatus labels fortraining.Othergeneralaugmentedinverseprobabilityweightingmethodsinthe

missing data literature (Seaman and White, 2013; Rotnitzky and Robins, 2014, e.g.) are not directly applicable

here since the probabilities of labels being observed tend to zero in the SSL setting.

We address this shortcoming by proposing Semi-supervised Calibration of Risk with Noisy Event Times

(SCORNET) for estimation of ((C). SCORNET utilizes current status labels while also employing a robust semi-

supervised imputation approach on the extensive unlabeled set to maximize survival estimation efficiency. To

mitigateimputationbiasandmaximizeefficiencygainfromtheunlabeleddata,SCORNETutilizesahighlyflexi-

3

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


blesemi-non-parametrickernelregressionmodelwithEHRfeaturesascovariates,whichensuresthevalidityof

the resulting risk estimator without requiring the imputation model to hold. In addition to providing theoret-

ical justifications for the SCORNET estimator, we illustrate via simulation studies that SCORNET substantially

outperforms existing methods with regards to the bias-variance tradeoff. The rest of the paper is organized

as follows. In Section 2, we detail the SCORNET procedure along with its associated inference procedures. In

Section3, wereport riskestimation performancerelative toexisting methods in diversesimulation studies.To

further illustrate the utility of SCORNET in clinical applications, we apply it to a real-world EHR study estimat-

ing the risk of heart failure among rheumatoid arthritis patients in Section 4. Finally, in Section 5 we briefly

discuss the strengths, weaknesses, and potential applications of SCORNET.

2 Methods

2.1 Setup

Let) denotetheeventtimeforwhichweareinterestedinestimatingacumulativedistributionfunction �(C) =

%() ≤ C) and survival function ((C) = 1−�(C). In the EHR study we do not observe) but ratherΔ = �() ≤ �)

for a small labeled subset, where � is the follow up time with finite support [0, g2]. For all subjects, we also

observe a set of baseline covariates `0 and longitudinal EHR features `. Since codes used in the EHR are often

highlysensitivebutnotspecific, thereoftenexistssomefiltervariableF ∈ {0, 1} suchthatΔ8 | (F8 = 0,W8) = 0

almostsurely,whereW8 = (`T0,8, `
T
8
)T.Moreover,weassumethat (), `)� | `0.Weassumethatdataforanalysis

consistofasmall setof= current-status-labeledobservationsrandomlyselectedamongthosewithF = 1 along

with a larger set of # unlabeled observations:D = {D8 = (�8,+8Δ8,W8,+8,F8)T, 8 = 1, ..., #} = L∪U, where

L = {(�8,Δ8,W8, 1, 1)T : F8 = 1,+8 = 1, 8 = 1, ...,=} andU = {(�8, 0,W8, 0,F8)T : +8 = 0, 8 = = + 1, ..., #} with

log(#)/log(=) → a0 > 3/2 as = →∞.

4

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Sincethecensoring� maydependon `0,we followthe IPCWstrategyofvanderLaanandRobins (1998) to

weight observations by

lC,1(� | `0) =
 1(� − C)
52(C | `0)

where  1(B) =  (B/1)/1, 52(C | `0) = 3�2(C | `0)/3C, �2(C | `0) = %(� ≤ C | `0),  (·) is some symmetric

density function, and 0 < 1 = $(=−a) is the bandwidth thereof with a ∈ (1/5, 1/3]. IPCW enables consistent

estimation of functionals of) ≤ C and W since for any reasonable choice of function @(·) and 0, 3 ∈ {0, 1},

�
{
Δ
3
8 @(W8)F

0
8 lC,1(�8 | `0,8)

}
= �

{
�()8 ≤ C)3@(W8)F08

}
+$(12). (1)

The IPCW estimator for c(C) = %()8 ≤ C | F8 = 1) proposed by van der Laan and Robins (1998) essentially

corresponds to

(C) =
∑=
8=1 F8Δ8lC,1(�8 | `0,8)∑=
8=1 F8lC,1(�8 | `0,8)

with 52(C | `0) in lC,1(�8 | `0,8) replaced by a consistent estimator that converges faster than =−1/4, which

is not difficult to achieve under reasonable modeling assumptions since �8 | `0,8 can be estimated using the

full data D. To this end, we propose to derive an estimator for the conditional density 52(C | `0) = _2(C |

`0,8)(2(C | `0,8)byimposingasemi-parametricmodel for� | `0.Althoughmanycommonlyemployedmodels

can be used since once again � is fully observed for all patients, we illustrate our proposal by focusing on the

Cox proportional hazards model (Cox, 1972) under which

_2(C | `0,8) = _02(C)4$
T`0,8 and (2(C | `0,8) ≡ 1 −�2(C | `0,8) = exp

{
−Λ02(C)4$

T`0,8
}
, (2)

where _2(C | `0,8) is the conditional hazard function of �8 | `0,8, _02(C) is the unknown baseline hazard func-

tion, Λ02(C) =
∫ C

0
_02(B)3B, and $ is the vector of unknown covariate effects.

2.2 SCORNET Estimation

As outlined in Figure 1, SCORNET consists of three steps: (1) estimating the conditional censoring distribution

ℎ (C | `0)usingD; (2)fittingan imputationworkingmodel for c(C | W) ≡ %() ≤ C | W,F = 1)usingL, denoting

5

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


theestimateof c(C | W) as (C | W); and (3)estimating ((C) bymarginalizing (C | W)F+Δ (1 −F) = (C | W)Fvia

IPCW.

Figure 1: Schematic of the SCORNET algorithm.

2.2.1 Step 1: Estimate 52 (C | `0) Under the Cox Model for � | `0

Toestimate 52(C | `0),wefittheCoxmodel (2) tothefulldataD toobtainthepartial likelihoodestimator $̂ for

$. We subsequently estimate Λ0(C) and _02(C) respectively as the standard Breslow estimator Λ̂02(C) and the

kernel-smoothed Breslow estimator _̂02(C) (Basha and Hoxha, 2019), where

Λ̂02(C) =
#∑
9=1

�(�9 ≤ C)∑#
8=1 �(�8 ≥ �9)exp

(
$̂T`0,8

) , _̂02(C) = #∑
9=1

 0#
(
�9 − C

)∑#
8=1 �(�8 ≥ �9)exp

(
$̂T`0,8

) ,
and 0# = $(#−a2) with a2 ∈ (1/5, 1/3]. We then obtain

_̂2(C | `0) = _̂02(C)4$̂
T`0,8 and (̂2(C | `0) = exp

{
−Λ̂02(C)4$̂

T`0,8
}
,

and we estimate 52(C | `0,8) as 5̂2(C | `0,8) = _̂2(C | `0,8)(̂2(C | `0,8). Following standard asymptotic results

for non-parametric kernel regression (Pagan and Ullah, 1999), it is not difficult to show that sup`0,C | 5̂2(C |

6

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


`0) − 52(C | `0)| = $?{log(#)1/2(#0#)−1/2} = >?(). We denote the resulting estimate for the censoring

weight as l̂1=(C | `0,8) =  1=(�8 − C)/ 5̂2(C | `0,8).

2.2.2 Step 2: Estimate an Imputation Model c(C | W8) ≡ % ()8 ≤ C | W8,F8 = 1)

To leverage the unlabeled data, we fit a flexible imputation working model

c(C | W8) = 6
{
"(C) + #0(C)

T ®̀0,8 + #(C)T`8
}
= 6

{
)(C)T ®W8

}
(3)

where `8 denotes theEHRsurrogate features, )(C) =
(
U(C), #0(C)T, #(C)T

)T, and ®W8 = (1, ®̀T0,8, `T8 )T.Under (3),
%()8 ≤ C | W8,F8 = 1) = 6

{
)(C)T ®W8

}
, and hence we may estimate )(C) as )̂(C) =

(
Û(C), #̂0(C)T, #̂(C)T

)T
, the

solution to the IPCW estimating equation evaluated withL,

=∑
8=1

l̂C,1=(�8 | `0,8)F8 ®W8
{
Δ8 −6

(
)T ®W8

)}
= 0,

where 1= = $(=−a) with a ∈ (1/5, 1/2). In practice, 1= can be chosen via either standard cross-validation or

heuristicplug-invalues.ForafutureobservationwithfilterstatusF8 = 1 andcovariatesW8,weimpute �()8 ≤ C)

as the conditional risk ĉ(C | W8) = 6
{
)̂(C)T ®W8

}
.

It is not difficult to show that )̂(C) converges in probability to )̄(C), the solution to the limiting estimating

equation

�

[
®W8

{
�()8 ≤ C) −6()T ®W8)

}
| F8 = 1

]
= 0,

which ensures that

� {c̄(C | W8) | F8 = 1} = %()8 ≤ C | F8 = 1), where c̄(C | W8) = 6{)̄(C)T ®W8}, (4)

regardless of the adequacy of the imputation model (3).

2.2.3 Step 3: Estimate �(C) by Marginalizing Imputed Risks

Finally, we marginalize the imputed values ĉ8C = ĉ(C | W8) ∀F8 = 1 and Δ8 = 0 ∀F8 = 0 to estimate �(C). Since

F8 depends on �8, we again employ IPCW to marginalize ĉ8CF8 +Δ8(1 −F8) = ĉ8CF8 and thereby construct our

7

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


final estimator for �(C):

�̂(C) =
∑#
8=1{ĉ8CF8 +Δ8(1 −F8)}l̂C,0# (�8 | `0,8)∑#

8=1 l̂C,0# (�8 | `0,8)
=

∑#
8=1 ĉ8CF8l̂C,0# (�8 | `0,8)∑#
8=1 l̂C,0# (�8 | `0,8)

.

2.3 Inference for �̂(C)

Following standard theory for non-parametric kernel regression (Pagan and Ullah, 1999), we show in the Sup-

plementary Materials that �̂(C) → %()8 ≤ C,F8 = 1) + %()8 ≤ C,F8 = 0) = %()8 ≤ C) = �(C) in probability

under mild regularity conditions and correct specification of the censoring model regardless of the adequacy

of the imputation model. Here, we note that for any C ∈ [0, g2], 0 = %(Δ8 = 0 | F8 = 0,�8 = C,W8) implies that

%()8 ≤ C | F8 = 0) = 0. Furthermore,

(=1=)1/2
{
�̂(C) −�(C)

}
=

(
1=

=

)1/2 =∑
8=1

l̂C,1=(�8 | `0,8)F8 {Δ8 − c̄(C | W8)} +>?(1)

=

(
1=

=

)1/2 =∑
8=1

lC,1=(�8 | `0,8)F8 {Δ8 − c̄(C | W8)} +>?(1)

since supC | 5̂2(C | `0) − 52(C | `0)| = >?(). It follows that (=1=)1/2{�̂(C) −�(C)} is asymptotically normal with

mean 0 and variance

f
2(C) = '( )�{V(C | `0,8)/ 52(C | `0,8)}, (5)

where

V(C | `0,8) = �[F8{�()8 ≤ C) − c̄(C | W8)}2 | `0,8]

and '( ) =
∫
 (G)23G. Our derivation for the asymptotic distribution of �̂(C) can effectively ignore the vari-

ability associated with the estimation of censoring weights, which simplifies the asymptotic variance f2(C).

Importantly, f2(C) decreases as the imputation model approximates c(C | W8) better since V(C | `0,8) =

�[F8{�()8 ≤ C) − c(C | W8)}2 | `0,8] + �[F8{c(C | W8) − c̄(C | W8)}2 | `0,8] decreases. To estimate f2(C)

in practice, one may construct a plug-in estimator,

f̂
2(C) =

1=

=

=∑
8=1

l̂C,1=(�8 | `0,8)
2F8

{
Δ8 −6

(
)̂(C)T ®W8

)}2
.

8

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


By contrast, the supervised IPCW estimator that incorporates filter negative patients takes the form

�̂(C) =
∑#
8=1{ĉ(C)F8 +Δ8(1 −F8)}l̂C,0# (�8 | `0,8)∑#

8=1 l̂C,0# (�8 | `0,8)
=

∑#
8=1 ĉ(C)F8l̂C,0# (�8 | `0,8)∑#

8=1 l̂C,0# (�8 | `0,8)
.

The asymptotic variance of (=1=)1/2{�̂(C) −�(C)} is then f2(C) = '( )�{V(C | `0,8)/ 52(C | `0,8)}, where

V(C | `0,8) = �
[
F8{�()8 ≤ C) −c(C)}2 | `0,8

]
.

The variance f2(C) is equivalent to that of SCORNET if and only if the feature set W is uninformative for) (i.e.

)W). Supervised IPCWisotherwise lessefficient,withrelativeefficiencycontrolledbytherelativemagnitudes

of the marginal error �[{�()8 ≤ C) − �(C)}2 | F8 = 1, `0,8] and the conditional error �[{�()8 ≤ C) − c̄(C |

W8)}2 | F8 = 1, `0,8].

3 Simulation Study

We conduct extensive simulation experimentation to evaluate the finite sample performance of the proposed

SCORNET estimator in realistic settings with = ∈ {100, 200} observed labels within the set of filter-positive

patients, defining the filter to have 99% sensitivity and 88% specificity for Δ. We compare SCORNET to three

existing survival function estimators with current status data: 1) parametric Weibull Accelerated Failure Time

(AFT) regression with interval event times (Lin andothers, 2019), 2) semi-parametric Cox Proportional Hazards

regression with interval event times and Breslow baseline hazard estimation (Huang, 1996; Cox, 1972; Breslow,

1972), and3)non-parametric IPCWestimation(vanderLaanandRobins,1998).Weincorporate thefilter inthe

Weibull and Cox models by setting Δ8 | (F8 = 0) = 0 and weighting the = labeled filter-positive patients by

1
=

∑#
8=1 F8. Weibull and Cox are implemented using the icenReg package in R, while IPCW is implemented per

the algorithm detailed in van der Laan and Robins (1998), estimating � | `0 usingDunder the Cox model. We

note that estimating the censoring distribution usingLyields similar asymptotic performance to usingD, but

in finite sample settings the latter offers higher efficiency.

9

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Setting `0 ∼ � | `0 ∼ ) | `0 ∼ ` ∼

1 Unif(−1, 1) Weibull
(
104−0.5`0, 32

)
Weibull

(
154−0.3`0, 52

)
Normal{) + 1,f(�)/2}

2 Unif(−1, 1) Weibull
(
104−0.5`0, 32

)
Weibull

(
15, 52

)
Normal{) + 1,f(�)/2}

3 Unif(−1, 1) Weibull
(
104−0.5`0, 32

)
Weibull

(
154−0.3`

2
0, 52

)
Normal{) + 1,f(�)/2}

4 Unif(−1, 1) Weibull
(
104−0.5`0, 32

)
Logistic (15 − 4`0, 3) Normal{) + 1,f(�)/2}

5 Unif(−1, 1) Weibull
(
10, 32

)
Weibull

(
154−0.3`0, 52

)
Normal{) + 1,f(�)/2}

6 Unif(−1, 1) Weibull
(
104−0.5`

2
0, 32

)
Weibull

(
154−0.3`0, 52

)
Normal{) + 1,f(�)/2}

Table 1: Generative parameters employed in our simulation study.

Weconsider6diversegenerativemechanismsasdetailedinTable3,includingcaseswhereWeibull-distributed

accelerated failure time of ) | `0, proportional hazards of ) | `0, and proportional hazards of � | `0 are re-

spectively violated, as well as cases where SCORNET’s imputation model is and is not misspecified. In settings

1, 2, and 5, we consider various cases where SCORNET and all comparator methods are correctly specified. In

setting 1 we consider the specific case where � and ) both depend on `0, and both � | `0 and ) | `0 are

Weibull-distributed satisfying accelerated failure time and proportional hazards. In setting 2, by contrast, we

consider a case where )`0 to assess robustness to over-parametrization of this relationship, and in setting 5

we consider a case where �`0 to evaluate robustness to over-parametrization thereof. In settings 3 and 4 we

assess the benefit of SCORNET and IPCW’s robustness to the distribution of ) | `0 when this distribution sat-

isfies neither Weibull accelerated failure time nor proportional hazards. We evaluate SCORNET’s sensitivity to

misspecification of the imputation model in settings 1, 3, and 5, as compared to correct specification thereof

in settings 2 and 4. Finally, in setting 6 we assess the sensitivity of SCORNET and IPCW to misspecification of

theconditionalcensoringmodel� | `0. Foreachgivenconfiguration,wecomputetheempiricalbias, standard

error, and root mean squared error (RMSE) of all estimators for �(C) based on their average performance on

10

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


500 simulated datasets evaluated at 100 equally-spaced time points C ∈ [&�(0.1) + 1=,&�(0.9) − 1=], where

& denotes the quantile function of � under the configuration. We used plug-in bandwidths of 1= = B̂(�)=−1/4

and 0# = B̂(�)#−1/4 for the imputation (Step 2) and marginalization (Step 3) steps of SCORNET respectively,

where B̂ is the empirical standard deviation of observed �. We present the performance of the estimators av-

eraged over the selected time points using = = 200 labels in Figure 2. The performance at each time point can

be found in Supplementary Figure 1, and time-averaged performance using = = 100 labels can be found in

Supplementary Table 1 of the Supplementary Materials.

Figure 2: Time-averaged empirical absolute biases (left), standard errors (second from left), root relative efficiencies

(second from right), and relative RMSEs (right) of the Weibull Accelerated Failure Time (red), Cox Proportional Hazards

w/ Breslow baseline (blue), supervised IPCW (green), and SCORNET estimators using weakly informative (purple) and

strongly informative (orange) surrogates, in various simulated settings with = = 200 observed current status labels.

11

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


As Figure 2 demonstrates, imputing using a strongly informative feature ` (SCORNET-Strong) results in

consistently higher efficiency than just using the weakly informative baseline `0 (SCORNET-Weak), which in

turn is markedly more efficient than not leveraging the unlabeled set at all (IPCW). SCORNET makes minimal

assumptions regarding the distribution of ) | `0, settling for non-parametric efficiency in exchange for en-

hanced flexibility. By contrast, the Weibull regression model fully parametrizes ) | `0, and the Cox model

assumes proportional hazards thereof, increasing efficiency at the expense of bias in the case of misspecifica-

tion. As expected, Weibull consistently achieves higher empirical efficiency than Cox, which in turn is more

efficient than IPCW across settings. Notably, SCORNET consistently achieves empirical efficiency comparable

to Weibull and significantly higher than Cox despite being far more flexible than both, again highlighting the

efficiencygainedbyleveragingauxiliaryinformationtoimputeunobservedrisks.Atthesametime,SCORNETis

muchlesssusceptibletomodelmisspecificationbias thanWeibull, asdemonstratedbythe latter’s significantly

higherbiasandRMSEinSetting4. Indeed,SCORNETachievesrelatively lowmeanabsolutebiasacrosssettings,

with MSE apparently dominated by variance rather than bias in the setting of 100-200 labels. Consistent with

the theory, SCORNET is robust to misspecification of the imputation model in settings 1, 3, and 5, achieving

equivalently insignificantbiasas insetting2andmarginallybutnotmeaningfullyhigherbias thaninsetting4.

That said, correctnessof the imputationmodel insettings2and4doesnotyieldanymeaningful change inrel-

ative efficiency, likely because inherent variability functionally dominates imputation model bias given so few

labels. Reassuringly, SCORNET (and IPCW) appear insensitive to misspecification of � | `0 in setting 6, achiev-

ing functionally equivalent bias to the correctly-specified Weibull and Cox models. Altogether, these results

corroborate the assertion that SCORNET’s semi-supervised utilization of informative feature data to impute

risks in the unlabeled set improves estimation efficiency without introducing bias regardless of the validity of

the imputationmodel.Moreover, theysuggest thatSCORNETisparticularlyvaluable insettingswhere (1)flex-

12

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


ibility is desired with regard to the distribution of) | `0, and (2) there exists a large set of unlabeled patients

with associated EHR data – both commonplace in retrospective observational studies.

Figure3:Empiricalcoverageprobabilitiesaveragedovertime(left)andplottedovertime(right)ofSCORNET-Strong’s95%

confidence intervals constructed with the bootstrap (red) and plug-in (blue) standard error estimators in various simu-

lated settings with = = 200 observed current status labels. See Table 1 for details of the generative mechanism employed

in each setting.

Toassessthefinitesampleperformanceoftheproposedintervalestimationprocedures,weobtainstandard

errorestimatesbothusingtheproposedplug-inestimator f̂(C) andviabootstrapwith500replicates. InFigure

3 we demonstrate empirical coverage probabilities of SCORNET’s 95% Wald confidence intervals constructed

using each standard error estimator, both averaged over the selected timepoints (left) and at each timepoint

(right). Reassuringly, we find that the 95% confidence intervals using both plug-in and bootstrap estimators

achievenearly 95% meancoverageacrosssettings.Coverageonlydropsbelow 90% at thetailsof theevent time

13

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


support due to moderately increased bias from kernel smoothing thereabout. The plug-in estimator achieves

marginally lower coverage than the bootstrap estimator at the right tail due to underestimation of the true

standard error, likely because of overfitting of the imputation model given low local censoring density (and

thus low effective #). Notably, we do not observe this trend in setting 4, wherein correct specification of the

imputation model obviates overfitting. Thus, we posit that the plug-in estimator can be reliably used for fi-

nite sample problems with = ∈ [100, 200] labels as long as the conditional censoring density 52(� | `0) is

sufficiently high and the timepoints evaluated are sufficiently far from the tails of the event time support.

4 Application to Assessing Heart Failure Risk Among Rheumatoid Arthritis

Patients

Rheumatoidarthritis (RA),achronicinflammatorydiseasethataffectsapproximately 1% ofthegeneralpopula-

tion, is associated with dramatically increased risk of heart failure (HF) morbidity and mortality (Kaplan, 2010;

Nicola andothers, 2005, 2006; Ahlers andothers, 2020). One study estimated that RA patients have a 1.9-fold life-

time risk of developing HF compared to matched RA-negative controls (Nicola andothers, 2005), while another

estimated that HF accounts for 13% of excess mortality among RA patients (Nicola and others, 2006). Ongoing

interest lies in estimating the risk of developing HF subtypes in RA cohorts and quantifying the risk modifying

effect of various RA treatments (Ahlers and others, 2020). Due to the increased availability of electronic health

record (EHR) data, it is now possible to assess HF risk for a broader RA population using these data. For ex-

ample, at Mass General Brigham we previously established an EHR cohort of #0 = 16, 358 RA patients (Huang

and others, 2020). This large RA cohort can potentially be used to study the longitudinal risk of HF among RA

patients.

However, such analysis is not straightforward as HF status is not readily available within the RA cohort. We

propose to estimate HF risk among RA patients by leveraging (1) = current status labels on HF status obtained

14

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


via manual chart review, and (2) informative yet unlabeled EHR data, including time to first ICD code for HF,

as surrogate variables `. We estimate both the age-specific HF risk, �age(·), and the risk of developing HF after

the patient’s incident ICD code for RA (714), �RA+(·), among patients with at least 6 months of follow up whose

incident RA codes occur after the age of 16 to select for adult-onset as opposed to juvenile RA. Among filter-

positive patients, defined as having at least 1 ICD code for HF, we have = = 300 labels on censoring time HF

status Δ for age-specific HF risk, and we have = = 126 for post-RA HF risk. We let the baseline covariates `0

includesexanddecadeoffirstEHReventfor �age(·), andsex,decadeoffirstRAcode,andageatfirstRAcodefor

�RA+(·).WeobtainHFriskestimatorsbasedonSCORNETaswellas theaforementionedcomparatorestimators.

For the imputation model in Step 2 of SCORNET, we consider three EHR-derived surrogate risk predictors for

`: (1) thepredictedΔ basedontheunsupervisedMultimodalAutomatedPhenotyping(MAP)algorithm,which

uses the total counts of HF ICD codes and mentions of HF in clinical notes, as well as the total count of all ICD

codesasahealthcareutilizationmeasure (Liaoandothers, 2019), (2) thepredictedΔ basedontheunsupervised

Surrogate-guidedEnsembleLatentDirichletAllocation(sureLDA)algorithm,which leverages thefeaturesused

in MAP as well as 121 additional manually-selected EHR features including counts of relevant medications,

ICD codes, and concept unique identifiers (CUIs) in clinical notes (Ahuja and others, 2020b); and (3) the time to

first HF ICD code. As in our simulation, we select plug-in bandwidths of 1= = B̂(�)=−1/4 and 0# = B̂(�)#−1/4

for the imputation and marginalization steps of SCORNET respectively, and we evaluate risk at 100 timepoints

C ∈ [&�(0.1) +1=,&�(0.9)−1=]. We again compare the performance of SCORNET to that of Weibull, Cox, and

IPCW,incorporatingthefilter intheWeibullandCoxmodelsbypropensityweightingaswedointhesimulation

study.

In Figure 4, we show the estimated HF risk curves along with their standard errors. Reassuringly, all meth-

ods appear to agree rather closely for estimation of both age-specific HF risk and HF risk after RA diagnosis.

For the latter quantity, however, Weibull and Cox appear to underfit while IPCW appears to overfit relative to

15

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Figure4:Estimatedage-specificandpost-RAcumulativerisksofheart failure(top)andbootstrapstandarderrorsthereof

(bottom)overtimeoftheWeibullAcceleratedFailureTime(red,short-long-dashed),CoxProportionalHazardsw/Breslow

baseline (blue, dot-dashed), supervised IPCW (green, dashed), and SCORNET (purple, solid) estimators.

theSCORNETestimator,whichappears toachieveareasonablemiddleground.Moreover,SCORNETonceagain

attainsstandarderrorscomparable to thoseof theWeibull estimatorandmeaningfully lowerthanthoseof the

Cox and IPCW estimators. This suggests that while the Weibull and Cox models potentially fail to capture the

complexity of the post-RA HF risk function, and IPCW is too unstable for a limited labeled set of size = = 126,

SCORNET offers an attractive balance of efficiency and flexibility and is thus well conditioned for such a sce-

nario. As shown in Figure 5, averaged over the selected timepoints, the root relative efficiency of SCORNET

is 1.11, 2.55, and 3.31 compared to the Weibull, Cox, and IPCW estimators respectively for estimation of age-

specific risk, and 1.34, 2.32, and 3.85 respectively for estimation of HF risk after RA diagnosis. Once again, the

fact that SCORNET achieves efficiency moderately higher than the relatively inflexible Weibull model and sig-

16

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


nificantly higher than the Coxand IPCW estimators reflects the value of leveraging available information from

the EHR to bolster risk estimation efficiency.

Figure 5: Time-averaged bootstrap standard errors (left) and empirical root relative efficiencies (right) of the Weibull

AcceleratedFailureTime(red),CoxProportionalHazardsw/Breslowbaseline(blue), IPCW(green),andSCORNET(purple)

estimators for estimation of (1) age-specific HF risk (left), and (2) HF risk after RA diagnosis (right), among RA patients in

the Partners EHR database.

5 Discussion

By leveragingasizeableunlabeleddatasetcontaining imperfect surrogatesof the trueevent timesandasmall

set with observed current status labels, the SCORNET estimator serves as a robust and efficient alternative

to existing model-free survival estimators with current status data. The semi-supervised nature of SCORNET

makes it well-conditioned to EHR-based survival estimation in settings where only a limited number of labels

are available or readily obtainable. Moreover, by only requiring current status labels rather than the precise

17

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


timing of event onset, SCORNET greatly reduces the burden of chart review and increases the feasibility of

studying disease risk using EHR data.

To allow for covariate-dependent censoring, which is frequently present in observational settings, SCOR-

NET requires additional assumptions on the distribution of � | `0. Although we choose the proportional haz-

ards model for illustration, any standard semi-parametric model will yield similar properties for the resulting

estimator. Since {�, `0} are observed for all subjects, one can potentially allow for more flexible (i.e. non-

parametric) censoring models. That said, our simulation results suggest that SCORNET is relatively insensitive

to misspecification of the model for � | `0. Even under mild misspecification, it achieves consistently lower

mean squared errors than existing estimators.

When interest lies in assessing how risk differs across different patient sub-populations, it is straightfor-

wardtoextendSCORNETtoestimatesubgroup-specificrisks forasmallnumberof subgroups.However, future

research is warranted to estimate covariate-specific risks for a general set of covariates.

6 Software

An R package, including a sample use case and complete documentation, is available at

https://cran.r-project.org/web/packages/SCORNET/index.html. Source code can be found at

https://github.com/celehs/SCORNET.

Funding

ThisworkwassupportedbytheU.S.National InstitutesofHealthGrantsT32-AR05588512,T32-GM7489714,and

R21-CA242940.

18

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Acknowledgements

The authors declare no conflicts of interest.

References

Ahlers, Michael J., Lowery, Brandon D., Farber-Eger, Eric, Wang, Thomas J., Bradham, William, Orm-

seth, Michelle J., Chung, Cecilia P., Stein, C. Michael and Gupta, Deepak K. (2020). Heart failure risk

associated with rheumatoid arthritis-related chronic inflammation. Journal of the American Heart Association,

9.

Ahuja, Yuri, Hong, Chuan, Xia, Zongqi and Cai, Tianxi. (2020a). Samgep: A novel method for prediction of

phenotype event times using the electronic health record. Preprint.

Ahuja, Yuri, Zhou, Doudou, He, Zeling, Sun, Jiehuan, Castro, Victor M, Gainer, Vivian, Murphy,

Shawn N, Hong, Chuan and Cai, Tianxi. (2020b). surelda: A multidisease automated phenotyping method

for the electronic health record. Journal of theAmericanMedical InformaticsAssociation 27(8), 1235–1243.

Bair, Eric and Tibshirani, Robert. (2004). Semi-supervised methods to predict patient survival from gene

expression data. PLoSBiology 2(4), E108.

Basha,LuleandHoxha,Fatmir. (2019). Kernelestimationofthebaselinefunctioninthecoxmodel. European

Scientific Journal 15(6), 105–118.

Breslow,NormanE. (1972). Discussionofprofessorcox’spaper. Journalof theRoyalStatisticalSociety,SeriesB34,

216–217.

Chai,Hua,Li,Zi-na,Meng,De-yu,Xia,Liang-yongandLiang,Yong. (2017). Anewsemi-supervisedlearning

model combined with cox and sp-aft models in cancer survival analysis. ScientificReports 7(13053).

19

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Choi, Edward, Du, Nan, Chen, Robert, Song, Le and Sun, Jimeng. (2015). Constructing disease network and

temporal progression model via context-sensitive hawkes process. IEEE Computer Society. pp. 101–108.

Chubak, Jessica, Yu, Onchee, Pocobelli, Gaia, Lamerato, Lois, Webster, Joe, Prout, Marianna N, Yood,

Marianne Ulcickas, Barlow, William E and Buist, Dianna SM. (2012). Administrative data algorithms

to identify second breast cancer events following early-stage invasive breast cancer. Journal of the National

Cancer Institute 104(12), 931–940.

Cipparone, Charlotte W, Withiam-Leitch, Matthew, Kimminau, Kim S, Fox, Chet H, Singh, Ranjit and

Kahn, Linda. (2015). Inaccuracy of icd-9 codes for chronic kidney disease: A study from two practice-based

research networks (pbrns). The Journal of theAmericanBoardof FamilyMedicine 28(5), 26094.

Cox,DavidR. (1972). Regressionmodelsandlife-tables. Journalof theRoyalStatisticalSociety.SeriesB34, 187–220.

Dean, Bonnie B, Lam, Jessica, Natoli, Jaime L, Butler, Qiana, Aguilar, Daniel and Nordyke, Robert J.

(2003). Use of electronic medical records for health outcomes research: A literature review. Medical Care

ResearchandReview 31(6), 611–638.

Golub,T.R., Slonim,D.K.,Tamayo,P.,Huard,C.,Gaasenbeek,M.,Mesirov, J.P.,Coller,H.,Loh,M.L.,Down-

ing, J.R., Caligiuri, M.A., Bloomfield, C.D. and others. (1999). Molecular classification of cancer: Class dis-

covery and class prediction by gene expression monitoring. Science 286(5439), 531–537.

Hassett, Michael J, Uno, Hajime, Cronin, Angel M, Carroll, Nikki M, Hornbrook, Mark C and

Ritzwoller, Debra. (2017). Detecting lung and colorectal cancer recurrence using structured clini-

cal/administrativedatatoenableoutcomesresearchandpopulationhealthmanagement.MedicalCare55(12),

e88–e98.

20

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Hodgkins, Adam J, Bonney, Andrew, Mullan, Judy, Mayne, Darren John and Barnett, Stephen. (2017).

Survival analysis using primary care electronic health record data: A systematic review of the literature.

Health InformationManagement Journal 47(1), 6–16.

Hripcsak, George and Albers, David J. (2012). Next-generation phenotyping of electronic health records.

Journal of theAmericanMedical InformaticsAssociation 20(1), 117–121.

Huang, Jian. (1996). Efficient estimation for the proportional hazards model with interval censoring. The

Annals of Statistics 24(2), 540–568.

Huang, Sicong, Huang, Jie, Cai, Tianrun, Dahal, Kumar P, Cagan, Andrew, He, Zeling, Stratton, Jack-

lyn, Gorelik, Isaac, Hong, Chuan, Cai, Tianxi and others. (2020). Impact of icd10 and secular changes on

electronic medical record rheumatoid arthritis algorithms. Rheumatology.

Kaji, Deepak A, Zech, John R, Kim, Jun S, Cho, Samuel K, Dangayach, Neha S, Costa, Anthony B and Oer-

mann, Eric K. (2019). An attention based deep learning model of clinical events in the intensive care unit.

PLoSOne 14(2), e0211057.

Kaplan, Mariana J. (2010). Cardiovascular complications of rheumatoid arthritis - assessment, prevention.

and treatment. RheumaticDiseaseClinics ofNorthAmerica 36(2), 405–426.

Kohane, Isaac S, Churchill, Susanne E and Murphy, Shawn N. (2012). A translational engine at the na-

tional scale: informatics for integrating biology and the bedside. Journal of the American Medical Informatics

Association 19(2), 181–185.

Liang, Yong, Chai, Hua, Liu, Xiao-Ying, Xu, Zong-Ben, Zhang, Hai and Leung, Kwong-Sak. (2016). Cancer

survival analysis using semi-supervised learning method based on cox and aft models with l1/2 regulariza-

tion. BMCMedicalGenomics 9(11), 11.

21

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Liao, Katherine P, Sun, Jiehuan, Cai, Tianrun A, Link, Nicholas, Hong, Chuan, Huang, Jie, Huffman,

Jennifer E, Gronsbell, Jessica, Zhang, Yichi, Ho, Yuk-Lam, Castro, Victor, Gainer, Vivian, Murphy,

ShawnN,O’Donnell,ChristopherJ,Caziano,JMichael,Cho,Kelly,Szolovits,Peter,Kohane,IsaacS,

Yu, Sheng and others. (2019). High-throughput multimodal automated phenotyping (map) with application

of phewas. Journal of theAmericanMedical InformaticsAssociation 26(11), 1255–1262.

Lin,Hung-Mo,Williamson,JohnMandKim,Hae-Young. (2019). Firthadjustmentforweibullcurrent-status

survival analysis. Communications inStatistics -TheoryandMethods 49(18), 4587–4602.

Liu, Bin, Li, Ying, Sun, Zhaonan, Ghosh, Soumya and Ng, Kenney. (2018). Early prediction of diabetes com-

plicationsfromelectronichealthrecords:Amulti-tasksurvivalanalysisapproach. In:The32ndAAAIConference

onArtificial Intelligence. Association for the Advancement of Artificial Intelligence. pp. 101–108.

Miotto, Riccardo, Li, Li, Kidd, Brian A and Dudley, Joel T. (2016). Deep patient: an unsupervised represen-

tation to predict the future of patients from the electronic health records. ScientificReports 6(6), 26094.

Nicola,PauloJ.,Crowson,CynthiaS.,Maradit-Kremers,Hilal,Ballman,KarlaV.,Roger,VeroniqueL.,

Jacobsen, Steven J. and Gabriel, Sherine E. (2006). Contribution of congestive heart failure and ischemic

heart disease to excell mortality in rheumatoid arthritis. Arthritis Rheumatology 54(1), 60–67.

Nicola, Paulo J., Maradit-Kremers, Hilal, Roger, Veronique L., Jacobsen, Steven J., Crowson, Cyn-

thia S., Ballman, Karla V. and Gabriel, Sherine E. (2005). The risk of congestive heart failure in rheuma-

toid arthritis: a population-based study over 46 years. Arthritis Rheumatology 52(2), 412–420.

Pagan, Adrian and Ullah, Aman. (1999). Nonparametric econometrics. Cambridge university press.

Panahiazar, Maryam, Taslimitehrani, Vahid, Pereira, Naveen and Pathak, Jyotishman. (2015). Using

ehrs and machine learning for heart failure survival analysis. Studies inHealthTechnologyand Informatics 216,

40–44.

22

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/


Rotnitzky, Andrea and Robins, James M. (2014). Inverse probability weighting in survival analysis. Wiley

StatsRef: StatisticsReferenceOnline.

Ruan, Tong, Lei, Liqi, Zhou, Yangming, Zhai, Jie, Zhang, Le, He, Ping and Gao, Ju. (2019). Representation

learning for clinical time series prediction tasks in electronic health records. BMC Medical Informatics and

DecisionMaking 19(259).

Seaman, Shaun R and White, Ian R. (2013). Review of inverse probability weighting for dealing with missing

data. Statisticalmethods inmedical research 22(3), 278–295.

Steele, Andrew J, Denaxas, Spiros C, Shah, Anoop D, Hemingway, Harry and Luscombe, Nicholas M.

(2018). Machine learning models in electronic health records can outperform conventional survival models

for predicting patient mortality in coronary artery disease. PLoSOne 13(8), e0202344.

Uno,Hajime,Ritzwoller,DebraP,Cronin,AngelM,Carroll,NikkiM,Hornbrook,MarkCandHassett,

MichaelJ. (2018). Determiningthetimeofcancerrecurrenceusingclaimsorelectronicmedicalrecorddata.

JCOClinicalCancer Informatics 2, 1–10.

van der Laan, Mark J and Jewell, Nicholas P. (2003). Current status and right-censored data structures

when observing a marker at the censoring time. TheAnnals of Statistics 31(2), 512–535.

van der Laan, Mark J and Robins, James M. (1998). Locally efficient estimation with current status data and

time-dependent covariates. Journal of theAmericanStatisticalAssociation 93(442), 693–701.

Vardi, Y. (1982). Nonparametric estimation in the presence of length bias. Annals of Statistics 10, 178–203.

Zhao, Yue, Herring, Amy H, Zhou, Haibo, Ali, Mirza W and Koch, Gary G. (2014). A multiple imputation

methodforsensitivityanalysesof time-to-eventdatawithpossibly informativecensoring. JournalofBiophar-

maceutical Statistics 24(2), 229–253.

23

.CC-BY-NC 4.0 International licensemade available under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425976doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.425976
http://creativecommons.org/licenses/by-nc/4.0/