key: cord-0722088-i12de6f2
authors: Shen, Pao-Sheng
title: The nonparametric maximum likelihood estimator for middle-censored data
date: 2011-02-18
journal: J Stat Plan Inference
DOI: 10.1016/j.jspi.2011.02.014
sha: 07815f7f4d5711d2c737bfc2562bd07bf4644189
doc_id: 722088
cord_uid: i12de6f2

In this note, we consider data subjected to middle censoring where the variable of interest becomes unobservable when it falls within an interval of censorship. We demonstrate that the nonparametric maximum likelihood estimator (NPMLE) of the distribution function can be obtained by using Turnbull's (1976) EM algorithm or the self-consistent estimating equation (Jammalamadaka and Mangalam, 2003) with an initial estimator which puts mass only on the innermost intervals. The consistency of the NPMLE can be established based on the asymptotic properties of self-consistent estimators (SCE) with mixed interval-censored data (Yu et al., 2000, 2001).

© 2011 Elsevier B.V. All rights reserved.

Middle censoring occurs when a data point falls inside a random censoring interval and thereby becomes unobservable. For some individuals the exact values are available, while for others only the corresponding intervals of censorship are observed. We mention two situations where middle censoring occurs.
(i) A follow-up study in which the childhood learning center where the observations are being taken is closed for a period because of an external emergency, such as an outbreak of severe acute respiratory syndrome (SARS).
(ii) A clinical trial in which the clinic where the observations are being taken is closed for a period because of an external emergency, such as the outbreak of war or a strike.

In situations (i) and (ii), observation is not possible during a fixed time interval; relative to an individual's lifetime, this fixed interval is in fact a random interval, denoted by $(U,V)$. If some children (or patients) develop a skill (or disease) of interest during this time, we cannot observe their exact age $T$ at the time of skill (or disease) development; we know only that the event of interest occurred during the time interval $(U,V)$. At first glance, middle censoring, where a random middle part is missing, appears to be complementary to double censoring, in which the middle part is what is actually observed. However, careful reflection and analysis show them to be quite different ideas; see Jammalamadaka and Mangalam (2003) for details.

Let $T_i$, $i = 1,\dots,n$, be a sequence of i.i.d. random variables with distribution function $F_0$. Independent of the $T_i$'s, let $(U_i, V_i)$, $i = 1,\dots,n$, be i.i.d. extended real-valued random variables with joint distribution function $K_0(x,y) = P(U_i \le x, V_i \le y)$ such that [...] interval and that prevents us from distinguishing any two distributions which are identical outside $[a,b]$ and differ only on $[a,b]$.

In many censoring situations, if we try to estimate the distribution function via the EM algorithm, the resulting equation takes the form

$$\hat F_S(t) = E_{\hat F_S}[E_n(t) \mid X], \tag{1.2}$$

as described by Tsai and Crowley (1985), where $E_n$ is the empirical distribution function and $X$ denotes the observed data.
This equation was first introduced, and referred to as the self-consistency equation, by Efron (1967). A solution $\hat F_S$ of (1.2) is called a self-consistent estimator (SCE) of $F_0$. For different types of censoring, the relationship between the nonparametric maximum likelihood estimator (NPMLE) and the SCE has been studied by various authors. In the case of right censoring, the product-limit estimator (Kaplan and Meier, 1958) is the NPMLE, and Efron (1967) showed that it is also self-consistent. In the double-censoring case, Mykland and Ren (1996) provided a necessary and sufficient condition for an SCE to be an NPMLE. In the middle-censored case, the self-consistent estimator (SCE) $\hat F_S$ (see Jammalamadaka and Mangalam, 2003) satisfies the following equation:

[...] (1.3)

Jammalamadaka and Mangalam (2003) showed that the NPMLE satisfies the self-consistency equation (1.3). They also pointed out that an SCE provides only a local maximum of the likelihood equation and may not be an NPMLE. Furthermore, they showed that the NPMLE will have all its mass on the uncensored observations, except when a censored interval happens to contain no uncensored observation. The consistency of the SCE $\hat F_S$ was established by Jammalamadaka and Mangalam (2003) for the special case when either $U_i$ or $V_i$ is degenerate. Jammalamadaka and Iyer (2004) proposed an approximation to the distribution function $F_0$ (denoted by $F_0'$) for which a modified self-consistent estimator $\hat F_S'$ was obtained. They established the asymptotic properties of $\hat F_S'$ and provided an upper bound for the difference between $F_0$ and $F_0'$. Mangalam et al. (2008) showed that under condition (A), namely that every censoring interval contains at least one uncensored observation (i.e. $\delta_i = 0$ implies that there exists $j$ such that $\delta_j = 1$ and $X_j \in (U_i, V_i)$), the solution of Eq. (1.3) is unique and, as a consequence, $\hat F_n$ equals an NPMLE. Mangalam et al.
(2008) proposed a technique for obtaining the NPMLE by dividing the original problem into subproblems. In this note, we aim to establish connections between middle censoring and interval censoring by investigating the self-consistency algorithm. In Section 2, we shall demonstrate that the NPMLE of $F_0$ can be obtained by using the EM algorithm of Turnbull (1976) or the self-consistent estimating equation (Jammalamadaka and Mangalam, 2003) with an initial estimator which puts mass only on the innermost intervals. Furthermore, we establish the consistency and asymptotic normality of the NPMLE by using the results of Yu et al. (2000, 2001).

2.1. Self-consistency

Turnbull (1976) characterized the NPMLE in the presence of interval censoring and truncation. Frydman (1994) later corrected Turnbull's characterization. Here, we consider the case when there is no truncation. Following Turnbull (1976), Frydman (1994) and Alioum and Commenges (1996), we consider nonparametric estimation of $F_0$ using the independent observations $A_i$. Since Turnbull's EM algorithm can handle the case when $A_i$ is a single-point set, the connection between middle censoring and interval censoring can be established. Based on the notation defined above, the likelihood is proportional to

$$\prod_{i=1}^{n} P_{F_0}(A_i),$$

where $P_{F_0}(A_i)$ denotes the probability that $F_0$ assigns to the interval $A_i$. We define an NPMLE as $\hat F$ [...]

For censored data, Gentleman and Vandal (2001) used graph theory to present methods for finding the NPMLE of $F_0$. For censored and truncated data, Hudgens (2005) employed a graph-theoretical approach to describe the support set of the NPMLE of $F_0$. Define the innermost intervals $H_j$, $j = 1,\dots,J$, induced by $A_1,\dots,A_n$ to be all the disjoint intervals which are non-empty intersections of these $A_i$'s (e.g. $A_k = A_k \cap A_k$ is an intersection of $A_i$'s) such that $A_i \cap H_j = \emptyset$ or $H_j$ for all $i$ and $j$.
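The construction of the innermost intervals can be illustrated with a short sketch for middle-censored data. It assumes that each $A_i$ is either a single point $\{T_i\}$ (uncensored) or an open interval $(U_i, V_i)$ (censored). The function name `innermost_intervals` and the tuple encoding of open endpoints are illustrative choices, not notation from this note.

```python
from bisect import bisect_left


def innermost_intervals(exact, censored):
    """Return the innermost intervals as (q, p) pairs, sorted by q.

    exact    : uncensored values T_i (each A_i is the point {T_i})
    censored : (U_i, V_i) pairs (each A_i is the open interval (U_i, V_i))
    A pair with q == p stands for the point {q}; a pair with q < p stands
    for the open interval (q, p).
    """
    exact = list(exact)
    censored = list(censored)
    # Encode endpoints as (value, side) so that tuple order respects
    # openness: the left end of (u, v) sorts just after a point at u,
    # and its right end sorts just before a point at v.
    lefts = sorted({(t, 0) for t in exact} | {(u, 1) for u, _ in censored})
    rights = sorted({(t, 0) for t in exact} | {(v, -1) for _, v in censored})

    result = []
    for k, q in enumerate(lefts):
        j = bisect_left(rights, q)  # index of the smallest right endpoint >= q
        if j == len(rights):
            continue
        p = rights[j]
        # (q, p) is innermost only if no other left endpoint lies in (q, p].
        if k + 1 < len(lefts) and lefts[k + 1] <= p:
            continue
        result.append((q[0], p[0]))
    return result
```

For exact observations 1 and 3 with censoring intervals (2, 4) and (5, 7), the sketch returns the points {1} and {3} and the open interval (5, 7): the interval (2, 4) contains the exact observation 3 and so contributes no interval of its own, whereas (5, 7) contains no exact observation and survives as an open innermost interval.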
Let the endpoints of the innermost intervals be $q_j$ and $p_j$, $j = 1,\dots,J$, where

$$0 \le q_1 \le p_1 \le q_2 \le p_2 \le \cdots \le q_J \le p_J \le \infty.$$

Peto (1973) showed that the NPMLE of $F_0$ assigns weight, say $s_1,\dots,s_J$, to the corresponding innermost intervals $H_1,\dots,H_J$ only. Thus, it suffices to maximize [...] over a set $\Omega$ of weight vectors $s \in \mathbb{R}^J$ [...]

Based on the estimators $\hat s_j$, an estimator $\hat F_M(t)$ of $F_0(t)$ can be uniquely defined for $t \in [p_j, q_{j+1})$ by $\hat F_M(p_j) = \hat F_M(q_{j+1}-) = \hat s_1 + \cdots + \hat s_j$, but is not uniquely defined for $t$ in an open innermost interval $(q_j, p_j)$ with $q_j < p_j$. To avoid ambiguity we define $\hat F_M(t) = \hat s_1 + \cdots + \hat s_{j-1} + \hat s_j (t - q_j)/(p_j - q_j)$ if $t \in (q_j, p_j]$ and $0 < q_j < p_j < \infty$. Next, we shall show that the estimator $\hat F_M$ satisfies the self-consistency equation (1.3), i.e.

[...] (2.8)

Theorem 1. The NPMLE $\hat F_M$ satisfies Eq. (2.8).

Proof. First, notice that for each $(q_j, p_j)$ $(j = 1,\dots,J)$ either $q_j = p_j$, or $q_j < p_j$ and there is no uncensored observation in $(q_j, p_j)$. Furthermore, $d_j(\hat s)$ can be written as [...]

Consider an initial estimator $\hat F^{(0)}$ which puts mass only on $(q_j, p_j)$ $(j = 1,\dots,J)$. Let $\hat F^{(1)}$ denote the first-step estimator. Without changing the innermost intervals and the likelihood function, we can transform the data by moving all right-censored points between $p_{j-1}$ and $q_j$ to $p_{j-1}$. Similarly, we move all left-censored points between $p_{j-1}$ and $q_j$ to $q_j$ (see Li et al., 1997). Hence, we have [...]

Hence, $\hat F^{(1)}$ also puts mass only on $(q_j, p_j)$ $(j = 1,\dots,J)$. Next, we consider the following two cases:

Case 1: $q_j = p_j$. When $q_j = p_j$, we have [...]

Case 2: $q_j < p_j$. When $q_j < p_j$, since there are no uncensored observations in $(q_j, p_j)$, we have [...] First, we have $\sum_{i=1}^{n} \delta_i I_{[q_j < T_i < p_j]} = 0$. Next, note that given an interval $(L_i, R_i)$ and $\delta_i = 0$, we either have $(q_j, p_j) \subseteq (L_i, R_i)$ or [...]

This conclusion is the same as in Theorem 1 of Jammalamadaka and Mangalam (2003). However, our proof is based on the EM algorithm of Turnbull (1976).
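To make the self-consistency iteration concrete, the following sketch applies Turnbull-type EM updates to the weights $s_j$ on a given set of innermost intervals. It uses the standard form of the update, $s_j \leftarrow n^{-1} \sum_i \alpha_{ij} s_j / \sum_k \alpha_{ik} s_k$ with $\alpha_{ij} = 1$ if $H_j \subseteq A_i$ and $0$ otherwise, which is assumed here rather than taken from the displayed equations of this note. The function name `em_weights`, the input layout and the stopping rule are hypothetical choices made for illustration.

```python
def em_weights(exact, censored, intervals, tol=1e-10, max_iter=10000):
    """Iterate the self-consistency (EM) update for weights on innermost intervals.

    exact     : uncensored values T_i
    censored  : (U_i, V_i) pairs, treated as open intervals
    intervals : innermost intervals as (q_j, p_j); q_j == p_j means a point
    Returns the list of weights s_j at convergence (or after max_iter steps).
    """
    def inside(q, p, lo, hi, is_point):
        if is_point:                  # A_i is the single point {lo}
            return q == p == lo
        if q == p:                    # point H_j inside the open interval (lo, hi)
            return lo < q < hi
        return lo <= q and p <= hi    # open (q, p) inside open (lo, hi)

    obs = [(t, t, True) for t in exact] + [(u, v, False) for u, v in censored]
    # alpha[i][j] = 1 if innermost interval j is contained in observation set i.
    alpha = [[1.0 if inside(q, p, lo, hi, pt) else 0.0 for q, p in intervals]
             for lo, hi, pt in obs]
    n, J = len(obs), len(intervals)
    s = [1.0 / J] * J                 # initial estimator: mass only on innermost intervals
    for _ in range(max_iter):
        new = [0.0] * J
        for row in alpha:
            denom = sum(a * w for a, w in zip(row, s))  # current mass of A_i
            for j in range(J):
                new[j] += row[j] * s[j] / (denom * n)
        if max(abs(a - b) for a, b in zip(new, s)) < tol:
            return new
        s = new
    return s
```

For exact observations 1 and 3 with censoring intervals (2, 4) and (5, 7), and innermost intervals {1}, {3} and (5, 7), the iteration gives weights of approximately 0.25, 0.5 and 0.25. The censored interval (2, 4) passes its mass to the exact observation 3, while (5, 7) keeps its own mass because it contains no exact observation.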
If condition (A) holds, then $q_j = p_j$ for all $j$, and the solution $\hat s_j$'s place all their mass on the uncensored observations. Furthermore, if we start with an initial estimator which puts weight $1/J$ on $q_j = p_j$ for uncensored observations and on $(q_j + p_j)/2$ for censored observations, we can obtain an NPMLE by using Eq. (2.8). However, as with interval-censored data, the self-consistent NPMLE of $F_0$ is not uniquely defined for $x \in (q_j, p_j)$ if $q_j < p_j$. An SCE with an initial estimator which puts weight on intervals other than $(q_j, p_j)$ can lead to a less efficient estimator (not an NPMLE).

In this section, we shall investigate large sample properties of $\hat F_M$. First, we introduce mixed interval-censored (MIC) data. A data set is called MIC data when it consists of both exact observations and case 2 interval-censored data (i.e. $L_i < R_i$). MIC data arise in clinical follow-up studies where a tumor marker (e.g., CA 125 in ovarian cancer) is available. A patient whose marker value is consistently at the high (or low) end of the normal range in repeated testing is usually under close surveillance for possible relapse. If such a patient should relapse, then the time to clinical relapse can often be accurately determined. However, if a patient is not under close surveillance and would seek assistance only after some tangible symptoms have appeared, then the time to relapse would be subject to case 2 interval censoring. For MIC data, several models have been proposed, and the asymptotic properties of the NPMLE have been investigated under the assumption that either the censoring variables take on finitely many values (see Huang, 1999; Yu et al., 1998, 2000), or the censoring and survival distributions are strictly increasing and continuous and have "positive separation" (see Huang, 1999, Assumption (A3)).
For MIC data, define $(Y_i, Z_i)$ as a pair of extended random censoring times ($\infty$ allowed) with $P(Y_i < Z_i) = 1$, and assume that $T_i$ is independent of $(Y_i, Z_i)$. Yu et al. (2000, see (2.1)) considered a mixture interval censorship model to characterize MIC data as follows:

[...] (2.14)

Replacing $(Y_i, Z_i)$ and $(Y_i, Z_i]$ in (2.14) with $(U_i, V_i)$, we obtain the model for middle-censored data as follows:

[...]

Hence, although the sampling scheme of MIC data seems to be quite different in character from that of the middle-censored data described in Section 1, the resulting observations $(L_i, R_i)$ reduce to the observations from middle-censored data when there is no left or right censoring.

[...]

Proof. Let $Q$ denote the joint distribution function of $(L_i, R_i)$ $(i = 1,\dots,n)$. It follows that Eq. (2.8) can be written as

[...] (2.15)

where $Q_n$ is the empirical version of $Q$. In (2.15), if $F(t) = F(r-) = F(l)$, then we encounter $0/0$ in the integrand. In this case, we define $0/0 = 1$. Notice that Eq. (2.15) is exactly the same as Eq. (2.3) of Yu et al. (2000) and Eq. (2.2) of Yu et al. (2001), which is a self-consistency equation of $F_0$ for the model in Yu et al. (2000, 2001) with mixed interval-censored (MIC) data. By Theorems 2.1, 2.2 and 3.1 of Yu et al. (2001), the strong consistency and asymptotic normality of $\hat F_M$ and $\hat F_S$ are established. □

We have demonstrated how middle-censored data relate to mixed interval-censored data. With some modification of the definition of the intervals $(q_j, p_j)$, we can obtain the NPMLE of the distribution function by using the EM algorithm of Turnbull (1976) or the self-consistent estimating equation (Jammalamadaka and Mangalam, 2003) with a proper initial estimator. The consistency and asymptotic normality of the NPMLE can be established based on the asymptotic properties of self-consistent estimators (SCE) with mixed interval-censored data (Yu et al., 2000, 2001).
A proportional hazards model for arbitrarily censored and truncated data
The two-sample problem with censored data
A note on nonparametric estimation of the distribution function from interval-censored and truncated data
Maximum likelihood for interval censored data: consistency and computation
Computational algorithm for censored-data problems using intersection graphs
Asymptotic properties of nonparametric estimation based on partly interval-censored data
On nonparametric maximum likelihood estimation with interval censoring and truncation
Non-parametric estimation for middle-censored data
Approximate self consistency for middle-censored data
Nonparametric estimation from incomplete observations
An EM algorithm for smoothing the self-consistent estimator of survival functions with interval-censored data
On computation of NPMLE for middle-censored data
Algorithms for computing self-consistent and maximum likelihood estimators with doubly censored data
Experimental survival curves for interval-censored data
A large sample study of generalized maximum likelihood estimators from incomplete data via self-consistency
The empirical distribution with arbitrarily grouped, censored and truncated data
Asymptotic variance of the GMLE of a survival function with interval-censored data
On consistency of the self-consistent estimator of survival function with interval censored data
Asymptotic properties of self-consistent estimators with mixed interval-censored data

The author would like to thank the associate editor and the referees for their helpful and valuable comments and suggestions.