key: cord-0944395-vwkuq1wo
authors: Xiong, Yi; Braun, W. John; Hu, X. Joan
title: Estimating duration distribution aided by auxiliary longitudinal measures in presence of missing time origin
date: 2021-04-05
journal: Lifetime Data Anal
DOI: 10.1007/s10985-021-09520-w
sha: f0b020f8f6119518849293b45cf876d93609bfbd
doc_id: 944395
cord_uid: vwkuq1wo

Abstract: Understanding the distribution of an event duration time is essential in many studies. The exact time to the event is often unavailable, and thus so is the full event duration. By linking relevant longitudinal measures to the event duration, we propose to estimate the duration distribution via the first-hitting-time model (e.g. Lee and Whitmore in Stat Sci 21(4):501–513, 2006). The longitudinal measures are assumed to follow a Wiener process with random drift. We apply a variant of the MCEM algorithm to compute likelihood-based estimators of the parameters in the longitudinal process model. This allows us to adapt the well-known empirical distribution function to estimate the duration distribution in the presence of missing time origin. Estimators with smooth realizations can then be obtained by conventional smoothing techniques. We establish the consistency and weak convergence of the proposed distribution estimator and present its variance estimation. We use a collection of wildland fire records from Alberta, Canada to motivate and illustrate the proposed approach. The finite-sample performance of the proposed estimator is examined by simulation. Viewing the available data as interval-censored times, we show that the proposed estimator can be more efficient than the well-established Turnbull estimator, an alternative that is often applied in such situations. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10985-021-09520-w.

The patterns of event duration times are of primary interest in many research studies.
A close follow-up of each study individual who potentially experiences the event of interest is often unrealistic. Although the occurrence of the event may be reported, its start time is often unavailable, especially with so-called "silent" event occurrence (e.g. Balasubramanian and Lagakos 2003). For example, to predict future fire growth and to allocate suppression resources, it is important to understand the distribution of the duration from the fire start time to the time when the work of suppressing the fire begins, i.e. the initial attack time. A wildland fire is usually reported to the fire management agency by look-out towers or people in the area, and the fire manager then dispatches fire-fighting resources (e.g. Martell 2007; Morin 2014). The exact time when the fire starts is often unknown; what is recorded instead is the time when the fire is reported. As another example, many HIV/AIDS studies are concerned with the duration of HIV infection from the time of infection to the onset of an AIDS event; see, for example, Degruttola et al. (1991) and Doksum and Normand (1995). The infection is often detected considerably later than when it takes place, and thus the exact HIV infection time is usually unavailable. Time to COVID-19 infection is a more recent example of this phenomenon. A practical approach to handling data with missing time origins ignores the reporting delay and performs inference on the duration distribution using the observed portion of the event duration, that is, the duration between the report time and event termination. This naive approach yields clearly biased inference when the gap between the event onset and the report time is nonignorable. Another common approach to handling times with missing time origin is to view the duration as subject to interval-censoring.
The lower limit of the interval is then the length of time that the event has been observed (which we refer to as L*), and the upper limit is the sum of the observed duration and the longest possible reporting delay (which we refer to as R_max). Turnbull's nonparametric maximum likelihood estimator (NPMLE; Turnbull 1976) can then be employed to estimate the distribution of the actual duration (which we refer to as L) using such manufactured interval-censored data. The resulting inference can be unsatisfactory, especially when the longest possible reporting delay is large relative to the observed portion of the event duration. In addition, the interval-censoring is likely informative in many situations, which invalidates Turnbull's estimator. For example, fires can occur at varying distances from fire management resources. This causes the reporting delay S and the observed duration L* to vary together, so L* and S may not be independent. That is, the interval [L*, L* + R_max] provides information on L beyond the fact that L ∈ [L*, L* + R_max]. This paper considers estimation of the event duration distribution with the aforementioned type of event time data under a first-hitting-time model, using longitudinal measures associated with the event. Many studies have readily available longitudinal measures associated with the event of interest. In reliability, for example, Lu and Meeker (1993) use degradation data to estimate the distribution of a failure time, taking the failure time to be the time when the degradation path hits a critical level. The concept of first-hitting-time has been widely applied. Various models have been used to formulate longitudinal measures, such as a Gamma process (e.g. Lawless and Crowder 2004), a Wiener process (e.g. Doksum and Hoyland 1992; Horrocks and Thompson 2004; Lee and Whitmore 2006; Pennell et al. 2009; Choi et al. 2014; Mulatya et al. 2016), and an inverse Gaussian process (e.g. Peng 2015).
We aim to make inference on the population of all reported fires. Most first-hitting-time models formulate the event time via a hypothetical underlying process; a notable exception, to which our modelling is similar, is presented in Mulatya et al. (2016). Using Brownian motion with random drift, we link the readily available longitudinal measures to the recorded times to locate the missing time origin. The dependence between the reporting delay and the observed portion of the duration is handled by conditioning on the random drift. We adapt the empirical distribution function, which requires independent and identically distributed (iid) observations, into an intuitive and easy-to-implement estimator of the distribution based on event durations with missing time origins. Conventional smoothing techniques, such as kernel methods, can be straightforwardly applied to smooth the proposed estimator. A collection of wildland fire records from Alberta, Canada is used to motivate and present the proposed approach. However, potential applications of our approach are broad and not limited to wildland fire management studies. The rest of the paper is organized as follows. Section 2 introduces a model for longitudinal measures of fire burnt area over time and proposes an estimator for the duration distribution aided by the longitudinal model, using duration times with missing origins. It is straightforward to evaluate smoothed versions of the proposed estimator. We present procedures for estimating the parameters that are involved in the longitudinal model and required by the estimator of the duration distribution. We then derive the asymptotic properties of the distribution estimator and its variance estimation. Section 3 reports an analysis of wildland fire records with the proposed approach, and Sect. 4 presents simulation studies conducted to examine the finite-sample performance of the proposed estimator regarding its consistency, efficiency, and robustness.
We also compare the performance of the proposed approach with that of the naive approach and of the Turnbull estimator. Some final remarks are given in Sect. 5. We formulate the aforementioned statistical problem in terms of wildland fire management. Following Parks (1964), Fig. 1 provides a description of the development of a hypothetical wildland fire via the progression of its burnt area over time. The solid curve in the figure represents the burnt area over time of a fire that is subject to suppression after detection. The time point when suppression begins is referred to as the time of initial attack. The dashed curve shows the expected trajectory of the fire's burnt area if it had continued to burn without any suppression or intervention. After ignition, the burnt area grows nonlinearly in time, and can be well approximated as exponential initially. Prior to initial attack, the dashed curve and the solid curve coincide.

[Fig. 1 caption: Hypothetical description of a wildfire progression through fire management phases following Parks (1964). The solid curve represents the burnt area of a fire that is reported and then dispatched with suppression resources. The dashed curve represents the burnt area of the fire had it continued to burn without any suppression.]

The times T_S, T_R, and T_F in Fig. 1 are the calendar times when a fire starts, is reported, and initial attack begins, respectively. The duration time of interest, denoted by L, is the elapsed time from the fire's start to the time of initial attack, i.e., L = T_F − T_S. We are concerned with situations where the true fire start time T_S is unavailable, and thus the time origin of the event duration is missing. Denote the unobservable reporting delay by S = T_R − T_S. The observed portion of the duration is L* = T_F − T_R = L − S, the period between the report time and the initial attack time. Let A(u) be the burnt area at time u since the fire's start, where we assume there is no burnt area at the start time, i.e. A(0) = 0.
Let B = A(S) and D = A(L) − A(S) denote the burnt area at the report time and the increase in area by the initial attack time, respectively. Consider a collection of n independent wildland fires. We assume that the natural logarithm of fire i's burnt area is A_i(u) = g_i(u) + σ_i W_i(u) for i = 1, ..., n, where g_i(u) is a nondecreasing function with g_i(0) = 0 and W_i(·) is the standard Wiener process. As a fire usually grows unhindered until initial attack, we suppose σ_i ≡ σ and use a linear approximation to g_i(u) with random drift ν_i = ν e^{δ_i}, where the constant ν is positive, and δ_i is independent of W_i(·) and follows a distribution φ(·; σ_r) with E[δ_i] = 0 and Var[δ_i] = σ_r². This results in the model considered in this paper:

A_i(u) = ν_i u + σ W_i(u), with ν_i = ν e^{δ_i}.   (1)

The random drift ν_i characterizes the heterogeneity in fire growth among the individual fires. Note that ν_i reduces to a constant drift when σ_r = 0. In the rest of this paper, we assume δ_i ~ N(0, σ_r²) with σ_r ≥ 0. Under the Wiener process model (1) for the burnt area, the reporting delay S_i (the report time since the fire starts) can be viewed as a first-hitting-time: it is the time when fire i's burnt area first reaches the threshold B_i, the burnt area at the report time:

S_i = inf{u ≥ 0 : A_i(u) ≥ B_i}.   (2)

By Chhikara and Folks (1989), the first-hitting-time S_i, given the threshold B_i and the drift ν_i, follows the inverse Gaussian distribution, with cumulative distribution function (CDF)

P(S_i ≤ t | B_i, ν_i) = Φ((ν_i t − B_i)/(σ√t)) + exp(2 ν_i B_i / σ²) Φ(−(ν_i t + B_i)/(σ√t)),

where Φ(·) is the CDF of the standard normal distribution. Denote the observed data by

O = {O_i = (L*_i, B_i, D_i) : i = 1, ..., n}.   (3)

This paper focuses on estimation of F(·), the CDF of the event duration, using the observed data (3) under the following assumptions, which may hold to a reasonable level of approximation in many practical situations. In our wildland fire application, assumption (A2) states that the reporting delay S_i of fire i and the time to initial attack since being reported, L*_i, are independent conditional on the fire spread rate ν_i.
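The first-hitting-time distribution above can be made concrete with a small numerical sketch (ours, not from the paper) of the inverse Gaussian CDF written via Φ, for a Wiener process with drift ν and diffusion σ, started at 0, hitting a fixed threshold b; the parameter values in the example line are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def fht_cdf(t, b, nu, sigma):
    """P(S <= t): distribution of the first time a Wiener process with
    drift nu and diffusion coefficient sigma, started at 0, reaches the
    level b > 0.  This is the inverse Gaussian CDF expressed through the
    standard normal CDF Phi, as in Chhikara and Folks (1989)."""
    t = np.asarray(t, dtype=float)
    sqrt_t = np.sqrt(t)
    term1 = norm.cdf((nu * t - b) / (sigma * sqrt_t))
    term2 = np.exp(2.0 * nu * b / sigma**2) * norm.cdf(-(nu * t + b) / (sigma * sqrt_t))
    return term1 + term2

# Example: probability the threshold b = 1 is reached by time t = 1
# under unit drift and diffusion.
print(fht_cdf(1.0, b=1.0, nu=1.0, sigma=1.0))
```

The hitting time of level b is inverse Gaussian with mean b/ν and shape (b/σ)², which gives a quick cross-check against a library implementation.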
This is plausible since the fire agency often assesses a reported fire's spread rate and then arranges the initial attack accordingly; that is, L*_i likely depends on S_i solely through ν_i. If all duration times L_i for i = 1, ..., n were observed, the empirical distribution function, the nonparametric maximum likelihood estimator (NPMLE) based on the iid observations, could be applied to estimate the duration distribution: F_n(t) = n^{-1} Σ_{i=1}^n I(L_i ≤ t), where I(E) is the indicator function of the event E. Since a fire is usually reported after a delay, only L*_i, a portion of L_i, is observed, and L_i ∈ [L*_i, L*_i + R_max] with R_max the longest possible reporting delay, as discussed in the existing literature. The current observations might then be cast as interval-censored event times. Turnbull's self-consistent estimator (Turnbull 1976) can then be used to estimate the distribution from the interval-censored observations. Let 0 = τ_0 < τ_1 < ... < τ_Q be the ordered unique values of the interval endpoints {L*_i} ∪ {L*_i + R_max}. Following Klein and Moeschberger (2003), the Turnbull estimator is the solution to the self-consistency equations

p_q = n^{-1} Σ_{i=1}^n α_iq p_q / (Σ_{q'=1}^Q α_iq' p_{q'}),   for q = 1, ..., Q,

where p_q is the probability mass assigned to (τ_{q−1}, τ_q] and α_iq = I((τ_{q−1}, τ_q] ⊆ [L*_i, L*_i + R_max]). However, the Turnbull estimator may not perform very well in the situations of particular interest in this paper. Note that the Turnbull estimator is not uniquely defined over the whole positive real line, but only up to an equivalence class of distributions that may differ over the gaps, i.e. the innermost intervals. Since R_max is large relative to L*_i in our application, the data form a relatively small number of innermost intervals and thus often give a rather uninformative estimate. Moreover, the interval-censoring mechanism in wildland fire studies may be informative, since the observed L*_i often depends on the reporting delay S_i through the fire spread rate ν_i. These considerations motivate us to propose an alternative estimator of the duration distribution F(·) using the available observations on the burnt-area process, which contain information related to the reporting delay.
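The self-consistency iteration for the Turnbull estimator can be sketched as follows (a generic Python implementation for observation intervals [left_i, right_i] with left_i < right_i, not tied to the paper's data):

```python
import numpy as np

def turnbull(left, right, tol=1e-8, max_iter=10000):
    """Self-consistent (EM) estimate of the masses p_q placed on the cells
    (tau_{q-1}, tau_q] formed by the unique endpoints of the observation
    intervals [left_i, right_i] (assumed left_i < right_i).  Returns the
    right endpoints tau_1..tau_Q and the CDF values there."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    tau = np.unique(np.concatenate([left, right]))
    mid = (tau[:-1] + tau[1:]) / 2.0
    # alpha[i, q] = 1 iff cell q lies entirely inside interval i
    # (cell midpoints suffice because all endpoints are on the tau grid).
    alpha = (left[:, None] <= mid[None, :]) & (mid[None, :] <= right[:, None])
    n, Q = alpha.shape
    p = np.full(Q, 1.0 / Q)
    for _ in range(max_iter):
        w = alpha * p                       # unnormalized membership weights
        w /= w.sum(axis=1, keepdims=True)   # E-step: P(cell q | interval i)
        p_new = w.sum(axis=0) / n           # M-step: average the weights
        done = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if done:
            break
    return tau[1:], np.cumsum(p)
```

For instance, with the two intervals [0, 1] and [0, 2], the iteration drives all the mass into the cell (0, 1], which is the NPMLE for those data.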
By Model (1) and Assumption (A2),

F(t) = E[ P(L_i ≤ t | δ_i, O_i) ],   (4)

where the outer expectation E is with respect to the conditional distribution of δ_i given the observed data associated with fire i as specified in (3). The consideration above suggests the following estimator, provided that the parameters ν, σ, σ_r are known:

F̂_n(t; ν, σ, σ_r) = n^{-1} Σ_{i=1}^n E[ P(S_i + L*_i ≤ t | δ_i, O_i) | O_i ].   (5)

We propose to replace the parameters in (5) with their consistent estimators based on the available data. This results in a feasible distribution estimator,

F̂_n(t) = F̂_n(t; ν̂, σ̂, σ̂_r),   (6)

abbreviated F̂_n(t) in the rest of this paper. In Sect. 2.3, we present procedures for consistently estimating the parameters ν, σ, σ_r. To compute (6) numerically, one may approximate F̂_n(t) with

n^{-1} Σ_{i=1}^n J^{-1} Σ_{j=1}^J P(S_i + L*_i ≤ t | δ_i^{(j)}, O_i),   (7)

where the δ_i^{(j)}, j = 1, ..., J, are draws from the conditional distribution of δ_i given O_i. The proposed estimator in (5) is adapted from the empirical distribution function. Analogously, we can obtain a smoothed distribution estimator of F(·) by adapting the kernel distribution estimator based on fully observed durations. Recall that a kernel distribution estimator with all durations observed is F_{n,h}(t) = n^{-1} Σ_{i=1}^n K((t − L_i)/h), with K(·) the integrated kernel function and h the bandwidth (e.g., Rosenblatt 1956). Its projection onto the available data space yields the smoothed estimator F̂_{n,h}(t).   (8)

When no random effect is involved and S_i is assumed to be uniformly distributed over [0, R_max], the estimator F̂_{n,h}(t) reduces to the one discussed in Braun et al. (2005). Since the choice of the bandwidth h is still under investigation, we focus on the estimator F̂_n(t) given in (6) for the rest of the paper. The log-likelihood function based on the available data is

log L_obs(θ; Observed-data) = Σ_{i=1}^n ℓ_i(θ),   (9)

where ℓ_i(θ) denotes the contribution from fire i. We estimate θ by maximizing log L_obs(θ; Observed-data), and denote the resulting estimator by θ̂_n. One can use θ̂_n together with the collection of δ^{(1)}, ..., δ^{(J)} from the last iteration to compute (7) and (8) and so obtain F̂_n(·) and F̂_{n,h}(·), respectively. Here δ^{(j)} for j = 1, ..., J are the n-dimensional vectors with i-th components δ_i^{(j)}. We apply the MCEM algorithm (Wei and Tanner 1990) to compute the MLE, and present the details in Algorithm A below.
The log-likelihood function of θ based on the observed data (3) is given in (9). E-step. At iteration m + 1, generate δ^{(j)} for j = 1, ..., J, where each component δ_i^{(j)} is generated from the conditional distribution of δ_i given the observed data with the current parameter estimate θ^{(m)}, and approximate the conditional expectation of the complete-data log-likelihood by the Monte Carlo average (10). M-step. Maximize (10) with respect to θ to obtain θ^{(m+1)}. Repeat Steps E and M until ||θ^{(m+1)} − θ^{(m)}|| < ε for a pre-specified tolerance ε. The limit of the sequence {θ^{(m)} : m = 1, 2, ...} is the MLE θ̂_n. We follow the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) to generate samples from the conditional distribution in (11). The details are provided in Sect. S1.1 of the Supplementary Material. One should also note that part of the likelihood contribution carries little additional information on the parameters ν, σ. To ease the computational burden, one may therefore replace (10) with the simplified objective (12) to update θ^{(m)}. The maximizing procedure based on (12) leads to a variant of Algorithm A and results in θ̃_n, a close approximation to the MLE θ̂_n. For the numerical studies presented in this paper, we choose J = 200, with which the algorithm converges. We may obtain another estimator by maximizing the conditional likelihood using only the observations on D and L*; the corresponding log conditional likelihood function is given in (13). The estimator obtained by maximizing (13) is likely less efficient but easier to implement. We describe the procedure to obtain the maximizer of (13) in Sect. S1.2 of the Supplementary Material. We refer to this second algorithm as Algorithm B and denote the estimators obtained from the two algorithms by θ̂_nA and θ̂_nB for the remainder of the paper. The estimate obtained from Algorithm B may be used as the initial estimate θ^{(0)} for Algorithm A. The proposed estimator F̂_n(t) using θ̂_nA from Algorithm A in Sect. 2.3 has the following asymptotic properties. Theorem 1. Under Model (1), the regularity conditions, and the log-likelihood (9), the estimator F̂_n(t) is consistent, and √n(F̂_n(t) − F(t)) converges weakly to a mean-zero Gaussian process G(t) with covariance function given in (14), where θ_0 is the true parameter and the covariance involves the joint probability density function of L*_i and B_i. A proof of Theorem 1 is outlined in the Appendix.
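The E- and M-step structure above can be illustrated on a toy missing-data problem (this is an illustration of the MCEM mechanics only, not the paper's fire model): X_i ~ N(θ, 1) is recorded only when positive, and the E-step replaces each unobserved value with J Monte Carlo draws from its conditional (truncated normal) distribution; the M-step has a closed form because the complete-data MLE of θ is a sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: X_i ~ N(theta0, 1); X_i is recorded only when X_i > 0,
# otherwise we merely learn that X_i <= 0 (the "missing" part).
theta0 = 0.8
x = rng.normal(theta0, 1.0, size=500)
obs = x[x > 0]
n_miss = int(np.sum(x <= 0))

def mcem(obs, n_miss, J=200, n_iter=50):
    """Monte Carlo EM for theta.  E-step: draw J completions of every
    missing value from N(theta, 1) truncated to (-inf, 0], by rejection
    sampling.  M-step: the complete-data MLE of theta is the overall
    mean, with each missing value replaced by the mean of its J draws."""
    theta = obs.mean()  # crude (upward-biased) starting value
    for _ in range(n_iter):
        need = J * n_miss
        draws = np.empty(0)
        while draws.size < need:           # rejection sampling
            z = rng.normal(theta, 1.0, size=4 * need)
            draws = np.concatenate([draws, z[z <= 0.0]])
        draws = draws[:need]
        theta = (obs.sum() + draws.sum() / J) / (obs.size + n_miss)
    return theta

print(mcem(obs, n_miss))
```

As in the paper's Algorithm A, the Monte Carlo size J trades accuracy of the E-step against computing time; unlike the paper, this toy model needs no Metropolis-Hastings step because the conditional distribution can be sampled directly.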
A consistent estimator of the covariance function in (14) results from substituting its unknown elements with the following estimators. The conditional expectations involved can be approximated using the draws δ_i^{(1)}, ..., δ_i^{(K)} obtained from the last iteration of Algorithm A in Sect. 2.3; we may similarly approximate E_{θ_0}[∂M(t, L*_i, B_i; θ)/∂θ]. Further, note that Π_n(θ_0) = −n^{-1} ∂² log L_obs(θ; Observed-data)/∂θ∂θ' converges in probability to Π(θ_0) = Σ(θ_0), and so does Σ_n(θ_0) = n^{-1} Var_{θ_0}[∂ log L_obs(θ; Observed-data)/∂θ]. Thus, either Π_n(θ̂_nA) or Σ_n(θ̂_nA) can be used to estimate Π(θ_0) = Σ(θ_0). Based on Theorem 1, the process √n(F̂_n(t) − F(t))/√(var(t)) converges weakly to the standard Gaussian process, where var(t) is Cov(G(t), G(s)) given in (14) with t = s. Denote by v̂ar(t) the consistent estimator of var(t) obtained as described in Sect. 2.4. We employ the resampling approach of Hu and Lagakos (1999) and Zhao et al. (2008) to construct the following confidence band (CB) for the distribution F(·): the (1 − α)-level CB is F̂_n(t) ± c_α √(v̂ar(t)/n), where the critical value c_α is determined by a resampling scheme with Z_1, ..., Z_n ~ N(0, 1) iid and independent of the data. We compute c_α as follows. Step (i). Generate M sets of independent realizations of (Z_1, ..., Z_n) and, with each set, compute the supremum of the corresponding perturbed process. Step (ii). Take c_α to be the (1 − α) empirical quantile of the M supremum values. We now apply the proposed approach to analyze the wildland fire data that motivated this research. Alberta Agriculture and Forestry collected records of 603 lightning-caused fires that occurred in 10 wildland fire management areas of Alberta, Canada during the fire season from May to August in 2004. Each fire record contains the fire progression information: the times and the fire burnt area at the time of report and at the time of initial attack. As expected, the records do not include the exact fire start times. Figure 2 shows the burnt area at the report times and at the initial attack times for the different regions.
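The resampling step for c_α can be sketched generically (our sketch, with assumed inputs): suppose one has an (n × T) array of estimated influence-function values ψ_i(t_k) such that √n(F̂_n − F)(t) behaves like n^{-1/2} Σ_i ψ_i(t); perturbing with iid N(0, 1) multipliers Z_i and taking the supremum of the standardized process gives the critical value.

```python
import numpy as np

rng = np.random.default_rng(7)

def multiplier_band_critical_value(influence, M=1000, alpha=0.05):
    """Critical value c_alpha for a simultaneous band, via multiplier
    resampling.  `influence` is an (n, T) array of estimated influence
    values psi_i(t_k); each replicate perturbs them with iid standard
    normal multipliers and records the sup of the standardized process."""
    n = influence.shape[0]
    sd = influence.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                       # guard against degenerate grid points
    sups = np.empty(M)
    for m in range(M):
        z = rng.standard_normal(n)
        g = z @ influence / np.sqrt(n)      # perturbed process on the grid
        sups[m] = np.max(np.abs(g) / sd)
    return np.quantile(sups, 1 - alpha)
```

As a sanity check, with a single grid point and iid standard normal influence values, the standardized statistic is approximately |N(0, 1)|, so the returned 95% critical value should be near 1.96, the pointwise normal quantile.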
The 10 Alberta wildland fire management areas are classified into two regions. Table S1 in Sect. S2 of the Supplementary Material summarizes the burnt area for the two regions at the report times, the initial attack times, and the times when the fires were extinguished. Fires in the north region tend to have larger burnt areas at the times of report and initial attack. The distributions of the burnt area are skewed, so we use the transformed version log10(burnt area + 1) in the analysis. The time of initial attack is when the first fire-fighting resource arrives at a wildland fire to prevent the fire from spreading, and to extinguish it if possible. It is believed that fires with a delayed initial attack may require a more substantial suppression effort. Using the proposed approach, we estimate the distribution of the duration from the start of a fire to its initial attack. We consider two cases of Model (1): (i) σ_r = 0 (i.e., ν_i = ν for i = 1, 2, ..., n), and (ii) σ_r ≥ 0. Table 1 presents the parameter estimates and the corresponding standard error estimates obtained by Algorithms A and B in Sect. 2.3. The standard errors are estimated using both the inverse Fisher information matrix and the sandwich variance estimator. We also provide computing times for each algorithm in Table 1. Algorithm B is computationally faster than Algorithm A, but it yields a less efficient estimator, as its estimated standard errors are larger than those of Algorithm A. The estimates of σ_r for the model with random drift are quite large, indicating considerable variation among the fires. This could be because fire spread rates depend on location and local weather. We estimate the distribution of the duration by substituting the estimated model parameters into (7), and obtain the smoothed estimator based on (8).
For comparison, we also evaluated the empirical distribution function based on the observed event durations (the naive estimator) and the Turnbull estimator, viewing the fire data as interval-censored with L_i ∈ [L*_i, L*_i + R_max]. We set R_max = 6, 12 or 48 hours for illustration; in fact, R_max could be up to 2 weeks (Wotton and Martell 2005). Figure 3 presents the estimated distributions for the times to initial attack with Algorithms A and B, together with approximate 95% pointwise confidence intervals (CIs) calculated using the estimated asymptotic variance given in (14). Figure 3 shows that the naive distribution estimate and Turnbull's estimates differ from the proposed estimates. Turnbull's estimates deviate more from the proposed estimates as R_max increases. This is because a larger R_max leads to a wider interval [L*_i, L*_i + R_max] for L_i; as a result, there are fewer disjoint innermost intervals within which the survivor function estimated by Turnbull's method can jump. Comparing the estimates from the two algorithms, we see that Algorithm A produces a more efficient estimator. We also evaluated the kernel-smoothed estimator (8) presented in Sect. 2.2. The distribution estimates and their corresponding 95% CIs/CBs are in close agreement with the un-smoothed estimates. Figure 4 presents scatterplots of the final burnt area vs the estimated duration times. The estimated duration is calculated by L̂_i = L*_i + Ŝ_i, where Ŝ_i is generated from the posterior distribution of the reporting delay [S | δ, O_i; θ^{(m)}] at the last iteration of the MCEM procedure in Algorithm A, for i = 1, ..., n. We present scatterplots using three realizations of L̂_i together with the scatterplot in Fig. 4a using the observed portion of the duration L*_i. The pattern of association between the final burnt area and the duration is markedly clearer with the estimated duration.
This suggests that the duration between fire start and initial attack may be more predictive of the final burnt area; accounting for the reporting delay is therefore worthwhile when using the duration as a predictor of the final burnt area. We applied the proposed procedure to analyze the data of fires from the north region and from the south region separately. Table 2 gives the model parameter estimates, and Fig. 5 shows the estimated duration distributions. The estimate of σ_r associated with the north region is large, significantly different from zero; it indicates a larger variation across fires in that region. The south region has a smaller estimate of σ_r. We conducted two simulation studies to examine the finite-sample performance of the proposed approach and to verify the findings from the data analysis. Specifically, in the first simulation study, we generated data based on Model (1) to verify consistency and efficiency, and in the second simulation study, we assessed the robustness of the approach against model misspecification. To mimic the fire data, we simulated a study with n = 300 independent fires, with the data of fire i for i = 1, 2, ..., n generated as follows.

[Fig. 4 caption: Ŝ_i is generated from [S | δ, O_i; θ^{(m)}] at the last iteration of the MCEM procedure in Algorithm A, for i = 1, ..., n. The solid line represents the fitted linear regression of the final burnt area on the time to initial attack; the dashed curve is the local regression curve. The shaded areas define the corresponding approximate 95% confidence bands.]

(iii) Generate L*_i ~ Exp(3.0 B_i^{-1}), calculate the duration L_i = S_i + L*_i, and obtain the burnt area at the time of initial attack D_i = A_i(L_i). Using the simulated data, we evaluated the estimator θ̃_nA, the approximation to θ̂_nA, by the variant of Algorithm A in Sect. 2.3. We then obtained the corresponding duration distribution estimates. Table 3 summarizes the parameter estimates based on 200 simulation repetitions.
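Step (iii) above, combined with Model (1), can be sketched in Python. In this sketch the threshold b for the log burnt area at report time is taken as a fixed constant, and Exp(3.0 B^{-1}) is read as an exponential with rate 3.0/B; both are our assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_fires(n=300, nu=1.0, sigma=1.0, sigma_r=0.5, b=2.0):
    """Sketch of the simulation scheme: random drifts nu_i = nu*exp(delta_i),
    reporting delays S_i as first hitting times of the level b, observed
    durations L*_i, and burnt-area increments D_i between report and
    initial attack.  Returns (L_star, B, D) plus the latent full durations L."""
    delta = rng.normal(0.0, sigma_r, n)
    nu_i = nu * np.exp(delta)                  # random drift per fire
    # S_i: first time the drifted Wiener process hits b, i.e. an inverse
    # Gaussian draw with mean b/nu_i and shape (b/sigma)^2.
    S = rng.wald(b / nu_i, (b / sigma) ** 2)
    L_star = rng.exponential(b / 3.0, n)       # observed part: rate 3/b assumed
    L = S + L_star                             # full (latent) duration
    # D_i: increment of the log burnt-area process over (S_i, L_i].
    D = nu_i * L_star + sigma * rng.normal(0.0, np.sqrt(L_star), n)
    return L_star, np.full(n, b), D, L
```

The latent L returned here is what the empirical distribution function F_n(·) would use as a benchmark, while (L_star, B, D) plays the role of the observed data (3).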
The sample means of the estimates obtained under the Wiener process model with constant drift are close to the true parameter values in the scenario σ_r = 0; the bias is evident when the true value of σ_r increases to 0.5 and 0.8. When we use the model with random drift, i.e. σ_r ≥ 0, the sample means of the estimates are close to the true parameter values in all three scenarios of σ_r. This provides an empirical verification of the consistency of the two estimators, and suggests that it is advisable not to impose σ_r = 0 in practice. Further, we estimated θ by maximizing the conditional likelihood given in (13), which uses only the observations on D and L*. The results are presented in Table S2 of Sect. S3.1 of the Supplementary Material. While the parameter estimates are similar to those obtained by Algorithm A, the sample means of the estimated standard errors are larger; we conclude that maximizing (13) may yield less efficient estimators. The sample means of the estimated standard errors associated with the robust sandwich variance estimator are similar to the corresponding sample standard deviations of the estimates for both algorithms, which suggests that the proposed variance estimator performs sufficiently well at the simulation settings. For each generated data set, we estimated the duration distribution by F̂_n(t) using θ̂ obtained from Algorithm A, and used F̂_n(t; ν, σ, σ_r) given in (5) with the true parameter values as a reference. The consistent variance estimator of (14) given in Appendix C was evaluated to construct confidence intervals (CIs). Assuming the drift of the Wiener process involves random effects, Fig. 6 shows the sample means of the 200 estimated distribution functions, together with the conventional approximate 95% CIs and their 2.5% and 97.5% sample quantiles. To make a comparison, each plot in Fig.
6 also includes the sample means of the 200 evaluations of the empirical distribution function F_n(·) using the true durations, the empirical distribution function F*_n(·) using the observed duration times (the naive approach), and the Turnbull estimator with R_max set to the third quartile and to the maximum of the reporting delays in each generated data set. The estimate associated with the proposed approach is very close to that based on F̂_n(t; ν, σ, σ_r) using the true θ. At all simulation settings, both the approximate 95% CIs and the CIs using the 2.5% and 97.5% sample quantiles contain the empirical distribution function F_n(·) obtained with the true durations. The naive estimates and Turnbull's estimates appear to be different from F_n(·). The Turnbull estimator is highly dependent on the assumed value of R_max; when R_max is much larger than L*_i, its performance deteriorates substantially. Histograms of the realizations of L*_i presented in Fig. S1 of Sect. S3.1 of the Supplementary Material support this finding. In the scenario where the true value of σ_r is 0, the two values of R_max are relatively small and Turnbull's estimates are close to the proposed estimate, as shown in Fig. 6. When the true value of σ_r increases, R_max chosen as the maximum of the reporting delays in the generated dataset becomes much greater than the maximum of L*_i, and the corresponding Turnbull's estimates depart much further from the proposed estimates. This is consistent with what was seen in the data analysis. Moreover, we evaluated the distribution estimator using θ̂_nB obtained from Algorithm B (see Fig. S2 of Sect. S3.1 in the Supplementary Material) and the kernel-smoothed version of the proposed estimator. The behavior of these two estimates in comparison with the naive estimates and Turnbull's estimates is similar to that observed in Fig. 6.
We computed the pointwise sample mean squared errors of Turnbull's estimates, the proposed estimates, and the reference estimates based on F̂_n(t; ν, σ, σ_r). For any t ≥ 0, the proposed estimator has the smallest sample mean squared error, which demonstrates the relative efficiency of the proposed estimator over the naive estimator and the Turnbull estimator at all simulation settings. Figure S3 in Sect. S3.1 of the Supplementary Material presents the sample standard deviations and sample means of the estimated standard errors of the proposed distribution estimator with θ̂_nA by Algorithm A, together with those associated with the empirical distribution function and F̂_n(·; ν, σ, σ_r), both of which require more information than the data structure of interest provides. The plots in the figure show that the variation of the proposed estimator is comparable to that of F̂_n(·; ν, σ, σ_r), and is, in some settings, smaller than that of the empirical distribution function. This indicates that using the available information on fire growth can recover the efficiency lost due to the missing start times, and in some situations even outperform the empirical distribution function, a nonparametric estimator for the duration distribution. We generated burnt-area sample paths for a collection of simulated independent fires following the model A_i(t) = ν_i t + σ_i W*_i(t), i = 1, 2, ..., n = 300, where ν_i = ν e^{δ_i} with δ_i ~ N(0, σ_r²) and W*_i(·) is a process with correlated increments. Specifically, the increments W*_i(t_k) − W*_i(t_{k−1}) over the partition t_k, k = 1, 2, ..., K, of the time interval were generated to be serially correlated. We computed the θ estimates and then the duration distribution estimates as if the data were generated from Model (1). Table S3 in Sect. S3.2 of the Supplementary Material summarizes the simulation outcomes for 200 replicates of the estimates. The sample means of the θ estimates are close to the true parameter values under the assumed Wiener process model with random drift.
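A process W* with serially correlated increments, as used in the robustness scenario, can be emulated with AR(1)-correlated Gaussian increments. The AR(1) scheme and its parameters are our assumption for illustration; the paper's exact construction is in its Supplementary Material.

```python
import numpy as np

rng = np.random.default_rng(3)

def correlated_increment_path(T=10.0, K=200, rho=0.5):
    """One path of W*(t) on a grid of K steps over [0, T].  The Gaussian
    increments follow an AR(1) recursion with correlation rho, scaled so
    the stationary variance of each increment stays dt, matching a
    standard Wiener increment marginally while breaking independence."""
    dt = T / K
    eps = rng.normal(0.0, np.sqrt(dt), K)
    inc = np.empty(K)
    inc[0] = eps[0]
    for k in range(1, K):
        inc[k] = rho * inc[k - 1] + np.sqrt(1.0 - rho**2) * eps[k]
    t = np.linspace(dt, T, K)
    return t, np.cumsum(inc)
```

Setting rho = 0 recovers independent increments, i.e. a discretized standard Wiener process, which makes this a convenient knob for misspecification studies.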
The sample means of the estimated standard errors from the robust estimator are fairly close to the sample standard deviations of the θ estimates. Figure S4 in Sect. S3.2 of the Supplementary Material presents the sample means of the estimated distribution functions from both Algorithms A and B, with the approximate 95% CIs and their 2.5% and 97.5% quantiles. In each plot, we also overlaid the sample means of the estimates by the empirical distribution function, F̂_n(·; ν, σ, σ_r), the naive estimator, and the Turnbull estimator. These plots indicate that the proposed estimator is close to the empirical distribution function in this situation, even though Model (1) does not hold. We also explored scenarios where the burnt-area process is generated following other models, such as A_i(t) = ν_i t² + σ_i W_i(t). The estimated duration distribution based on the proposed approach assuming Model (1) is again close to the empirical distribution function using all the true duration times. This indicates that the proposed estimator F̂_n(·) can be quite robust to model misspecification. Further investigation could lead to a way of systematically checking the validity of Model (1). We propose in this paper procedures for estimating the distribution of an event duration in the presence of missing time origins. By employing the distribution of the first-hitting-time of a Wiener process, we link the distribution of the event duration with associated longitudinal measures. Both the simulation and the real data analysis show that the proposed approach performs well in estimating the distribution of times to initial attack, and they demonstrate the importance of taking into account the delay between the unobserved start time and the later report time. The proposed approach is applicable to many situations where the event duration is of interest but the time origins of the duration observations are missing.
Examples include predicting the length of the period from unknown HIV infection to detection of the infection by making use of longitudinal viral load measures (Doksum and Normand 1995), predicting the lifetime of trees by using longitudinal measures of diameter at breast height (Thompson 2011), and, as suggested by a referee, estimating the onset time of a disease by utilizing longitudinal medical expenditure data such as the usage of prescription drugs and the cost of skilled nursing facilities. The idea underlying our approach could readily be applied to analyses under a different model for the longitudinal measures, e.g. Wang (2008); Wang and Xu (2010). It would be worth exploring the validity of the stochastic process assumed for the longitudinal measures. Several other investigations would also be worthwhile. The target population in the wildland fire application of this paper includes only the fires that were reported and dispatched with initial attack resources. When a study aims to explore the whole physical development process of wildland fire, the fires that were not reported should also be included in the population under consideration; the currently available wildland fire records are then length-biased. We suggest extending the idea of the proposed approach and adapting methods for estimating distributions with right-censored event times subject to length-biased sampling (e.g. Asgharian and Wolfson 2005; Huang and Qin 2011) to the situation where the origins of the duration times are missing. Heterogeneity and correlation between fires should also be accounted for. Applying the proposed approach to the data stratified by fire region has revealed that the event duration distributions differ across regions; see Table 2 and Fig. 5, for example. The duration is likely related to fuel type and moisture content as well as wind activity and local topography.
To deal with this, as discussed in Wang (2010), we could follow Lawless and Crowder (2004) and specify the drift parameter \(\nu_i\) of Model (1) as a function of covariates. In addition, due to potential correlation between wildland fires, it would be of interest to extend the approach to account for spatio-temporal correlation. A third possibility is to follow Heitjan and Rubin (1990) to accommodate semi-continuous data such as the rounded burnt-area records shown in Table S1 of Sect. S2 of the Supplementary Material. More investigation is required to systematically determine \(J\), the number of Monte Carlo samples in Algorithm A. We could incorporate automated data-driven strategies (e.g. Levine and Casella 2001; Caffo et al. 2005) into the current algorithm to choose an appropriate \(J\) at each iteration. This paper evaluates integrals by Monte Carlo integration. As suggested by a referee, it could be interesting to compare this integral approximation with other numerical integration approaches, such as Gaussian quadrature rules. This appendix outlines our derivation of the asymptotic properties of the estimator \(\hat F_n(t)\) in Sect. 2.4. To simplify the notation, we define \(\theta = (\nu, \sigma, \sigma_r)\) and use \(\theta_0\) to denote the true parameter values. Denote \(\tilde F_n(t; \nu_0, \sigma_0, \sigma_{r0})\) by \(\tilde F_n(t)\). We focus on the realization of the proposed estimator \(\hat F_n(\cdot)\) using \(\hat\theta_{nA}\) obtained by Algorithm A; the derivation can be adapted to handle \(\hat F_n(\cdot)\) using \(\hat\theta_{nB}\) by Algorithm B with little change. The following are the regularity conditions assumed for Model (1). We adapt the regularity conditions for the MLE summarized by Serfling (1980). Condition (C1). The parameter space \(\Theta\) is compact and the true parameter \(\theta_0\) is an interior point of \(\Theta\). Condition (C2). The first, second, and third partial derivatives of the log-likelihood function given in (9) with respect to \(\theta\) exist for each \(\theta \in \Theta\). Condition (C3).
Differentiation and integration are interchangeable for the first, second, and third partial derivatives of the log-likelihood function with respect to \(\theta\). Condition (C4). The third partial derivative of the log-likelihood function with respect to \(\theta\) is dominated by a fixed integrable function \(M_2(\cdot)\) for every \(\theta \in \Theta\). Condition (C5). \(\Sigma(\theta_0) = E_{\theta_0}\big[\frac{\partial \log L_{obs}(\theta; O_i)}{\partial \theta}\,\frac{\partial \log L_{obs}(\theta; O_i)}{\partial \theta'}\big]\) exists and is positive definite, where \(\log L_{obs}(\theta; O_i)\) is the \(i\)th term in (9); it coincides with \(\Pi(\theta_0) = -E_{\theta_0}\big[\frac{\partial^2 \log L_{obs}(\theta; O_i)}{\partial \theta \,\partial \theta'}\big]\). The third term on the RHS of (17) can be written as \(\sqrt{n}(\hat\theta_{nA} - \theta_0)'\, E_{\theta_0}\big[\frac{\partial M(t, L^*_i, B_i; \theta)}{\partial \theta}\big] + o_p(1)\), in which \(\sqrt{n}(\hat\theta_{nA} - \theta_0) = \sqrt{n}(P_n - P)\,\Pi^{-1}(\theta_0)\,\frac{\partial \log L_{obs}(\theta; O_i)}{\partial \theta}\big|_{\theta = \theta_0} + o_p(1)\). Therefore \(\hat F_n(t)\) is asymptotically linear, and since its influence function is P-Donsker, \(\sqrt{n}(\hat F_n(t) - F(t))\) converges weakly to a tight, mean-zero Gaussian process \(\mathbb{G}\) as \(n \to \infty\). The covariance of the limiting Gaussian process of \(\sqrt{n}(\hat F_n(t) - F(t))\) is as given in (14) in Theorem 1 (ii). Further, one can use \(n^{-1}\sum_{i=1}^n \frac{\partial M(t, L^*_i, B_i; \theta)}{\partial \theta}\big|_{\theta = \hat\theta_{nA}}\) to estimate \(E_{\theta_0}\big[\frac{\partial M(t, L^*_i, B_i; \theta)}{\partial \theta}\big]\) and use \(\Pi(\hat\theta_{nA})\) to estimate \(\Pi(\theta_0)\).
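The plug-in step at the end, estimating \(\Pi(\theta_0)\) by \(\Pi(\hat\theta_{nA})\), can be carried out numerically when closed-form second derivatives are inconvenient. The sketch below computes the average negative Hessian by finite differences for an illustrative log-likelihood (a simple normal model standing in for the observed-data log-likelihood (9), which is not reproduced here).

```python
import numpy as np

def loglik_i(theta, x):
    """Illustrative stand-in for log L_obs(theta; O_i): the log-density of
    N(mu, s^2) with theta = (mu, log s)."""
    mu, log_s = theta
    s = np.exp(log_s)
    return -0.5 * np.log(2.0 * np.pi) - log_s - 0.5 * ((x - mu) / s) ** 2

def pi_hat(theta, data, h=1e-4):
    """Estimate Pi(theta), the average negative Hessian of the
    log-likelihood, by central finite differences."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    f = lambda th: np.mean(loglik_i(th, data))
    H = np.zeros((p, p))
    for a in range(p):
        for b in range(p):
            ea = np.eye(p)[a] * h
            eb = np.eye(p)[b] * h
            H[a, b] = (f(theta + ea + eb) - f(theta + ea - eb)
                       - f(theta - ea + eb) + f(theta - ea - eb)) / (4.0 * h * h)
    return -H  # sample analogue of Pi(theta)

rng = np.random.default_rng(3)
data = rng.normal(2.0, 1.5, size=5000)
theta_mle = np.array([data.mean(), np.log(data.std())])
Pi = pi_hat(theta_mle, data)   # approximately diag(1/s^2, 2) at the MLE
```

For this parameterization the information matrix at the MLE is diagonal with entries \(1/s^2\) and \(2\), which the finite-difference estimate recovers to high accuracy; the same routine applies to any smooth per-observation log-likelihood.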
References
Asymptotic behavior of the unconditional NPMLE of the length-biased survivor function from right-censored prevalent cohort data
Estimation of a failure time distribution based on imperfect diagnostic tests
Local likelihood density estimation for interval-censored data
Ascent-based Monte Carlo expectation-maximization
A semiparametric inverse-Gaussian model and inference for survival data with a cured proportion
Modeling the progression of HIV infection
Models for variable-stress accelerated life testing experiments based on Wiener processes and the inverse Gaussian distribution
Gaussian models for degradation processes, Part I: methods for the analysis of biomarker data
Monte Carlo sampling methods using Markov chains and their applications
Inference from coarse data via multiple imputation with application to age heaping
Modeling event times with multiple outcomes using the Wiener process with drift
Interim analyses using repeated confidence bands
Nonparametric estimation for length-biased and right-censored data
Covariates and random effects in a gamma process model with application to degradation and failure
Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary
Implementations of the Monte Carlo EM algorithm
Using degradation measures to estimate a time-to-failure distribution
Forest fire management
Equation of state calculations by fast computing machines
Estimating time to event characteristics via longitudinal threshold regression models: an application to cervical dilation progression
Development and application of a model for suppression of forest fires
Inverse Gaussian processes with random effects and explanatory variables for degradation data
Bayesian random-effects threshold regression with application to survival data with nonproportional hazards
Remarks on some nonparametric estimates of a density function
Survival models for data arising from multiphase hazards, latent subgroups or subject to time-dependent treatment effects
The empirical distribution function with arbitrarily grouped, censored and truncated data
A pseudo-likelihood estimation method for the nonhomogeneous gamma process model with random effects
Wiener processes with random effects for degradation data
An inverse Gaussian process model for degradation data
A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms
A lightning fire occurrence model for Ontario
Statistical monitoring of clinical trials with multivariate response and/or multiple arms: a flexible approach
Acknowledgements We are grateful to N. McLoughlin of Alberta Wildfire Management Branch for providing the wildland fire data. We thank G.A. Whitmore for his many constructive suggestions, and the associate editor and two referees for their helpful comments and suggestions. The research was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) and a CRT grant from the Canadian Statistical Sciences Institute (CANSSI).
To establish the consistency of the proposed estimator, we first note that the MLE \(\hat\theta_{nA}\) is consistent and asymptotically normal: under (C1)-(C5), \(\hat\theta_{nA} \to \theta_0\) almost surely and \(\sqrt{n}(\hat\theta_{nA} - \theta_0)\) converges in distribution to \(N(0, \Pi^{-1}(\theta_0))\). Let \(P_n\) and \(P\) denote, respectively, the empirical measure and the probability measure of the \(n\) i.i.d. observations. For every \(t \in [0, \tau]\) and \((\nu, \sigma, \sigma_r)\) belonging to the compact set \(\Theta\), we can interchange integration and differentiation. Then \(\hat F_n(t) - F(t)\) can be written as in (17) in terms of \(M(t, L^*_i, B_i; \theta)\). The derivatives in (16) are uniformly bounded in probability for every \(t \in [0, \tau]\) and \((\nu, \sigma, \sigma_r)\) in \(\Theta\). Therefore, the class of functions involved is P-Donsker; here \(h(l^*, b)\) denotes the PDF of the joint distribution of \(L^*\) and \(B\).
Also, the second term on the right hand side (RHS) of (17) can be shown to be \(o_p(1)\). To calculate the variance of \(\mathbb{G}\), we write \(\mathrm{Var}\big[\sqrt{n}(\hat F_n(t) - F(t))\big]\) as the sum of the three terms in (19). The second term on the RHS of (19) is given in (18), and the third term can be calculated as \(n E\big[(\hat F_n(t) - \tilde F_n(t))(\tilde F_n(t) - F(t))\big]\), which is bounded by noting that \(|\tilde F_n(t)| \le 1\). The first term, \(\mathrm{Var}\big[\sqrt{n}(\hat F_n(t) - \tilde F_n(t))\big]\), can be calculated using \(E_{\theta_0}\big[\frac{\partial M(t, L^*_i, B_i; \theta)}{\partial \theta}\big]\) and the asymptotic variance \(\Pi^{-1}(\theta_0)\) of \(\hat\theta_{nA}\). Thus \(\mathrm{Var}\big[\sqrt{n}(\hat F_n(t) - \tilde F_n(t))\big]\) converges in probability as \(n \to \infty\) to