key: cord-0114589-zktc9m4w
authors: Xia, Lu; He, Kevin; Li, Yanming; Kalbfleisch, John D.
title: Accounting for total variation and robustness in profiling health care providers
date: 2019-07-17
journal: nan
DOI: nan
sha: 033535d77ed87b2ba8ac063ffeabaff1b4e6a66e
doc_id: 114589
cord_uid: zktc9m4w

Monitoring outcomes of health care providers, such as patient deaths, hospitalizations and readmissions to hospital, is important for assessing the quality of health care. Analyses based on random effects and fixed effects are commonly used to assess outcomes and identify providers with unusually poor (or good) outcomes. However, these approaches have shortcomings that can lead to unfair assessments of providers. One primary issue is that they do not consider variation between providers that is outside the providers' control, for example, due to unobserved patient characteristics that vary between providers. In this article, we review the existing fixed- and random-effects methods, and propose a smoothed empirical null approach that accounts for the total variation and adapts to different provider sizes. The linear model with random provider effects provides an illustration of the proposed empirical null method. This extends easily to other distributions to describe, for example, survival or binary outcomes. We also generalize the empirical null method to allow for some variation being due to quality of care. Methods are illustrated with numerical simulations and an example of a mortality measure in the Dialysis Facility Compare program.

In many instances, the quality of medical care differs considerably across providers, and patientlevel clinical outcomes are usually analyzed and summarized into provider-level measures in order to quantify and monitor such differences. Many important types of outcomes are being closely monitored. For example, in the Dialysis Facility Compare program by Centers for Medicare & Medicaid Services (CMS), mortality events are summarized in terms of the Standardized Mortality Ratio (SMR) for each facility, which compares the observed mortality at a dialysis facility to the mortality that would be expected given national death rates for patients with similar characteristics (UM-KECC, 2018) . Hospital readmission of dialysis patients is another outcome of interest. The rate of hospital readmission following a discharge can be indicative of the degree of coordination of post-discharge care between the facility and the hospital (He et al., 2013; Wish, 2014) .

Provider profiling based on clinical outcomes can help patients make better informed decisions in choosing health care providers, and also can aid overseers and payers in identifying providers whose outcomes are worse or better than a normative standard, in order to signal the need for further reviews or to target quality improvement programs. The statistical analysis along with further reviews may lead to changes in budgeting allocations and even suspension for those providers with extremely poor outcomes. Thus, it is of substantial importance to identify which providers should be "flagged" as having health outcomes that are "worse than expected" or "better than expected". Such evaluations require appropriate statistical methods that account for differences in patient characteristics and suitably allow for natural variation between providers. Regression models for patient-level outcomes are typically used to adjust for observed patient characteristics. With patients treated within providers, methods are often based on hierarchical random effects models, in which the provider-specific effects are modeled as independent and identically distributed random variables. A typical analysis of these models is based on the empirical Bayes "posterior" distribution of the random effect for each provider with the provider effect estimated with the posterior mean. We refer to this as the random-effects (RE) approach. As is well known, this analysis yields "shrinkage" estimates, in which the estimated provider effects are shrunk towards the overall average. Many authors have advocated this approach; see, for example, Normand et al. (1997) ; Normand and Shahian (2007) ; Jones and Spiegelhalter (2011); Ohlssen et al. (2007) . An alternative approach treats the provider effects as fixed and utilizes the maximum likelihood estimates of the provider effects. We refer to this as the fixed-effects approach (FE). In both approaches, it is common to use the estimate and its frequency or posterior distribution to assess the sharp null hypothesis that the provider in question has provider-level effect identical to a national norm. While both FE and RE methods have strengths, in many instances they have weaknesses that reduce their suitability for profiling providers. For instance, both methods fail to consider the natural variation between providers, and thus render unfair assessments across different provider sizes. Due to limited space, we present these crucial details in Section 2.

In this article, we primarily consider methods based on the empirical null that take into account the between-provider variation in a robust fashion. One important contribution is that we robustly model the between-provider variance as a function of provider size. The empirical null was introduced in Efron (2004) and Efron (2010) in order to account for over-dispersion of Z-statistics, and Efron (2007) described methods to estimate a single empirical null distribution in order to assess a collection of relatively homogeneous Z-statistics with applications, especially to genetics and controlling false discovery rates. However, this approach that estimates and refers to a single sampling distribution does not apply immediately when providers vary markedly in size. In previous work (Kalbfleisch and Wolfe, 2013; He et al., 2013) , this issue was addressed approximately through stratification. Here, we define an empirical null for each provider size essentially by smoothing the reference empirical null distributions. Our proposal provides a unified framework for provider profiling that can encompass various types of outcomes. Following Kalbfleisch et al. (2018) , this framework is generalized to handle the situation when a provider should be held accountable for an externally specified proportion of the total variation that is due to the quality of care.

This article is organized as follows. Section 2 considers a simple hierarchical linear model, and reviews FE and RE methods. Section 3 introduces the empirical null approach using stratification as well as methods that smooth the empirical null based on provider size, both in the linear model and in extensions to widely used non-linear models. In Section 4, we evaluate the empirical null methods through simulation studies with continuous and survival outcomes. Finally, we summarize the connections between methods and discuss their potential issues and extensions in Section 5.

2 Standard profiling methods based on the linear model Let i = 1, · · · , N index providers and j = 1, · · · , n i index patients within the ith provider. We consider an underlying linear regression model

where Y * ij represents the continuous outcome of interest, µ is the grand mean, α i is the provider effect, X ij is a vector of patient characteristics, and β is a vector of regression coefficients. We assume that large values of Y * ij correspond to poor outcomes. Our primary goal is to identify providers whose outcomes are worse than expected. Thus, in a hypothesis testing framework, we use one-sided tests with significance level ρ throughout this paper. For example, ρ = 0.05 in a one-sided test targeting poor outcomes corresponds to a two-sided test with significance level 0.1. Another typical choice is ρ = 0.025, which corresponds to often used two-sided tests with significance level 0.05.

In many instances, we assume that the provider effects α i 's are random, and conditional on

In this model, the parameters µ, σ α , σ w , and β are structural and appear in the probability laws for all observations. In what follows, we assume that the total numbers of providers, N , and patients, i n i , are large so that µ, σ α , σ w and β can be precisely estimated, and we proceed without considering the variation associated with their estimates. Thus, we replace Y * ij 's with the risk adjusted responses, Y ij = Y * ij −β T X ij , whereβ, based on fixed effects for the α i 's, is a consistent estimate of β.

In a fixed-effects (FE) analysis, we consider the provider effect α i in (1) as fixed with a constraint,

As noted above,Ȳ is an accurate estimate of µ as iŝ σ w of σ w . The FE Z-score for a test of α i = 0 is

which, if α i = 0 under the null, has a N (0, 1) distribution. This can be used as the reference distribution to flag unusual providers. Then, for example, one approach might be to flag the ith provider if Z FE,i > z ρ , where z ρ is the upper ρth quantile of the standard normal distribution and ρ is the nominal significance level of the related one-sided test. We use ρ = 0.05 throughout this article. Note that this approach provides a test of the sharp null hypothesis, and makes no allowance for any unexplained variation among the providers that is not due to the quality of care.

The usual random-effects (RE) analysis of (1) is based on the assumption (2) that α i iid ∼ N (0, σ 2 α ), i = 1, · · · , N , and that the provider effects α i are independent of patient characteristics X ij , j = 1, · · · , n i . The latter is an important assumption in RE analysis that is rarely met or noted. The confounding between α i and X ij can lead to substantially biased estimates of β and can alter the estimates of α i . One way to address such bias is to utilize the fixed-effects estimate of β, i.e.β, to provide an offset. As described above, we can obtain the risk adjusted response, Y ij , and then estimate σ 2 α , σ 2 w , µ and α i 's assuming the model Y ij = µ + α i + ij . Given the data, an empirical Bayes approach gives the approximate posterior distribution of

where R i = σ 2 α σ 2 α + σ 2 w /n i . This approach yields an estimateα RE i =R i (Ȳ i −Ȳ ) that is shrunk toward zero byR i . These RE estimates are conditionally biased; if the ith provider has true effect α i ,

The corresponding RE Z-score for a test of α i = 0 from (4) is

which has a posterior reference distribution N (0, 1). Thus, analogous to the flagging rule for FE in Section 2.1, we flag a provider using the RE approach if Z RE,i > z ρ . Note that this N (0, 1) reference is based on the posterior distribution (4). In fact, the sampling distribution for Z RE,i given α i = 0 is approximately N (0, R i ) rather than N (0, 1).

FE and RE approaches are most appropriate when all of the variation in the α i 's is due to the quality of care, so that the risk adjustment has been complete. In many instances, however, much of the variation between providers is not due to the quality of care, but rather to factors that are outside the providers' control. Jones and Spiegelhalter (2011) and Kalbfleisch et al. (2018) explicitly stressed that this variation should be taken into account in profiling. In the simple case of the linear model (1), an approach can be based on FE estimates while accounting for their total variances. We refer to this as the FERE approach. The corresponding Z-score is

Analogous to FE and RE approaches, one might flag the ith provider if Z FERE,i > z ρ . This is not a test of the sharp null hypothesis α i = 0, but evaluatesȲ i based on the marginal distribution in (1), or assesses whetherȲ i could reasonably have arisen from the model (1).

FE and RE methods are often used in profiling medical providers and are most appropriate when all of the variation in provider effects is due to the quality of care provided. Both methods make reference to a sharp null hypothesis, typically of H 0i : α i = 0. Usually based on the hierarchical model, the RE approach is often thought to account for the variation between providers. It has been promoted largely on the basis of the well-known result that, compared to FE, it stabilizes and improves overall precision of estimation of the provider effects, which is considered to benefit estimation and prediction performance by borrowing information from other providers (Jones and Spiegelhalter, 2011; Efron and Morris, 1973; Louis, 1991; Ash et al., 2012) . In other words, the overall mean squared error of the estimated provider effect is reduced by shrink-

This result is important in some contexts, but not in the context of profiling. Consider, for example, the conditional mean squared error (MSE), Figure 1 , where σ α = 1, σ w = 5 and n i = 100. The overall MSE reduction by RE is obtained as an average of substantial losses in MSE for extreme values of α i coupled with modest gains for the more frequent values of α i closer to 0. This is an example of an overall average missing important local features. However, for profiling, it is the extreme values that are of primary interest. For this reason among others, when the between-provider variation is entirely due to the quality of care, the FE estimates appear preferable for profiling. Although the FE approach desirably produces unbiased provider effect estimates, it suffers from a well-documented phenomenon, "overdispersion", which usually refers to the fact that the FE Z statistics exhibit higher variability than the FE model would suggest (Efron, 2004; Spiegelhalter et al., 2012) . We can compute the marginal variance of the FE Z-score assuming randomness in the α i 's:

which shows the variability in FE Z-scores, especially among large providers, is much larger than the reference N (0, 1) distribution. The FE approach does not take into account such unexplained variation and usually assesses the sharp null hypothesis of a null provider effect. A large sample size can translate into more accurate estimation of the provider effect, and any deviation in α i from 0 will eventually be detected by FE as n i → ∞. This explains why the FE approach disproportionately flags more large providers even when some of their provider effects are very small. Similarly, the marginal variance of Z RE,i increases linearly with n i , and the problem of spuriously large flagging rates for large providers persists. With a large provider size n i , FE and RE are nearly identical. Hence, it is important to assess the degree of overdispersion and/or the natural variability between providers. In many instances, much of the variation in α i 's is outside the control of providers and cannot be attributed to the quality of care. This arises, for example, when there are unmeasured covariates associated with the patients that are strongly predictive of outcomes and that vary substantially between providers. Factors such as socio-economic status, genetic differences and comorbidities can vary substantially across providers, yet are generally poorly measured. When most of the variation between providers is outside of the providers' control, the FERE approach is preferred and its advantages have been discussed in the literature before, often in the context of funnel plots illustrating the way in which increasing provider size affects variation in the FE estimates (Jones and Spiegelhalter, 2011) . Providers that give rise to a large value of Z FERE,i have extreme values with reference to the population of all providers, including all sources of variation.

However, the usual FERE approach is not robust. When there exists a substantial proportion of outlying providers with extreme outcomes, the between-provider variance σ 2 α is overestimated, which compromises the power of the FERE approach in identifying extreme providers. Our aim in the next section is to outline the empirical null approach that imposes a fair assessment of all providers, is efficient in detecting extreme provider effects of real interest, and is robust to outliers. It should be noted that it would also be possible to develop robust estimates of σ 2 α (e.g. Koller 2016) and so improve the FERE method; yet generalizing them to non-linear models would be very difficult.

3 Profiling based on the empirical null Similar to FERE, the empirical null (EN) method is designed to address overdispersion and allows for between-provider variation in identifying outliers. The empirical null was introduced in Efron (2004) and Efron (2010) in the context of multiple testing and controlling false discovery rates. A stratified EN approach was previously implemented, in Kalbfleisch and Wolfe (2013) and He et al. (2013) for time-invariant profiling, and in Estes et al. (2018) for time-dynamic profiling. The EN uses variation in the FE Z-scores, Z FE,i , i = 1, . . . , N , as an appropriate metric for assessing extreme values.

Suppose that the model (1) holds, with α i iid ∼ N (0, σ 2 α ) and independent of ij 's, for the large majority of providers under consideration. We refer to the majority rather than all of the providers since we aim to reduce or eliminate the impact of outlying providers. For now, we assume all providers have an equal number of patients, i.e. n i = n for i = 1, · · · , N . The empirical null distribution is defined as the normal distribution with mean,μ M , and variance,σ 2 M , which is estimated robustly from the fixed-effects Z-scores Z FE,i , i = 1, . . . , N . This distribution is used in place of the reference N (0, 1) for hypothesis testing. For example, the ith provider is flagged as "worse than expected" if Z FE,i >μ M + z ρσM .

To obtain robust estimatesμ M andσ 2 M , we adapt the "MLE fitting" method (Efron, 2007) . This is based on a mixture model in which a proportion p M of providers have Z-scores arising from the empirical null distribution, N (µ M , σ 2 M ), whereas the remainder, including any outliers, come from another non-null distribution. The target quantities µ M and σ 2 M are calculated based on maximizing the mixture likelihood so that the estimated mean and variance are robust to outliers. More specifically, and following Efron (2007) , we assume that the non-null distribution has support outside an interval

M ) are reasonable initial estimates for (µ M , σ M ), and z 0 > 0 is specified (e.g. 1.2, 1.64, 1.96, or 2). Let

is the cumulative distribution function of N (0, 1), and θ = pQ(µ, σ). Then the likelihood based on the observed Z-scores is

One can numerically find the MLE

For this purpose, we follow Algorithm 1, which computes the profile likelihood of p M and guarantees thatp M does not exceed 1. The maximization in Step 2 is over a 2-dimensional parameter and is very fast using existing optimization methods (e.g. Nelder and Mead 1965) .

A robust M-estimation technique can be used to obtain the preliminary estimates (Huber, 1964 (Huber, , 1973 Andrews et al., 1972) , and we use the bi-weight function, which is the default in SAS® ROBUSTREG procedure (Chen, 2002) . The robust M-estimation is simple, but can overestimate the variance with many outliers present, and so we prefer the approach outlined above. We have also found that moderate values of z 0 such as Φ −1 (0.9) = 1.28 or Φ −1 (0.95) = 1.64 avoid including outliers while allowing good estimation of µ M and σ M .

Next we consider a more realistic setting where the sample sizes n i , i = 1 . . . N vary across providers. Since the variability of FE Z-scores depends on n i , we first stratify providers into a few groups based on their sample sizes, as is proposed in Kalbfleisch and Wolfe (2013) and He et al. (2013) . The robust strategy outlined above can be applied to each of the groups separately. For instance, we might stratify the providers into three groups based on the tertiles of sample sizes {n i }. Within group g, we obtain the estimatesμ M,g andσ M,g , and an empirical null distribution N (μ M,g ,σ 2 M,g ), g = 1, 2, 3. Each health care provider is compared to the empirical null corresponding to its group.

As an illustration, we consider mortality data on patients treated at dialysis facilities in the U.S. over four calendar years, from 2012 to 2015. This involves an extension of the EN methods to nonlinear models as is discussed in detail in Section 3.3 below. The Standardized Mortality Ratio (SMR) for a given facility is the ratio of the observed number of deaths in that facility to the expected number under a national norm. Thus, it assesses whether that facility is above or below the national, and by how much. We include 6,363 facilities with expected number of deaths of 3.0 or more. For these, the number of observed deaths ranges from 0 to 581, the expected deaths from 3.0 to 308.6, and the facility sizes ranges from 6.9 to 1569.8 patient-years. Facility size tertiles, defined by cut points at 156.9 and 302.8 patient-years, stratify the facilities into small, medium and large facilities. Technical details on computation of the expected number of deaths, SMR, p-values and FE Z-scores are described in Section 3.3.

The estimated means of one-sided Z-scores are −0.02, −0.07 and −0.13 for the small, medium and large groups respectively, and the corresponding variance estimates are 1.23 2 , 1.40 2 and 1.61 2 (see Figure 2 ). These estimates define the empirical null distributions for the three strata. From Figure 2 we find that the empirical null distributions are more dispersed than the standard normal, and that the variation of Z-scores increases with provider size. As a rule for flagging providers with poor outcomes, we might use the upper 5% quantiles in the empirical null distributions. Those facilities with Z-scores larger than the corresponding critical values are flagged. Figure 3 visualizes the stratum-specific critical values of Z-scores in the scatter plot.

The stratified empirical null approach has some limitations, as evident from Figure 3 . First, the choice of three groups is arbitrary and a different number of strata will result in some changes in the list of providers being flagged. A second related problem is the discontinuity of the critical region at the stratum boundaries. Consequently, two providers near a boundary may be of similar size and have similar Z-scores, yet one of them may be flagged and the other not, due only to the arbitrary choice of boundaries. To overcome these issues, we model the mean and variance of the empirical null distributions as smooth functions of provider size, and use robust techniques to lessen the impact of potential outliers. We continue using the SMR example for illustration throughout this section.

To estimate the regression of the variance on the sample size, we first, obtain variance estimates in each of G groups defined by quantiles of provider sample size. We then regress these variance estimates on the median sample size in each group. Specifically, we proceed as follows. Within the gth group, we use the local "MLE fitting" described in Section 3.1 to estimate the meanz g and varianceσ 2 g of Z-scores, and letm g represent the median sample size in this group, g = 1, · · · , G. Focusing first on the variance estimates, we fit a regression model with variance estimates {σ 2 g } as dependent and median sample sizes {m g } as independent variables. Based on our empirical studies and theoretical derivations, a linear regression of the form γ 0 + γ 1mg will typically suffice. We include weights in the regression in order to reduce the leverage of the variance estimates with large variability corresponding to large providers. Algorithm 2 describes the iteratively reweighted approach and we denote the fitted variance values byσ 2 i , i = 1, · · · , N . One important issue in this approach is the choice of the total number of groups G. A small G will not provide sufficient information for estimating regression coefficients precisely in a linear model, while a large G results in each group having a very small number of providers; thus, the variance estimation is unstable. From our experience with simulations of the SMR example, we find it satisfactory to choose G so that there are 50 to 300 providers in each group. Based on our experience, estimation of the regression line is stable over this range of G.

To estimate the mean Z-score as a function of provider size, we use a weighted smoothing technique such as smoothing spline, B-spline or LOESS on data {(z g ,m g )} G g=1 . Here, the weight associated with the estimatez g in the gth group is inversely proportional to the variance estimatê σ 2 g from Algorithm 2. One can easily find existing implementations for these methods, for example, smooth.spline() for smoothing spline in R. For the provider size that falls outside the range of {m g } G g=1 , one may extrapolate either with the estimated linear function or with a plateau so that the mean function remains continuous and flat. The fitted Z-scores from smoothing are denoted asẐ i , i = 1, · · · , N .

We propose this group-based smoothing instead of direct smoothing based on the original Z-scores, because the MLE fitting within each group is more robust against potential outliers.

Examination of the ith provider may proceed based on the individualized empirical null distribution N (Ẑ i ,σ 2 i ). Similar to other methods, the ith provider is flagged as worse than expected if Z FE,i > Z i + z ρσi , for a one-sided test nominal level ρ. Smoothing provides a better approximation to the total variance of Z-scores and avoids any unfairness associated with simple stratification. When the normal linear model (1) holds for all providers, this approach gives an almost identical result to the FERE approach. Figure 3 shows the critical line for flagging providers with poor outcomes using the smoothed empirical null (z ρ = 1.64).

Many types of outcomes are monitored for quality, including patient hospitalization or readmission, and failure time endpoints such as patient death or graft failure in a kidney transplant. Measures of mortality are typically based on a survival model such as the Cox proportional hazards model that allows adjustment for patient characteristics. Similarly, measures of hospital readmission within thirty days following a hospital discharge are based on a model for binary outcomes such as logistic regression. The empirical null approach can be readily extended to non-linear models and provides a unifying framework for profiling providers.

In this section, we describe an extension to measures of mortality. Similar implementations and comments are directly applicable to other measures. FE and RE methods analogous to those in the linear model are sometimes used in these contexts, but RE methods tend to be complicated and subject to the same concerns as described above for the linear model. The key to applying the empirical null is to obtain FE Z-scores. Because provider effects are typically included in models for profiling, one option is to estimate these fixed effects, and FE Z-scores can be obtained directly as Wald-type test statistics for the sharp null hypothesis that the true provider effect is the same as a population norm. Alternatively, Z-scores can be obtained by converting p-values based on fixed effects.

As an example, THE SMR assesses patient death events of a health care provider. For the ith provider, we denote the observed number of deaths by O i . A two-stage modeling procedure is used to obtain the expected number of deaths under the assumption that patients at this provider experience death events at the national average rate (UM-KECC, 2018). The following is the approach taken by CMS in Dialysis Facility Compare.

In the first stage model, the hazard function of the jth patient in the ith provider is assumed to be λ ij (t) = λ 0i (t) exp{X T ij β}, where λ 0i is a provider-specific baseline hazard, and β represents the regression coefficients associated with the observed patient-level characteristics X ij such as age, gender, race, BMI and a selected set of comorbidities. This stratified Cox model is fitted to the national data in order to estimate β , denoted asβ. Stratification by providers is important to accurately estimate the within provider effects of covariates, β.

In the second stage, the "population-average" cumulative baseline hazard Λ 0 (t) = t 0 λ 0 (u)du is estimated through an unstratified Cox model with an offset ofβ T X ij obtained from the first stage. Conditional on patient characteristics X ij and at-risk process Y ij (t), the "expected" number of events for the jth patient in the ith provider is calculated as

where τ is the maximum follow-up time. For the ith provider, the expected number of events is E i = n i j=1 E ij and the corresponding SMR is estimated by SM R i = O i /E i . If SM R i > 1(< 1), the ith provider experiences more (fewer) deaths than expected under the national norm, based on the observed characteristics of patients. Note that E ij is actually a sum of conditional expectations and is in fact the compensator in the martingale corresponding to the counting process for the (i, j)th individual.

A test of the sharp null hypothesis H 0i : SM R i = 1, where SM R i is the true underlying SMR for the ith provider, can be obtained using a Poisson approximation whereby the number of events O i is assumed to follow a Poisson distribution with mean E i . In this case, the one-sided mid p-value for the alternative hypothesis H 1i : SM R i > 1 is

where X ∼ P oisson(E i ) and these can be converted to Z-scores using Z FE,i = Φ −1 (1 − p i,1 ). By this convention, large values of Z FE,i are associated with poor outcomes. Note that mid p-values are used here to avoid difficulties in converting p-values to Z-scores. We choose the Poisson-based p-values in this example rather than a normal approximation with Z-scores

are more accurate when the provider size is small.

3.4 Allowing some of the variation to be due to quality of care Similar to FERE, the empirical null approach presented above takes account of the total variation in the Z-scores. This is appropriate when most or all of the between-provider variation is due to incomplete risk adjustment, as opposed to the quality of care. More generally, Kalbfleisch et al. (2018) suggests introducing a value, λ, that represents the proportion of the between-provider variance that is due to incomplete risk adjustment, and holding providers accountable for a proportion of 1 − λ of the between-provider variation. Since λ is not estimable, its value might be based on expert opinion. In the linear model (1), we write α i = α i1 + α i2 , where α i1 ∼ N (0, (1 − λ)σ 2 α ) represents the portion of the effect due to the quality of care whereas the independent effect, α i2 ∼ N (0, λσ 2 α ), is variation that is outside the provider's control. In this case, it is natural to base profiling on an assessment of the hypothesis H λ : α i1 = 0. Under this hypothesis, the null distribution for Z FE is N (0, (λσ 2 α + σ 2 w /n i )/(σ 2 w /n i )). This is a natural generalization that connects the FE and FERE approaches discussed in Section 2, which correspond respectively to λ = 0 and λ = 1. Furthermore, for a general 0 ≤ λ ≤ 1, the null distribution for Z FE,i can also be written as

, the shrinkage factor defined earlier, is also referred to as the inter-unit reliability (IUR).

Analogous to the relaxation in the linear model above, we can extend the idea of decomposing the between-provider variance to the empirical null in non-linear models. Suppose we have obtained the empirical null distribution for a provider, N (Ẑ i ,σ 2 i ). The IUR, in general, represents the proportion in the total variance that the between-provider variance takes, and can be easily computed in non-linear models . Then, for the ith provider with an IUR equal to r i , the variance to be allowed iŝ

Additionally, if the FE Z-scores are computed as Wald statistics or asymptotically equivalent statistics by assuming fixed provider effects, the variance in the reference distributionσ 2 i can be approximated by 1/(1−r i ) when all between-provider variation is due to incomplete risk adjustment (λ = 1), andσ 2 λ,i in (9) can also be written asσ 2 λ,i = 1 − λ + λσ 2 i . The new reference distribution allowing a proportion λ of the between-provider variance is N (Ẑ i ,σ 2 λ,i ). Note that the incorporation of λ simply changes the critical value of the test based on the empirical null.

In this section, we examine the performance of the empirical null method through simulation studies. First, we restrict all providers to have the same sample size, and compare the probability that providers give rise to a signal under different approaches. Second, to illustrate the use of empirical null in non-linear models, we simulate survival outcomes to mimic the real data example SMR.

We assume the true model (1), with µ = 0, β = 0, ij ∼ N (0, 16), and α i ∼ N (0, 1) for i ≥ 2 with α 1 fixed at a value varying from 0 to 3.5. We simulate N = 200 providers of size n i = n for all i, where n = 10, 25, 50, 100. The simulation is repeated 1000 times. For one-sided tests, providers whose Z-scores exceed the corresponding critical value Φ −1 (0.95) = 1.64 are flagged as worse than expected. Figure 4 shows the empirical probability of signaling provider 1 for FE, RE, FERE and EN approaches and under different scenarios. The FE approach flags provider 1 with the highest probabilities in all cases. As expected, the difference between FE and RE diminishes for a large sample size n. Without outlying providers, FERE and EN result in almost identical probabilities of signaling, reflecting their asymptotic equivalence in this setup, as noted above. For a large sample size, e.g. n = 100, standard FE and RE methods signal provider 1 with moderate to high probability even with relatively small values of α 1 , say α 1 = 0.5 or 1. These values of α 1 are well within the range of variation expected for α i under the true model. On the other hand, FERE and EN allow for this variation in the outcomes and do not signal with very high probability until α 1 > 2.0, that is until the effect is in the tail of the distribution of provider effects (σ α = 1). It should be noted that the exact probability of flagging can be easily calculated and plotted for all methods except EN. We present the empirical probability in all cases to facilitate fair comparison.

To illustrate the robustness of EN compared to FERE, we also simulate N = 3, 000 providers with 5% outliers. Still, we assume model (1) for the majority of providers, with µ = 0, β = 0, σ w = 4, and σ α = 1 except that α 1 is fixed at a value varying from 0 to 3.5σ α . Half of the outliers have provider effect α i = 4σ α and the other half α i = −4σ α , well outside the true distribution for the majority of provider effects. Sample size n i is simulated from a uniform distribution on integers {10, 11, · · · , 150} for all providers except provider 1. For provider 1, its sample size n 1 = 25, 50, 100, 125. The EN is smoothed using the methods in Section 4. The simulation is repeated 1000 times. Figure 5 plots the proportion of times that provider 1 is flagged in the presence of 5% outliers. In this setting, the EN method results in almost identical performance compared to the case when the true variance parameters σ α and σ w are known (black dashed line), and is more robust than FERE, which has lower flagging proportions due to over-estimation of the between-provider variance, especially when α 1 is large.

In this subsection we consider a realistic situation where providers are of different sizes and the smoothed empirical null method of Section 3.2 is needed. Survival outcomes that mimic the SMR example in Section 3.3 are simulated. A similar study where providers are of the same size is presented in the Supplementary Materials Section S1.

We generate N = 2000 providers whose sample sizes are simulated from a uniform distribution on integers in [10, 200] , and then fixed throughout all replications. For the jth subject in the ith provider, the survival time T ij follows an exponential distribution with hazard λ ij = 0.1 × exp{α i + X ij1 β 1 + X ij2 β 2 }, where β 1 = 1 and β 2 = −1,α i iid ∼ N (0, 0.2 2 ) are the provider effects, and the covariates X ij1 and X ij2 are independent N (0, 1) variables. The censoring time C ij iid ∼ U nif (10, 30), which generates approximately 27% censoring. We fit a Cox PH model with facilities as strata and obtain regression coefficient estimates for the covariates. The raw p-values and corresponding Z-scores are computed as described in Section 3.3. We implemented the smoothed empirical null approach (in Section 3.2) with different numbers of groups G = 5, 20, 40, 60. Linear regression models are fitted to the group-wise variance estimates of Z-scores, and weighted smoothing splines to the group-wise mean estimates. We repeat the simulation 500 times. Figure 6 (a) and Figure 6 (b) show the estimated mean and variance functions of Z-scores in one replication when the number of groups G = 20. Results with different G in Section S2 in the Supplementary Materials show no sensitivity to the selection of G when it lies within an appropriate range of 5 to 60. Our proposed method captures the main features of the mean and variance functions while being smooth enough to provide consistent flagging rules within the considered range of provider size. The proportion of times providers are flagged across the 500 simulation replications is summarized in Figure 6 (c), stratified into three groups by provider size. The FE approach flags a provider if its one-sided p-value is less than 0.05, which unfairly flags over 25% of the large providers and only about 15% small providers. In contrast, the EN approach has very stable flagging rates around 6% for all three groups.

If it is assumed that a proportion λ of the between-provider variance is due to incomplete risk adjustment, then the methods of Section 3.4 can be used. Following the same simulation setup with survival outcomes above, we consider λ = 0, 0.5, 0.75, 1 and modify the allowed varianceσ 2 λ,i in the reference distribution accordingly. Figure 7 shows box plots of the proportion of times in 500 replications that providers are flagged, stratified into three groups. The empirical null approach with the relaxed factor λ can be viewed as a hybrid of the fixed-effects analysis and the empirical null with total variance (λ = 1), hence its flagging rates lie between the latter two. When λ = 0.5, 0.75, the flagging rates increase with provider size, a feature inherited from FE.

In this article, we examine the performance of random-and fixed-effects methods for profiling health care providers, and propose a smoothed empirical null approach that takes full account of the between-provider variation. In the EN and the related FERE approach, the total variation between providers is taken into account in profiling and these methods are useful when primary interest is in identifying outliers. On the other hand, an assessment of the ith provider by FE and RE approaches implicitly assume that all of the between-provider variation is due to the quality of care. In most applications, we believe that much of the variation between providers is not under the providers' control and that the methods based on FERE or the empirical null are often appropriate. Importantly, however, it is possible to relax the EN approach to allow for a given proportion of the between-provider variation to be due to the quality of care. The EN approach is more robust than FERE in that it involves a robust estimation of model parameters. Within the normal model, this deficiency in FERE could be remedied by introducing robust estimation in linear mixed-effects models as discussed, for example, in Koller (2016) .

Generalization of RE and FERE methods to non-linear models, such as the binary logistic model or the Cox proportional hazards model, is complicated. For example, methods developed for assessing hospitals based on readmission rates were based on an RE analysis of a hierarchical binary logistic model Horwitz et al. (2011) , and entailed complicated bootstrap techniques to assess significance. The EN method, however, generalizes immediately as described in Section 3.3, and is applied to a more complicated binary logistic model for hospital readmissions of dialysis patients in He et al. (2013) .

In the EN approach, we have assumed that there are a large number of providers to be profiled and that the central part of the histogram of z-scores is well described by a normal distribution. These two assumptions are satisfied in many applications. When the number of facilities is much smaller, it may be important take into account uncertainty in the estimation of the empirical null distribution. Also, in some instances, it may be useful to use a transformation of the basic measure in order to achieve approximate normality of the empirical null. In other cases, there may be situations where a non normal distribution that incorporates skewness, for example, is more appropriate. These are areas where additional work is needed.

As noted earlier, an FE analysis to estimate β and then use of an offset is one approach to correct for the confounding between covariates and provider effects in the RE method. An alternative approach is to include both a within and a between regression coefficient in the model Y ij = µ + α i + β T w (X ij −X i ) + (β b + β w ) TX i + ij as described in Neuhaus and Kalbfleisch (1998) . A feature which sometimes arises is that the between-provider regression coefficient, β b = 0, in which case the within coefficient does not make a full adjustment for the covariates under consideration. This would arise, for example, if even having adjusted for a variable like race, one found that there was still a difference between providers according to the racial mix that they treat. Such situations require careful consideration, and a discussion of issues associated with this can be found in Kalbfleisch et al. (2018) . Step 1 Specify a grid of K values for p (e.g. [0.5, 1] by increment of 0.001), denote these as {p (1) , p (2) , · · · , p (K) }.

Step 2 For each k = 1, 2, · · · , K, compute (k) = max µ,σ L(µ, σ, p (k) ) and record the maximizer (μ (k) ,σ (k) ).

Step 3 Findk = argmax k (k) . Then (μ M ,σ M ,p M ) = (μ (k) ,σ (k) , p (k) ). Algorithm 2 Iteratively re-weighted algorithm for the variance function

Step 1 Set t = 0 and let (γ

1 ) be the ordinary least squares estimates.

Step 2 Find (γ 

where ω (t)

1m g ) −2 and N g is the number of providers in the gth group.

Step 3 Check for convergence and if not, update t ← t + 1 and return to Step 2. Figure 7 : Boxplots of empirical probability of signal summarized over 500 replications, all providers stratified into three groups by size. A proportion λ = 0, 0.5, 0.75, 1 of the between-provider variance is assumed to be due to incomplete risk adjustment, and the rest due to the quality of care received. λ = 0 corresponds to the fixed-effects analysis.

Robust Estimates of Location: Survey and Advances

Statistical issues in assessing hospital performance. Commissioned by the Committee of Presidents of Statistical Societies for the Centers for

Assessment-Instruments/HospitalQualityInits/Downloads/Statistical-Issues-in-Assessing-Hospital-Performance.pdf

Paper 265-27 Robust regression and outlier detection with the ROBUSTREG procedure

Large-scale simultaneous hypothesis testing: the choice of a null hypothesis

Size, power and false discovery rates

Large-scale inference: empirical Bayes methods for estimation, testing, and prediction

Stein's estimation rule and its competitors -an empirical Bayes approach

Time-dynamic profiling with application to hospital readmission among patients on dialysis

Evaluating hospital readmission rates in dialysis facilities; adjusting for hospital effects

Inter-unit reliability for nonlinear models

Hospital-wide (all-condition) 30-day risk-standardized readmission measure

Robust estimation of a location parameter

Robust regression: asymptotics, conjectures and Monte Carlo

The identification of "unusual" health-care providers from a hierarchical model

Does the inter-unit reliability (IUR) measure reliability?

On monitoring outcomes of medical providers

robustlmm: an R package for robust estimation of linear mixed-effects models

Assessing, accommodating, and interpreting the influences of heterogeneity

A simplex method for function minimization

Between-and within-cluster covariate effects in the analysis of clustered data

Statistical methods for profiling providers of medical care: issues and applications

Statistical and clinical aspects of hospital outcomes profiling

A hierarchical modelling framework for identifying unusual performance in health care providers

This work was partially supported by The Centers for Medicare and Medicaid Services [contract number HHSM-500-2008-000211]. Conflict of Interest: None.

Spiegelhalter, D. J., Sherlaw-Johnson, C., Bardsley, M., Blunt, I., Wood, C., and Grigg, O. (2012) .Statistical methods for healthcare regulation: rating, screening and surveillance. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175 (1)