Meta-analysis of dichotomous and ordinal tests without a gold standard

Enzo Cerullo, Hayley E. Jones, Olivia Carter, Terry J. Quinn, Nicola J. Cooper, Alex J. Sutton

2021-03-11

Abstract: Standard methods for the meta-analysis of medical tests without a gold standard are limited to dichotomous data. Multivariate probit models are used to analyse correlated binary data, and can be extended to multivariate ordered probit models to model ordinal data. Within the context of an imperfect gold standard, they have previously been used for the analysis of dichotomous and ordinal tests in a single study, and for the meta-analysis of dichotomous tests. In this paper, we developed a hierarchical, latent class multivariate probit model for the simultaneous meta-analysis of ordinal and dichotomous tests without assuming a gold standard. The model can accommodate a hierarchical partial pooling model on the conditional within-study correlations, enabling one to obtain summary estimates of joint test accuracy. Dichotomous tests use probit regression likelihoods and ordinal tests use ordered probit regression likelihoods. We fitted the models using Stan, which uses a state-of-the-art Hamiltonian Monte Carlo algorithm. We applied the models to a dataset in which studies evaluated the accuracy of tests, and test combinations, for deep vein thrombosis. We first demonstrated the issues with dichotomising test accuracy data a priori without a gold standard by fitting models which dichotomised the ordinal test data, and then we applied models which do not dichotomise the data. Furthermore, we fitted and compared a variety of other models, including those which assumed conditional independence and dependence between tests, and those assuming a perfect and an imperfect gold standard.

Deep vein thrombosis (DVT) has the potential to be life-threatening. A potential complication of DVT, occurring in up to a third of patients 18, is pulmonary embolism (PE). PE occurs when a blood vessel in the lungs becomes blocked by a blood clot (formed as a result of DVT) which has migrated from the legs to the lungs. Contrast venography is generally considered to be a gold standard for DVT, as it is almost 100% sensitive and specific 19, 20. However, it is not commonly used in clinical practice because it is time consuming and invasive 19, 20. Instead, ultrasound is often used to diagnose DVT, since it is non-invasive and cost-effective 18, 21, 22, 23. However, it is less accurate than contrast venography 24, 25 for both distal and proximal DVT, with its sensitivity being lower for distal DVT 24. Furthermore, although ultrasound is known to have a very high specificity, it is nonetheless imperfect 24, 25.

A commonly used 18 screening tool for DVT is a questionnaire called the Wells score 26, which groups patients into one of three risk categories - 'low', 'intermediate', or 'high'. Another DVT test is the D-Dimer assay: a blood test measuring the amount of a protein fragment called D-Dimer, higher concentrations of which are indicative of DVT. Despite being considered to be generally more accurate than the Wells score 27, the D-Dimer assay is intended to be used for screening as opposed to diagnosis 27, since a number of other conditions can elevate serum D-Dimer concentrations 18. Investigating the joint test accuracy of the aforementioned tests for DVT is important for a variety of reasons.
The Wells and D-Dimer are both relatively cheap, quick and non-invasive to carry out, particularly the Wells test. A combined screening approach utilising the Wells and D-Dimer may be more cost-effective, and may reduce test burden for patients, compared to using either alone. Furthermore, although neither the Wells nor the D-Dimer alone is generally considered to be a diagnostic tool for DVT, they may have diagnostic potential when combined 17, 24, 28. An example of a potential screening strategy is to use the Wells before the D-Dimer in the diagnostic pathway, as a pre-screening tool to rule out individuals at low risk of DVT. Under this strategy, individuals who score as intermediate or high risk are subsequently screened using the D-Dimer assay, and only patients who also test positive on the D-Dimer undergo ultrasound. Another potential strategy is to refer patients scoring as high risk on the Wells score directly to ultrasound. Both of the aforementioned joint testing strategies are examples of 'believe the negatives' (BTN) strategies 17. This is a testing strategy where only those patients who test positive on an initial test go on to receive a second test, and only individuals who also test positive on the second test are considered positive. Conversely, 'believe the positives' (BTP) is a testing strategy where only those patients who test negative on the first test go on to receive a second test, with only those patients who also test negative on this test being considered negative. Joint testing strategies are important across clinical areas besides DVT, for example for depression screening and for COVID-19 - see discussion section 5.1 for more details.

Novielli et al 17 proposed a statistical model to conduct a meta-analysis of studies investigating the D-Dimer, Wells score and ultrasound for DVT. The proposed model allowed them to model the Wells score without dichotomising the data whilst modelling the conditional dependence between tests, enabling them to estimate summary-level joint test accuracy. However, their model assumes that ultrasound is a perfect gold standard, which could have led to biased estimates of the performance of the other tests under evaluation. Novielli et al 17 carried out several analyses based on different datasets - for instance, one based on the 11 studies which directly compared the D-Dimer and Wells score against the gold standard (ultrasound), and another which also included studies which only analysed one of the Wells or D-Dimer tests, and utilised indirect comparisons. In section 4 of this paper, we re-analyse the direct comparisons data (see table 1) from Novielli et al 17 without assuming a perfect gold standard, using a variety of models we propose in section 3; namely, models which dichotomised the Wells score and models which treated it as an ordinal test, models which assumed conditional independence and dependence between tests, and models which assumed ultrasonography to be perfect or imperfect. This dataset consisted of 11 studies, with a total of 4,096 individuals and 12,288 observations, with all 11 studies evaluating all three tests.
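To make the BTN and BTP combination rules described above concrete, the following R sketch computes the joint sensitivity and specificity of two hypothetical tests under conditional independence (an assumption the models in section 3 relax); all accuracy values are illustrative and are not estimates from this paper.

```r
# Joint accuracy of two tests combined under the BTN and BTP rules,
# assuming conditional independence within each disease class.
# All accuracy values are illustrative, not estimates from this paper.
se <- c(t1 = 0.90, t2 = 0.80)  # sensitivities of tests 1 and 2
sp <- c(t1 = 0.70, t2 = 0.85)  # specificities of tests 1 and 2

# BTN ('believe the negatives'): positive only if BOTH tests are positive
se_btn <- se[["t1"]] * se[["t2"]]                  # P(both + | diseased)
sp_btn <- 1 - (1 - sp[["t1"]]) * (1 - sp[["t2"]])  # 1 - P(both + | non-diseased)

# BTP ('believe the positives'): negative only if BOTH tests are negative
se_btp <- 1 - (1 - se[["t1"]]) * (1 - se[["t2"]])  # 1 - P(both - | diseased)
sp_btp <- sp[["t1"]] * sp[["t2"]]                  # P(both - | non-diseased)

round(c(se_btn, sp_btn, se_btp, sp_btp), 3)
#> 0.720 0.955 0.980 0.595
```

As the output shows, BTN trades sensitivity for specificity and BTP does the reverse, which is why the two strategies suit rule-in and rule-out roles, respectively.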
Before describing our proposed Bayesian MVP-LC model, we will first define some terminology and notation in section 3.1. For a formal model specification, please refer to the full technical model specification (supplementary material 1).

The model is for a meta-analysis dataset with a total of S studies, with each study having a total of N_s individuals, where s is an index for study, so s can be used to denote anything between the first (s = 1) and the last (s = S) study. Each study is assessing the same number of tests, T. We will use t as an index for test, which can be between 1 and T, and n as an index for individual, which can be between 1 and N_s for study s. For the nth individual from study s, we will denote the vector of observed test responses as y_{s,n} = (y_{s,n,1}, ..., y_{s,n,T})'. Each test is either dichotomous or ordinal. For dichotomous tests, each observed test response, y_{s,n,t}, is coded as 0 and 1 for negative and positive results, respectively. For ordinal tests, each test t has K_t categories (hence K_t − 1 cutpoints). We will use k as an index to refer to any given cutpoint, which can be between 1 and K_t − 1. Ordinal test responses are coded according to the category that the individual's test result falls in: in other words, y_{s,n,t} = k if the test result falls in the kth category for test t. Since we are assuming an imperfect gold standard, the true disease status of each individual, d_{s,n}, is not defined by the results of the gold standard. Instead, it is modelled as an unknown (i.e. latent) variable, and belongs to one of two classes - 'diseased' (d_{s,n} = 1) or 'non-diseased' (d_{s,n} = 0).

We will now define the MVP-LC model within each study. For dichotomous data, MVP-LC models work by transforming the observed, discrete test result data into a continuous variable - a statistical technique known as "data augmentation" 15. This augmented continuous data, which we will denote as Z_{s,n}, is an unobserved latent variable, like the disease status d_{s,n}. This data augmentation process allows us to assume that, conditional on d_{s,n}, the unobserved, augmented test accuracy data Z_{s,n} can be modelled by a multivariate normal (MVN) distribution with mean vector ν_s^{[d]} and correlation matrix Ψ_s^{[d]}:

Z_{s,n} | d_{s,n} = d ∼ MVN(ν_s^{[d]}, Ψ_s^{[d]}).    (1)

Each off-diagonal element Ψ_{s,t,t'}^{[d]} denotes the study-specific correlation between each test-pair (denoted as t and t') for the augmented data (Z_{s,n,t} and Z_{s,n,t'}) - the polychoric correlation 16, 29 - which is not the same as the correlation between the observed data y_{s,n,t} and y_{s,n,t'}. Each Ψ_{s,t,t'}^{[d]} models the conditional dependence between each test-pair. However, we can assume conditional independence by setting Ψ_{s,t,t'}^{[0]} = 0 and Ψ_{s,t,t'}^{[1]} = 0.

We must ensure that the number of parameters being estimated by our model is not greater than what is possible for the given dataset; otherwise it may be non-identifiable, which means that the model will give misleading results. For example, it may estimate the sensitivity for a test to be equal to both 0.20 and 0.80. To ensure that our model is identifiable 12, 16, as in Xu et al 11, we set each within-study variance parameter to 1, so that each Ψ_s^{[d]} is a correlation matrix.

For dichotomous tests, the augmented data (Z_{s,n,t}) will be less than 0 for negative results (y_{s,n,t} = 0) or greater than 0 for positive results (y_{s,n,t} = 1), and the measures of test accuracy for a given study s are given by

Se_{s,t} = P(Z_{s,n,t} > 0 | d_{s,n} = 1) = Φ(ν_{s,t}^{[1]}),
Sp_{s,t} = P(Z_{s,n,t} < 0 | d_{s,n} = 0) = Φ(−ν_{s,t}^{[0]}),    (2)

where Φ(·) denotes the cumulative distribution function (CDF) of the standard normal distribution - that is, a normal distribution with mean 0 and standard deviation 1.
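As a concrete illustration of the data augmentation step just described, the R sketch below simulates the latent continuous responses Z for two dichotomous tests within the diseased class of a single study, then thresholds them at zero to recover binary results; the empirical sensitivities approximate Φ(ν^{[1]}). All parameter values are illustrative.

```r
# A minimal simulation of the data augmentation idea for two dichotomous
# tests in one study: draw latent Z from an MVN within the diseased class,
# then threshold at 0 to recover the observed binary results.
library(mvtnorm)
set.seed(1)

n_dis <- 5000                     # simulated diseased individuals
nu1   <- c(1.0, 0.6)              # illustrative latent means, diseased class (nu^[1])
Psi1  <- matrix(c(1.0, 0.4,       # within-study correlation matrix:
                  0.4, 1.0), 2)   # variances fixed to 1 for identifiability

Z <- rmvnorm(n_dis, mean = nu1, sigma = Psi1)  # augmented continuous data
y <- (Z > 0) * 1                               # observed results (1 = positive)

colMeans(y)   # empirical sensitivities
pnorm(nu1)    # theoretical Se = Phi(nu^[1]): 0.841 and 0.726
```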
For ordinal tests, the augmented data (Z_{s,n,t}) will belong to an interval defined by strictly increasing latent cutpoint parameters (C_{1,s,t}, ..., C_{K_t−1,s,t}, with C_{k−1,s,t} < C_{k,s,t} for k between 2 and K_t − 1). This interval will depend on the observed test result as follows: if the test result is below the first cutpoint (i.e. in the first category), then the augmented data will be less than the first cutpoint parameter; if it is above the last cutpoint (i.e., in the last category), then the augmented data will be greater than the last cutpoint parameter; otherwise, if the test result falls between two cutpoints (i.e., the test result belongs to any other category), then the augmented data will fall between the corresponding cutpoint parameters. The measures of test accuracy at cutpoint k are given by

Se_{s,t,k} = P(Z_{s,n,t} > C_{k,s,t}^{[1]} | d_{s,n} = 1) = Φ(ν_{s,t}^{[1]} − C_{k,s,t}^{[1]}),
Sp_{s,t,k} = P(Z_{s,n,t} < C_{k,s,t}^{[0]} | d_{s,n} = 0) = Φ(C_{k,s,t}^{[0]} − ν_{s,t}^{[0]}).    (3)

Now we will explain how we model the variation in test accuracy between studies - the between-study heterogeneity - as well as the correlation between the sensitivities and specificities across studies - the between-study correlation. It is important to bear in mind the distinction from the within-study correlations (defined in section 3.2), which model the conditional dependence between tests. For each test t, we will assume that the study-specific means ν_{s,t}^{[d]} follow a bivariate normal (BVN) distribution across studies, with pooled means µ_t^{[d]}, between-study standard deviations σ_t^{[d]}, and between-study correlation ρ_t. More specifically, ν_{s,t} = (ν_{s,t}^{[0]}, ν_{s,t}^{[1]})' ∼ BVN(µ_t, Σ_t), where

µ_t = (µ_t^{[0]}, µ_t^{[1]})',
Σ_t = ( (σ_t^{[0]})², ρ_t σ_t^{[0]} σ_t^{[1]} ; ρ_t σ_t^{[0]} σ_t^{[1]}, (σ_t^{[1]})² ).    (4)

The model described in equation 4 is known as a partial pooling model (using the terminology from Gelman & Hill 30 - otherwise known as 'random effects'). These models allow the study-specific accuracy parameters across studies to inform one another, without assuming full homogeneity like a full pooling (i.e., "fixed-effects") model would - which would allow no between-study variation in the means ν_{s,t}. The disease prevalences in each study, p_s, are modelled independently of each other, known as a no pooling model. There are several differences between partial pooling and no pooling models 31. For example, the former uses fewer parameters than no pooling, which means that parameter identifiability issues are less likely to arise. An advantage of our partial pooling model is that it allows us to summarise the results using the parameters which are shared across studies (see section 3.3), allowing us to more easily summarise test accuracy as well as the heterogeneity in accuracy between studies and the correlation between sensitivities and specificities. We can incorporate meta-regression covariates into the model by extending the partial pooling model defined in equation 4 above - see supplementary material 1, meta-regression section (section 1.2.1) for details. We can assume that a given test is a perfect gold standard by setting µ_t^{[0]} = −5 and µ_t^{[1]} = 5, which correspond to approximately 100% specificity and sensitivity, respectively, and by assuming a complete pooling model (i.e. setting σ_t^{[0]} = σ_t^{[1]} = 0).

We will model the within-study correlation matrices (Ψ_s^{[d]}) using a hierarchical partial pooling model, in which the study-specific correlation matrices are pulled towards global 'average' correlation matrices (Ω^{[d]}). Note that we can also model the conditional dependence between only certain pairs of tests, by setting the relevant terms for the other test-pairs to zero.

The cutpoint parameters can be modelled using an induced Dirichlet model, an approach which has been proposed by Betancourt 33. By taking advantage of the properties of a type of statistical distribution called a Dirichlet distribution, this model is able to map the latent cutpoint parameters in each study ({C_{k,s,t}}) defined in section 3.2 to a simplex (i.e., a vector whose elements sum to 1) of ordinal probabilities (P_{1,s,t}, ..., P_{K_t,s,t}). Each probability P_{k,s,t} corresponds to the probability that an individual's test result for test t falls in category k for study s. In this paper, we used a partial pooling model for the cutpoints across studies, enabling us to model the between-study heterogeneity in the cutpoints.
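To illustrate the mapping that the induced Dirichlet approach exploits, the R sketch below converts a latent mean and a set of strictly increasing cutpoints into the implied simplex of ordinal category probabilities, and evaluates sensitivity and specificity at each cutpoint using the ordered probit formulas in equation (3). The values are illustrative, and for simplicity the two classes share one set of cutpoints, unlike the class-specific cutpoints C^{[d]} in the full model.

```r
# Ordered probit building blocks: map a latent mean and strictly increasing
# cutpoints to the implied simplex of category probabilities, and evaluate
# accuracy at each cutpoint. All parameter values are illustrative.
ordinal_probs <- function(nu, C) {
  # P(y = k) = Phi(C_k - nu) - Phi(C_{k-1} - nu), with C_0 = -Inf, C_K = Inf
  diff(pnorm(c(-Inf, C, Inf), mean = nu))
}

C_s       <- c(-0.5, 0.8)  # K - 1 = 2 cutpoints => K = 3 categories
nu_dis    <- 1.2           # latent mean, diseased class
nu_nondis <- -0.3          # latent mean, non-diseased class

p_dis    <- ordinal_probs(nu_dis, C_s)     # a simplex: sums to 1
p_nondis <- ordinal_probs(nu_nondis, C_s)

se_k <- pnorm(nu_dis - C_s)      # Se at cutpoint k: Phi(nu^[1] - C_k)
sp_k <- pnorm(C_s - nu_nondis)   # Sp at cutpoint k: Phi(C_k - nu^[0])
```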
We can also obtain 'average' (global) cutpoints, C_{k,t}^{[d]}, from the between-study cutpoint model.

For dichotomous tests, the summary sensitivity and specificity estimates for test t are given by evaluating equation (2) at the means of the between-study model (see equation (4)). More specifically,

Se_{G,t} = Φ(µ_t^{[1]}),   Sp_{G,t} = Φ(−µ_t^{[0]}).

Similarly, for ordinal tests, the summary measures for test t at cutpoint k are given by evaluating equation (3) at the means of the partial pooling model (see equation (4)) and at the global (summary) cutpoints (C_{k,t}^{[0]} and C_{k,t}^{[1]}). That is,

Se_{G,t,k} = Φ(µ_t^{[1]} − C_{k,t}^{[1]}),   Sp_{G,t,k} = Φ(C_{k,t}^{[0]} − µ_t^{[0]}).

We can generate predictions for a 'new' (S+1)th study by simulating a draw (at each iteration of the parameter sampler) from the posterior predictive distribution of the between-study normal hierarchical model (see equation (4)), giving ν_{S+1,t}, and a new vector of cutpoints, C_{S+1,t}, from the between-study cutpoint model (for more details, see section 1.2.4 in supplementary material 1). Then, the predicted sensitivities and specificities for an (S+1)th study are given by Se_{S+1,t} = Φ(ν_{S+1,t}^{[1]}) and Sp_{S+1,t} = Φ(−ν_{S+1,t}^{[0]}) for dichotomous tests, and by Se_{S+1,t,k} = Φ(ν_{S+1,t}^{[1]} − C_{k,S+1,t}^{[1]}) and Sp_{S+1,t,k} = Φ(C_{k,S+1,t}^{[0]} − ν_{S+1,t}^{[0]}) for ordinal tests.

The summary estimates for the joint test accuracy of tests t and t' at cutpoints k and k' are given by

Se_{G,tt',kk'} = Se_{G,t,k} · Se_{G,t',k'} + Σ_{G,tt',kk'}^{[1]},
Sp_{G,tt',kk'} = Sp_{G,t,k} · Sp_{G,t',k'} + Σ_{G,tt',kk'}^{[0]},    (5)

where Σ_{G,tt',kk'}^{[d]} is the global conditional covariance between each test-pair. Obtaining these covariances requires us to calculate the global conditional correlations between each test-pair, which we will denote as ρ_{G,tt',kk'}; these are correlations on the same scale as the observed data. However, our model is parameterised in terms of the polychoric correlations (Ω_{G,tt'} - see equation (1) in section 3.2). Therefore, in order to be able to estimate the joint test accuracy estimates in equation 5, we need to convert from Ω_{G,tt'} to ρ_{G,tt',kk'}. For details on how this is achieved, please refer to section 1.2.4 of supplementary material 1.

For our MVP-LC model, we can check how well the model predicts the data by using a technique called posterior predictive checking, where we generate data from the model and compare it to the observed data. For example, we can plot the model-predicted test results against the observed test results for each test-pair 5, 4, 6. We also assessed model fit by plotting the model-predicted within-study correlations against the observed within-study correlations, using the correlation residual plot proposed by Qu et al 13. For model comparison, we used leave-one-out (LOO) cross-validation 34 - an iterative procedure which removes part of the data, re-fits the model, and assesses how well the model predicts the held-out data. For more details on model comparison and posterior predictive checking, including relevant formulae, please refer to section 1.3 in supplementary material 1.

We implemented the models in R 35 using the probabilistic programming language Stan 36, 37 via the R package CmDStanR 38, using a PC with 32GB of RAM and an AMD Ryzen 3900X 12-core CPU running Linux Mint OS. To code the model in Stan, we extended the code for a standard binary multivariate probit model 32; this is described in detail in Goodrich 2017 39 and is summarised in supplementary material 5. We implemented the between-study partial pooling model for the within-study correlations described in section 3.3.1 in Stan by using the function provided by Stephen Martin and Ben Goodrich 40. For the cutpoint between-study model, we used Betancourt's induced Dirichlet model 33 described in section 3.3.2; this is described in more detail in supplementary material 4, and was implemented using code by Betancourt 33.
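To make the joint test accuracy calculation in equation (5) concrete, the R sketch below computes the summary joint sensitivity of two dichotomous tests directly from the latent bivariate normal using mvtnorm::pmvnorm, and shows the equivalent 'product of marginals plus conditional covariance' form; all parameter values are illustrative.

```r
# Summary joint sensitivity for two dichotomous tests, computed from the
# latent bivariate normal, plus its equivalent covariance form.
# All values are illustrative.
library(mvtnorm)

mu_dis <- c(1.0, 0.6)   # summary latent means, diseased class
omega  <- 0.4           # polychoric correlation of (Z_t, Z_t'), diseased class
R      <- matrix(c(1, omega, omega, 1), 2)

# BTN rule (positive iff both positive): P(Z_1 > 0, Z_2 > 0 | diseased)
se_btn <- pmvnorm(lower = c(0, 0), upper = c(Inf, Inf),
                  mean = mu_dis, corr = R)[1]

# Equivalent form: product of marginal sensitivities plus the conditional
# covariance on the observed (binary) scale - not the polychoric scale.
se_marg <- pnorm(mu_dis)
cov_obs <- se_btn - prod(se_marg)

# BTP rule (negative iff both negative): Se = 1 - P(both - | diseased)
se_btp <- 1 - pmvnorm(lower = c(-Inf, -Inf), upper = c(0, 0),
                      mean = mu_dis, corr = R)[1]
```

The gap between se_btn and prod(se_marg) is exactly the observed-scale conditional covariance, which is why ignoring conditional dependence biases joint accuracy estimates.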
We ran all models using 4 chains until the split R-hat statistic was less than 1.05 for all parameters and the number of effective samples was satisfactory for all parameters 41. We only reported results when we obtained no warnings for divergent transitions or the energy fraction of missing information (E-FMI), important diagnostics for geometric ergodicity 37. We used the CmDStanR diagnostic utility to check all of the aforementioned model diagnostics 38. We also inspected trace plots and plotted the posterior distributions to check that they were not bimodal.

Rather than using Φ(·), which is prone to numerical instability, we can use the closely resembling logistic function, Φ*(x) = 1 / (1 + e^{−1.702·x}), which has an absolute maximum deviation from Φ(·) of 0.0095. This is the same probit approximation used for the meta-analysis of dichotomous tuberculosis tests using latent trait models in Sadatsafavi et al 8.

The data, Stan model code, and R code to reproduce the results and figures for the case study application in section 4 are provided at https://github.com/CerulloE1996/dta-ma-mvp-1.

Since our model is Bayesian, we must formulate a prior model - that is, specify prior distributions for the model parameters defined in section 3. We describe this prior model in section 4.1. When faced with the task of analysing a dataset with an imperfect gold standard which contains test accuracy data from an ordinal test, in order to apply previously proposed methods for meta-analysis without assuming a gold standard 5, 7, one must first dichotomise the data at each cutpoint and conduct a series of stratified analyses. We applied this a priori dichotomisation technique using our proposed MVP model in section 4.2. Finally, in section 4.3, we applied the models proposed in section 3 without dichotomising the Wells score. In section 4.1, we will index the gold standard (ultrasound), the D-Dimer, and the Wells score by t = 1, t = 2, and t = 3, respectively. In sections 4.2 and 4.3 we will denote summary estimates as "X [Y, Z]", where X is the posterior median and [Y, Z] is the 95% posterior interval.

For the summary-level accuracy parameters of the D-Dimer and the Wells score (i.e., µ_{t=2}^{[d]} and µ_{t=3}^{[d]}), we specified priors conveying very little information - equivalent to assuming a 95% prior interval of (0.04, 0.96) for the sensitivities and specificities; for the gold standard test (ultrasound, i.e., µ_{t=1}^{[d]}), we used informative priors constructed from subject-matter knowledge of its accuracy 24 (see section 5.3 and supplementary material 1). For the between-study deviation parameters for all three tests (i.e., σ_t^{[d]} for t = {1, 2, 3} - see equation (4) in section 3.3), we used weakly informative priors corresponding to a 95% prior interval of (0.02, 1.09). These priors are weakly informative since they weakly pull the study-specific sensitivities and specificities towards each other, whilst allowing for large between-study heterogeneity if the data demand it. For example, if 0.8 is the value found for the summary sensitivity, and the data suggest a standard deviation equal to 2 (corresponding to a high degree of between-study heterogeneity), then these priors would allow the study-specific sensitivities and specificities to be in the interval (0.44, 0.97) with 95% probability. We also used weak priors for the between-study correlation parameters for all tests (i.e., ρ_t for t = {1, 2, 3} - see equation (4) in section 3.3), corresponding to a 95% prior probability interval of (−0.82, 0.82).
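To show how prior intervals such as those quoted above arise on the probability scale, here is a minimal prior predictive sketch using the logistic approximation to Φ described earlier; the N(0, 1) prior on the summary accuracy parameter is an assumption chosen for illustration, and it roughly reproduces the stated (0.04, 0.96) interval.

```r
# Prior predictive sketch: how a normal prior on a summary accuracy
# parameter translates to the probability scale via the logistic
# approximation to Phi. The N(0, 1) prior is an illustrative assumption.
set.seed(123)
Phi_approx <- function(x) plogis(1.702 * x)  # logistic approximation to Phi

mu_draws <- rnorm(1e5, mean = 0, sd = 1)
quantile(Phi_approx(mu_draws), c(0.025, 0.975))
#> roughly 0.03 and 0.97 on the accuracy scale

# Maximum absolute deviation of the approximation from the exact probit:
x <- seq(-8, 8, by = 1e-4)
max(abs(Phi_approx(x) - pnorm(x)))   # ~0.0095, as stated in the text above
```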
Finally, for conditional dependence models, for the within-study correlation parameters (see section 3.2 and equation 1), we used priors which correspond to 95% prior intervals of (−0.65, 0.65) for both of the global 'average' correlation matrices (Ω^{[0]} and Ω^{[1]}; see section 3.3.1).

We consider two dichotomisations of the Wells score. For the first, we dichotomised the Wells score by grouping together those patients who obtained a score of 'low' or 'moderate' as a negative result, and those who scored 'high' as positive. For the second dichotomisation, we grouped together patients who scored 'moderate' or 'high' and considered this a positive result, and those who scored 'low' as a negative result. We will refer to the former dichotomisation as "low + moderate vs high" and the latter as "low vs moderate + high". We applied this technique to the dataset, to allow comparison with our "full" model, using the models proposed in section 3, fitting both conditional independence (CI) and dependence (CD) models; the results are shown in figure 1.

The differences in the results were similar when modelling conditional dependence between the three tests (see figure 1). As with conditional independence, the specificity of the ultrasound and the sensitivity of the D-Dimer were similar between the two dichotomisations. We can also see that the estimates of disease prevalence increase for most studies for the low vs moderate + high dichotomisation relative to the low + moderate vs high dichotomisation, for both conditional independence (left panel of figure 2) and dependence models (right panel of figure 2). Overall, regardless of whether we assume conditional independence or dependence, some of the accuracy estimates change notably depending on how we dichotomise the Wells score. This is not surprising, since imperfect gold standard models based on latent class analysis utilise the full distribution of test responses from all tests to estimate accuracy and disease prevalence 4. This simple example demonstrates the importance of modelling all the available data for ordinal, non-dichotomous tests, such as the Wells score, in the presence of an imperfect gold standard, as opposed to simply conducting multiple stratified analyses at each cutpoint of the ordinal test using simpler methods. This observation motivates the use of ordinal regression within the models, to appropriately model the ordinal nature of the Wells score.

Now we fit the models without dichotomising the Wells score, by simultaneously modelling all three categories. For these models, we used weakly informative priors of µ_{t=3}^{[d]} ∼ N(0, 1) for the mean parameters for the Wells test. We used the partial pooling model on the Wells score cutpoint parameters (see section 3.3.2). For the Dirichlet population parameters, we used a weakly informative prior, κ^{[d]} ∼ N_{≥0}(0, 50). This allows considerable asymmetry in the Dirichlet population vector α_k, as can be seen from the prior predictive check (see figure 1 in section 1.3 in supplementary material 1). The rest of the priors were the same as those discussed in section 4.1.
We fit the following models: one assuming that ultrasound is a perfect gold standard and conditional independence between all three tests (M1); the same model but modelling the conditional dependence between the Wells score and D-Dimer (M2); a model assuming ultrasound to be an imperfect gold standard and conditional independence between all three tests (M3); and a variation of M3 which modelled the conditional dependence between all three tests (M4). The results for the summary sensitivity and specificity estimates for the four models are shown in figure 3, and the results for each of the Wells score strata are shown in figure 4.

The estimates for the two models assuming a perfect gold standard (M1 and M2) are within 2% of those obtained by Novielli et al 17. The similarity of the results is not surprising, since despite using different models and different link functions (logit vs approximate probit), both models assume that ultrasound is perfect. For both the conditional independence (M1) and dependence (M2) models which assumed that ultrasound is a perfect reference test, the results we obtained for the accuracy of the BTP and BTN testing strategies for the Wells & D-Dimer tests were similar to those obtained by Novielli et al 17. More specifically, for the BTP testing strategy, we found that the summary specificity estimates for M1 (33 [25, 41]) were around 8% lower than those for M2 (41 [32, 50]) (see figure 3).

Whilst assuming conditional dependence, the model assuming ultrasound is perfect (M2) estimated the specificity of the Wells score to be around 6% lower than the model assuming an imperfect gold standard (M4) (52 [42, 63] and 58 [44, 72] for M2 and M4, respectively). We also found that the specificity of the D-Dimer was around 9% lower (63 [51, 73] and 72 [59, 83] for M2 and M4, respectively), that the specificity was around 5% lower for the Wells & D-Dimer BTP testing strategy (41 [32, 50] and 46 [34, 58] for M2 and M4, respectively), and that the specificity of the BTN testing strategy was around 10% lower (74 [65, 82] and 84 [75, 92] for M2 and M4, respectively).

The summary receiver operating characteristic plot for M4 is shown in figure 5. The prediction regions suggest that there is substantial between-study heterogeneity in the sensitivity and specificity for most estimates. However, we found relatively narrow prediction regions for the specificity of ultrasound and for the Wells and D-Dimer BTP testing strategy, so we can be more confident in generalising our inferences for these estimates.

Our proposed MVP-LC model addresses the novel problem of carrying out a meta-analysis of two or more conditionally dependent tests when there is no perfect gold standard, for the case where there are both ordinal and dichotomous tests under evaluation and estimation of joint test accuracy is of interest. Using the case study as a demonstrative aid for the model (see section 4), we showed why treating ordinal tests as dichotomous in the context of an imperfect gold standard is suboptimal (see section 4.2). When we modelled the Wells test as ordinal and treated ultrasound as a perfect gold standard (see section 4.3), our findings replicated the summary estimates from Novielli et al 17. However, we found that the most complex model - which treated ultrasound as an imperfect gold standard in addition to modelling the conditional dependence between tests - had the best fit to the data.
For this model, our estimates of test accuracy differed considerably from the other models we fitted (which gave worse fit to the data) and from the results obtained in the analysis conducted by Novielli et al 17. In particular, we obtained considerably different estimates of specificity for both the D-Dimer and the Wells score when used alone, and for the joint specificity of the Wells and D-Dimer BTN testing strategy. However, the large between-study heterogeneity limits the generalisability of our results.

The methods we have developed in this paper have a wide scope of applicability in clinical practice, beyond DVT. For instance, Hamza et al 42 re-analysed a meta-analysis 43 which assessed the accuracy of the CAGE questionnaire 44 - a 4-category ordinal test used as a screening tool to detect individuals who may be suffering from alcoholism. However, their model assumed a perfect gold standard 1. Our proposed MVP-LC model could be used to more appropriately estimate the accuracy of the CAGE questionnaire, since we would not need to assume that the reference test in each study is perfect.

The methods could also be used to more appropriately assess joint testing strategies. For instance, current UK Health Security Agency guidance 45 states that individuals who have symptoms suggestive of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and test positive using lateral flow tests (LFTs) should be considered positive and require no subsequent testing. On the other hand, it states that individuals with negative LFTs should be assessed with a polymerase chain reaction (PCR) test, with only those who also test negative on PCR being considered negative. Our methods could be used to investigate this joint testing strategy whilst modelling PCR as an imperfect gold standard, which is particularly important with respect to its sensitivity.

For depression screening, one potential BTN testing strategy is one in which individuals undertake a very brief 2-item version of the Patient Health Questionnaire (PHQ-2 46) followed by the 9-item version (PHQ-9 47). This was investigated recently by Levis et al 48; however, they assumed perfect gold standards, and they only used around half of the available studies, since they discarded those studies which used inferior gold standards. Our MVP-LC model could be used to analyse these data without assuming a perfect gold standard, using all of the available data, whilst accounting for the differences in gold standards between studies using meta-regression (see section 3.3).

Our proposed Bayesian MVP-LC model addresses some important limitations of models based on TLCMs 4, 49, 5, 6, 7. For example, although TLCMs have fast run times due to being computationally inexpensive, and they can model the conditional dependence between tests 50, an important limitation is that, unlike our proposed MVP-LC model, they cannot appropriately model ordinal tests: if one wishes to simultaneously model ordinal tests whilst modelling conditional dependence, one would first need to dichotomise the data a priori. As we showed in section 4.2, this is suboptimal in the context of an imperfect gold standard, since the test accuracy and disease prevalence estimates varied depending on which cutpoint we dichotomised the data at.
A limitation of the TLCMs which have been proposed for meta-analysis 5, 6, 7 is that, unless one assumes a complete pooling model between studies, it is not possible to estimate summary correlation parameters - parameters which are required to estimate summary-level joint test accuracy. This is because, in contrast to our MVP-LC model (which uses the within-study correlations), TLCMs model the conditional dependence using the within-study covariances, making it difficult to construct a partial pooling model for the within-study conditional dependence parameters. These covariances have bounds based on the sensitivity and specificity parameters in each study 49, 51. Therefore, any summary-level covariance parameters obtained would be questionable.

Our MVP-LC model also has advantages over more advanced models for the meta-analysis of test accuracy, such as the model proposed by Sadatsafavi et al 8, which is also based on multivariate probit regression and is an extension of the latent trait model 13. Two important limitations of this model - not present in our MVP-LC model - are that it can only model dichotomous data, and that it assumes the within-study correlations are fixed across studies. Furthermore, since our proposed MVP-LC model is an extension of the single-study model proposed by Xu et al 11, another benefit over the model by Sadatsafavi et al 8 is that it can also be used to specify more general correlation structures (by setting certain correlations to zero - see section 3.2).

The fact that our model is Bayesian means that one can incorporate subject-matter knowledge into the model, as we did for our case study. Furthermore, the induced Dirichlet partial pooling model 33 (see section 3.3.2 and supplementary material 4) for the ordinal tests makes it possible to specify priors for ordinal tests and obtain summary estimates. When applying the model to our case study dataset (see section 2), we used available subject-matter knowledge 24 to construct informative prior distributions for the gold standard test (ultrasound), and weakly informative priors for the other parameters (see section 4.1 and supplementary material 1). Attempts to conduct sensitivity analysis using more diffuse priors led to diagnostic errors. This is likely because Stan is quite sensitive at detecting non-identifiabilities in the posterior distribution 36, and non-identifiability is more likely to occur with less informative priors, particularly for latent class models, due to the large number of parameters relative to the data.

Another limitation of our case study analysis is that, although our model can easily incorporate meta-regression coefficients (see supplementary material 1), the case study dataset did not contain any study-level covariates, since the primary studies did not report sufficient data. Had such data been available, a more principled analysis could have been carried out by using a meta-regression covariate for the proportion of patients with proximal versus distal DVT, which would have enabled us to model the variation in ultrasound sensitivity between the two DVT groups in Novielli et al's 17 data.

A limitation of our model, which is shared by all imperfect gold standard methods based on latent class models (including TLCMs), is that full cross-classification tables (i.e. the full distribution of test results) are required for each study.
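To clarify what a 'full cross-classification table' is in practice, the R sketch below builds one from hypothetical patient-level results for three dichotomous tests; it is this full joint distribution of response patterns per study, rather than the per-test marginal 2×2 tables, that latent class methods require. The data are simulated purely for illustration.

```r
# What a 'full cross-classification table' is: the joint counts of all
# 2^3 = 8 response patterns for three dichotomous tests in one study,
# rather than three separate marginal 2x2 tables. Data are hypothetical.
set.seed(42)
n   <- 200
dat <- data.frame(test1 = rbinom(n, 1, 0.4),
                  test2 = rbinom(n, 1, 0.5),
                  test3 = rbinom(n, 1, 0.3))

xtab <- table(dat$test1, dat$test2, dat$test3,
              dnn = c("test1", "test2", "test3"))
ftable(xtab)   # the full joint distribution latent class methods require
```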
This is a potential barrier to the uptake of our proposed MVP-LC model, as such data are frequently not reported in studies evaluating 3 or more tests and/or studies assessing ordinal tests.

One way in which we could have assessed the general performance of our MVP-LC model is by running a simulation study 52. A simulation study comparing our proposed MVP-LC model to other models would also be very useful. However, it is important to note that, at the time of writing, no other models have been proposed to simultaneously meta-analyse both dichotomous and ordinal tests without assuming a perfect gold standard. That being said, a simulation study would still be useful, since we could compare the performance of our model to other proposed models which do assume a perfect gold standard (e.g., Novielli et al 17) under a variety of different scenarios.

Although our proposed MVP-LC model offers considerable benefits in comparison to the more commonly used TLCM models 5, 6, 7, 50 (see section 5.3), we found that it was considerably less time efficient than TLCMs. Although this was not prohibitive for the case study used in this paper, our MVP-LC model may be intractable for larger sample sizes. Speeding up models based on augmented continuous data, such as our MVP-LC model, is an active area of research 53, 54, 55, 56, 57, 58, 59, 60. An important area for future research would be to apply the models developed in this paper using these more efficient algorithms, which would make our proposed MVP-LC model more suitable for general use with larger meta-analyses, and would also make it easier to conduct more meaningful simulation studies.

Models for the meta-analysis of test accuracy which can incorporate patient-level covariates - otherwise known as individual patient data (IPD) - have been proposed 61, but only for dichotomous data, and they assume a perfect gold standard 61. Modelling IPD can lead to results which are more applicable to clinical practice, as they can more easily be applied to patients when there is between-study heterogeneity, rather than only providing summary estimates which relate to some "average" patient. Extending our model to incorporate IPD would be relatively straightforward, since our model uses the patient-level data (as reconstructed from the reported contingency tables) as opposed to aggregated data for each study.

It is straightforward to extend our model to the case where not all studies assess the same number of tests, using direct comparisons only. This could be further extended to allow indirect comparisons (network meta-analysis [NMA]) by assuming tests are missing at random (MAR) 62 and extending the between-study model described in section 3.3 to an arm-based network meta-analysis model 63, 64. Another straightforward modelling extension would be to incorporate data from ordinal tests which have missing data for some categories. Our model could also be extended to synthesise data from ordinal tests for the case where some (or all) studies do not report data for every cutpoint - which is common in practice. One could formulate such a 'missing cutpoint' version of our MVP-LC model by extending the partial pooling between-study cutpoint model (see section 3.3.2 and supplementary material 4), and viewing the cutpoints as MAR.
Another possible 'missing cutpoint' model could be constructed by modelling the cutpoint parameters as the same in the diseased and non-diseased classes, and estimating them independently in each study (i.e., a no pooling model). Then, as opposed to our MVP-LC model, in which the within-study variances are set to 1 to ensure parameter identifiability (see section 3.2), the no pooling cutpoint model would allow us to introduce within-study variance parameters and model them using a partial pooling model without encountering significant identifiability issues. These within-study variance parameters could be allowed to vary between the two latent classes, which would result in a smooth, non-symmetric receiver operating characteristic (ROC) curve. Another possible 'missing cutpoint' approach would be one based on the model proposed by Dukic et al 65, which assumes a perfect gold standard. This model also results in a smooth, non-symmetric ROC curve, since it assumes that the cutpoints vary between studies and are the same in the diseased and non-diseased classes. However, it would be more parsimonious, since it assumes that the sensitivity is some location-scale change of the false positive rate.

For the case where studies report thresholds at explicit numerical cutpoints (as is sometimes reported for continuous tests, such as biomarkers), some 'missing threshold' methods which assume a perfect gold standard have been proposed 66, 67. Rather than modelling the cutpoints as parameters, these methods assume that the cutpoints are constants, equal to the value of the numerical cutpoint, and they estimate separate location and scale parameters in each study and disease class. Our MVP-LC model could be extended to achieve this without assuming a gold standard.

An important area for future research would be to construct other models which can be used for the same purposes as our proposed MVP-LC model. For instance, a multivariate logistic regression model could be constructed using the Bayesian multivariate logistic distribution proposed by O'Brien et al 68. Such a model would use logistic link functions, as opposed to the probit (or approximate probit) links of our MVP-LC model; logistic links are more numerically stable than probit links and may give better fit to some datasets. Another multivariate regression approach would be to use copulas 69, 70, 71. Besides multivariate regression based on augmented data, another approach to modelling conditionally dependent ordinal diagnostic tests without assuming a perfect gold standard is log-linear models 11. These models can account for higher-order correlations 11. However, this requires the estimation of additional parameters, so it is likely to introduce identifiability issues. Similarly to the multivariate probit models utilised in this paper, it may be possible to extend these models to meta-analyse multiple, imperfect diagnostic tests with multiple cutpoints.
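To illustrate the location-scale assumption mentioned above in connection with the Dukic et al 65 style of model, here is a minimal R sketch of the standard binormal ROC form, in which sensitivity is a location-scale change of the false positive rate; the parameters a and b are illustrative, and this is a sketch of the general form rather than of that model's exact specification.

```r
# The location-scale ('binormal') ROC form: sensitivity is a
# location-scale change of the false positive rate,
#   Se = Phi(a + b * qnorm(FPR)).
# a and b are illustrative; b != 1 gives a smooth, non-symmetric ROC curve.
a <- 1.2    # location: separation between the two latent classes
b <- 0.7    # scale: ratio of latent standard deviations

fpr <- seq(1e-4, 1 - 1e-4, length.out = 500)
se  <- pnorm(a + b * qnorm(fpr))

plot(fpr, se, type = "l",
     xlab = "False positive rate (1 - specificity)", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # chance diagonal for reference
```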
What is already known?: Standard, well-established methods exist for synthesising estimates of test accuracy (i.e., conducting a meta-analysis). These methods estimate test accuracy by comparing test results to some test which is assumed to be perfect - referred to as a 'gold standard' test. However, in clinical practice, these reference tests are often imperfect, which can cause estimates for the tests being evaluated to be biased, and can potentially lead to the wrong test being used in clinical practice. Meta-analytic methods which do not assume a gold standard have previously been proposed, but only for dichotomous tests.

What is new?: We developed a model which allows one to simultaneously meta-analyse ordinal and dichotomous tests without assuming a gold standard. The model also allows one to obtain summary estimates of the accuracy of two tests used in combination (i.e. joint test accuracy).

Potential impact for Research Synthesis Methods readers outside the authors' field: The methods are widely applicable. For instance, psychometric measures and radiologic tests are typically ordinal, and the studies assessing these tests often do not use a gold standard; hence, applying standard models to these datasets may lead to misleading conclusions. The methods we proposed may lead to less biased accuracy estimates, and hence potentially a better understanding of which tests, and test combinations, should be used for these conditions.

References

1. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews
2. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations
3. A unification of models for meta-analysis of diagnostic accuracy studies
4. Estimating the Error Rates of Diagnostic Tests
5. Random effects models in a meta-analysis of the accuracy of two diagnostic tests without a gold standard
6. Bayesian meta-analysis of diagnostic tests allowing for imperfect reference standards
7. Bayesian Meta-Analysis of the Accuracy of a Test for Tuberculous Pleuritis in the Absence of a Gold Standard Reference
8. A statistical method was used for the meta-analysis of tests for latent TB in the absence of a gold standard, combining random-effect and latent-class methods to estimate test accuracy
9. Statistical methods for the meta-analysis of diagnostic tests must take into account the use of surrogate standards
10. A probit latent class model with general correlation structures for evaluating accuracy of diagnostic tests
11. Evaluating accuracy of diagnostic tests with intermediate results in the absence of a gold standard
12. Probit Latent Class Analysis with Dichotomous or Ordered Category Measures: Conditional Independence/Dependence Models
13. Random Effects Models in Latent Class Analysis for Evaluating Accuracy of Diagnostic Tests
14. A model for evaluating sensitivity and specificity for correlated diagnostic tests in efficacy studies with an imperfect reference test
15. Bayesian Analysis of Binary and Polychotomous Response Data
16. Econometric analysis, 7th Ed
17. Meta-analysis of the accuracy of two diagnostic tests used in combination: Application to the D-dimer test and the Wells score for the diagnosis of deep vein thrombosis
18. Deep vein thrombosis: pathogenesis, diagnosis, and medical management
19. Diagnosis, investigation, and management of deep vein thrombosis
20. Deep vein thrombosis
21. McMaster Diagnostic Imaging Practice Guidelines Initiative
22. ACR Appropriateness Criteria(®) on suspected lower extremity deep vein thrombosis
23. Diagnosis and Treatment of Lower Extremity Deep Vein Thrombosis: Korean Practice Guidelines
24. Systematic review and meta-analysis of the diagnostic accuracy of ultrasonography for deep vein thrombosis
25. Accuracy of diagnostic tests for clinically suspected upper extremity deep vein thrombosis: A systematic review
26. Value of assessment of pretest probability of deep-vein thrombosis in clinical management
27. Venous thromboembolism
28. Evaluating the cost-effectiveness of diagnostic tests in combination: is it important to allow for performance dependency?
29. A Generalized Definition of the Polychoric Correlation Coefficient
30. Data Analysis Using Regression and Multilevel/Hierarchical Models
31. Hierarchical Modeling
32. A better parameterization of the multivariate probit model
33. Ordinal Regression
34. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC
35. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
36. Stan: A probabilistic programming language
37. A Conceptual Introduction to Hamiltonian Monte Carlo
38. Rok Češnovar et al. CmDStanR: A lightweight interface to Stan for R users
39. Truncated Multivariate Normal Variates in Stan
40. Hierarchical prior for partial pooling on correlation matrices
41. Stan Modeling Language Users Guide and Reference Manual
42. Multivariate random effects meta-analysis of diagnostic tests with multiple thresholds
43. The value of the CAGE in screening for alcohol abuse and alcohol dependence in general clinical populations: a diagnostic meta-analysis
44. Detecting alcoholism. The CAGE questionnaire
45. Updated UK Health Security Agency guidance - confirmatory PCR tests to be temporarily suspended for positive lateral flow test results
46. The Patient Health Questionnaire-2: validity of a two-item depression screener
47. The PHQ-9: validity of a brief depression severity measure
48. Accuracy of the PHQ-2 Alone and in Combination With the PHQ-9 for Screening to Detect Major Depression: Systematic Review and Meta-analysis
49. The Effect of Conditional Dependence on the Evaluation of Diagnostic Tests
50. Modeling conditional dependence among multiple diagnostic tests
51. Bayesian Approaches to Modeling the Conditional Dependence Between Multiple Diagnostic Tests
52. Using simulation studies to evaluate statistical methods
53. Variational Inference: A Review for Statisticians
54. Scaling up Data Augmentation MCMC via Calibration
55. Transport Monte Carlo
56. Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond
57. Density estimation using Real NVP
58. Masked Autoregressive Flow for Density Estimation
59. Variational Inference with Normalizing Flows
60. Robust, Accurate Stochastic Optimization for Variational Inference
61. Meta-analysis of diagnostic test studies using individual patient data and aggregate data
62. Inference and Missing Data
63. A Bayesian hierarchical model for network meta-analysis of multiple diagnostic tests
64. ANOVA model for network meta-analysis of diagnostic test accuracy data
65. Meta-analysis of Diagnostic Test Accuracy Assessment Studies with Varying Number of Thresholds
66. Quantifying how diagnostic test accuracy depends on threshold in a meta-analysis
67. Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies
68. Bayesian Multivariate Logistic Regression
69. Copula bivariate probit models: with an application to medical expenditures
70. Dynamic copula based multivariate discrete choice models with applications
71. The Bivariate Normal Copula

The authors would like to thank Elpida Vounzoulaki for proofreading the manuscript. The authors would also like to thank various members of the Stan community forums (see https://discourse.mc-stan.org/), including Ben Goodrich, Michael Betancourt, Stephen Martin, Staffan Betnér, Martin Modrák, Niko Huurre, Bob Carpenter and Aki Vehtari, for providing functions which were utilised in the models and for useful discussions.

The data, R and Stan code to reproduce the results and figures from section 4 are available on GitHub at: https://github.com/CerulloE1996/dta-ma-mvp-1.