key: cord-0019976-6adcei2b
authors: Bernhardt, Lizelle; Brady, Emer M.; Freeman, Suzanne C.; Polmann, Helena; Réus, Jéssica Conti; Flores-Mir, Carlos; De Luca Canto , Graziela; Robertson, Noelle; Squire, Iain B.
title: Diagnostic accuracy of screening questionnaires for obstructive sleep apnoea in adults in different clinical cohorts: a systematic review and meta-analysis
date: 2021-08-18
journal: Sleep Breath
DOI: 10.1007/s11325-021-02450-9
sha: f33a996c6310f9445849a35d249b8458ef254f9a
doc_id: 19976
cord_uid: 6adcei2b

PURPOSE: The majority of individuals with clinically significant obstructive sleep apnoea (OSA) are undiagnosed and untreated. A simple screening tool may support risk stratification, identification, and appropriate management of at-risk patients. Therefore, this systematic review and meta-analysis evaluated and compared the accuracy and clinical utility of existing screening questionnaires for identifying OSA in different clinical cohorts. METHODS: We conducted a systematic review and meta-analysis of observational studies assessing the diagnostic value of OSA screening questionnaires. We identified prospective studies, validated against polysomnography, and published to December 2020 from online databases. To pool the results, we used random effects bivariate binomial meta-analysis. RESULTS: We included 38 studies across three clinical cohorts in the meta-analysis. In the sleep clinic cohort, the Berlin questionnaire’s pooled sensitivity for apnoea-hypopnoea index (AHI) ≥ 5, ≥ 15, and ≥ 30 was 85%, 84%, and 89%, and pooled specificity was 43%, 30%, and 33%, respectively. The STOP questionnaire’s pooled sensitivity for AHI ≥ 5, ≥ 15, and ≥ 30 was 90%, 90%, and 95%, and pooled specificity was 31%, 29%, and 21%. The pooled sensitivity of the STOP-Bang questionnaire for AHI ≥ 5, ≥ 15, and ≥ 30 was 92%, 95%, and 96%, and pooled specificity was 35%, 27%, and 28%. In the surgical cohort (AHI ≥ 15), the Berlin and STOP-Bang questionnaires’ pooled sensitivity were 76% and 90% and pooled specificity 47% and 27%. CONCLUSION: Among the identified questionnaires, the STOP-Bang questionnaire had the highest sensitivity to detect OSA but lacked specificity. Subgroup analysis considering other at-risk populations was not possible. Our observations are limited by the low certainty level in available data. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11325-021-02450-9.

With an estimated 425 million individuals affected worldwide, clinically important obstructive sleep apnoea (OSA) poses a global public health problem [1]. Characterised by upper airway collapse, exaggerated negative intrathoracic pressure, oxidative stress, and systemic inflammation, OSA is associated with significant cardiovascular and metabolic complications, including hypertension, stroke, heart failure, and diabetes [2-7].

Despite the high prevalence and associated sequelae, most individuals with OSA remain undiagnosed, posing a significant risk to the individual patient and health care systems as complications develop [1, 8-10]. Barriers to the diagnosis and treatment of OSA are multifaceted and include geographical variation and inequity in the availability of sleep services and access to polysomnography (PSG), often limited by cost and long waiting times [11].

A simple and reliable screening tool may support risk stratification by helping to triage individuals at risk of OSA for referral to specialist services and appropriate management [12-14]. Clinical prediction formulae have been developed but are limited by complexity and the requirement for a computer or mathematical calculations [15]. In contrast, OSA screening questionnaires are less complicated and may be a viable alternative to clinical prediction formulae in specific settings.

To date, there have been four systematic reviews exploring the accuracy of OSA screening tools in adults [12, 16-18]. One of the first systematic reviews and meta-analyses to explore the accuracy of screening tools for OSA identified four screening questionnaires; however, due to heterogeneity pertaining to the questionnaire, OSA definition, and threshold, these were not meta-analysed [16]. Ramachandran [17] reported that clinical prediction models performed better than the eight questionnaires studied to predict OSA in pre-operative cohorts. Abrishami [12] focused on a 'sleep disorder' cohort and a cohort 'without a history of sleep disorders' and concluded that questionnaires were useful for early detection of OSA, especially in the surgical population. Although the authors found it difficult to draw a definite conclusion about questionnaire accuracy, they recommended the STOP and STOP-Bang questionnaires for screening in a surgical population [12]. Recently, Chiu [18] compared the diagnostic accuracy of the Berlin, STOP-Bang, STOP, and Epworth Sleepiness Scale. In line with Abrishami [12], they reported the STOP-Bang to have the highest sensitivity in both the sleep clinic and surgical populations.

Since the publication of these systematic reviews, new OSA screening questionnaires have emerged, further validation studies conducted, and different clinical settings and patient cohorts considered. As test performance often varies across clinical cohorts, it is recommended that tools are evaluated in clinically relevant cohorts [19] . Hence, the objective of this systematic review and meta-analysis was to evaluate the accuracy and clinical utility of existing questionnaires, when used alone, as screening tools for the identification of OSA in adults in different clinical cohorts.

The protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO) (CRD42018104018), and the review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [20].

We included observational studies that met the following eligibility criteria:

Inclusion criteria: (1) prospective studies measuring the diagnostic value of screening questionnaires for OSA; (2) studies in adults (> 18 years of age); (3) studies in which the accuracy of the questionnaire was validated by level one or two PSG; (4) OSA was defined as apnoea-hypopnoea index (AHI) or Respiratory Disturbance Index (RDI) > 5; (5) data allowed for construction of 2 × 2 contingency tables; (6) publication in English, Spanish, or Portuguese.

Exclusion criteria: (1) studies measuring the diagnostic value of clinical scales, scores, and prediction equations as screening tools for OSA; (2) conference proceedings, reviews, or case reports; (3) insufficient data for analysis after several attempts to contact the author; (4) studies in children (< 18 years of age); (5) level three and four portable studies were used as the reference standard; (6) studies conducted in in-patient settings; (7) publication language is other than English, Spanish, or Portuguese.

Index test: the test under evaluation was only OSA screening questionnaires (self-reported or clinician completed).

Reference standard: the reference standard was a level one or two PSG.

Target conditions: the target condition was OSA, defined by AHI or RDI at the following thresholds:

• AHI/RDI ≥ 5: diagnostic cut-off for OSA
• AHI/RDI ≥ 15: diagnostic cut-off for moderate to severe OSA
• AHI/RDI ≥ 30: diagnostic cut-off for severe OSA
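The three cut-offs are nested rather than mutually exclusive: a patient with an AHI of 32 meets all three target conditions. A minimal Python sketch of this mapping follows; the function name and dictionary keys are illustrative and are not taken from the review.

```python
def osa_target_conditions(ahi: float) -> dict:
    """Return which of the three target conditions an AHI/RDI value meets.

    The thresholds are nested, so a value >= 30 also satisfies the
    >= 15 and >= 5 conditions.
    """
    return {
        "all_osa": ahi >= 5,              # AHI/RDI >= 5
        "moderate_to_severe": ahi >= 15,  # AHI/RDI >= 15
        "severe": ahi >= 30,              # AHI/RDI >= 30
    }


if __name__ == "__main__":
    for value in (3.0, 17.2, 32.0):
        print(value, osa_target_conditions(value))
```

Because each threshold defines its own positive group, the same study can contribute a separate 2 × 2 table at each cut-off.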

Comprehensive literature searches in CINAHL PLUS, Scopus, PubMed, Web of Science, and the Latin American and Caribbean Health Sciences Literature (LILACS) database were conducted from inception to 18 December 2020. Detailed individual search strategies (Online Resource 1 & 2), with appropriate truncation and word combinations, were developed for each database. Additional records were identified from grey literature sources comprising ETHos, OpenGrey, Google Scholar, ProQuest, and New York Grey Literature Report. The reference lists from the final articles for analysis and related review articles were manually searched for references that could have been omitted during the electronic database searches.

Two reviewers (LB, EB) screened the titles and abstracts of the electronic search results independently to identify studies eligible for inclusion in the review. Records classified as 'excluded' by both reviewers were excluded. The full text of any study about which there was disagreement or uncertainty was assessed independently against the selection criteria and resolved through discussion and consultation with a third reviewer (IS or NR). Duplicates were identified and excluded before recording the selection process in sufficient detail to complete the PRISMA flow diagram and tables describing the characteristics of the excluded studies (Online Resource 3) [20] .

Two reviewers (LB, EB) independently conducted data extraction on all studies included and extracted the data required to reconstruct the 2 × 2 contingency tables, including true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values. Where these values were not documented, we extrapolated the values from equations when data allowed. A data collection form tailored to the research question and fulfilling the data entry requirements of MetaDTA (Diagnostic Test Accuracy Meta-Analysis v1.43) was utilised [21] .
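The review does not state which equations were used for this back-calculation. One common approach, shown here only as a sketch under that assumption, recovers the four cells from the sample size, prevalence, sensitivity, and specificity reported by a primary study. The function name and the example numbers are hypothetical.

```python
def reconstruct_2x2(n_total, prevalence, sensitivity, specificity):
    """Back-calculate (TP, FP, FN, TN) from summary statistics.

    prevalence, sensitivity and specificity are proportions in [0, 1].
    Counts are rounded to whole patients, so small rounding differences
    from the primary study's own table are possible.
    """
    diseased = round(n_total * prevalence)   # PSG-positive patients
    healthy = n_total - diseased             # PSG-negative patients
    tp = round(diseased * sensitivity)
    fn = diseased - tp
    tn = round(healthy * specificity)
    fp = healthy - tn
    return tp, fp, fn, tn


# Hypothetical study: 200 patients, 60% prevalence, sensitivity 90%, specificity 30%.
tp, fp, fn, tn = reconstruct_2x2(200, 0.60, 0.90, 0.30)
print(tp, fp, fn, tn)
```

Deriving FN and FP by subtraction, rather than rounding them independently, keeps the row totals consistent with the reported sample size.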

HP and JR extracted the study characteristics and demographic data for all included studies, and LB and EB entered the data into Review Manager 5.3 [22] .

No studies with inconclusive results were identified.

The quality of studies included was appraised independently by the reviewers (LB, EB) utilising the Quality Assessment for Diagnostic Accuracy Studies tool (QUADAS-2) with disagreements resolved through consultation with a third reviewer (IS or NR) [23] .

Statistical analysis was performed according to "Chapter 10" of the Cochrane Handbook for Systematic Review of Diagnostic Test Accuracy [24] . Questionnaire screening was considered positive for OSA if the questionnaire score was above the defined threshold specified in the primary study and negative if the questionnaire score was below the defined threshold. The TP, FP, TN, and FN results were produced by cross-classifying the questionnaire results with those of the PSG results. These were based on the ability of screening questionnaires to classify and detect OSA correctly.
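The cross-classification can be sketched in a few lines of Python. The per-patient data below are hypothetical, and the function names are illustrative rather than part of any tool used in the review.

```python
def cross_classify(index_positive, reference_positive):
    """Count (TP, FP, FN, TN) from paired per-patient results.

    index_positive: questionnaire result per patient (True = screen-positive
        at the threshold defined in the primary study)
    reference_positive: PSG result per patient (True = OSA at the chosen
        AHI threshold)
    """
    tp = fp = fn = tn = 0
    for idx, ref in zip(index_positive, reference_positive):
        if idx and ref:
            tp += 1
        elif idx and not ref:
            fp += 1
        elif not idx and ref:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical results for six patients.
questionnaire = [True, True, True, False, False, True]
psg = [True, True, False, True, False, False]

table = cross_classify(questionnaire, psg)
print(table, sensitivity_specificity(*table))
```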

The sensitivity and specificity of individual studies were calculated using 2 × 2 contingency tables and presented as forest plots. The meta-analysis was conducted using MetaDTA version 1.43, which models sensitivity and specificity by fitting the random effects bivariate binomial model of Chu and Cole [25, 26] . The summary receiver operating characteristic (SROC) plot was drawn using the hierarchical SROC parameters, which are estimated from the bivariate model parameters using the equivalence equations of Harbord [27] . Following guidance from the Cochrane Handbook for Systematic Review of Diagnostic Test Accuracy, we did not pool the positive and negative predictive values due to the prevalence of OSA varying across studies [24] .
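For orientation, a bivariate binomial random-effects model of this kind is commonly written as below. The notation is illustrative and not taken from the review: $i$ indexes studies, $Se_i$ and $Sp_i$ are study-level sensitivity and specificity, and $\rho$ is the between-study correlation on the logit scale.

```latex
\begin{aligned}
TP_i &\sim \mathrm{Binomial}\big(TP_i + FN_i,\ Se_i\big), \\
TN_i &\sim \mathrm{Binomial}\big(TN_i + FP_i,\ Sp_i\big), \\
\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_{Se} \\ \mu_{Sp} \end{pmatrix},
\begin{pmatrix}
\sigma_{Se}^{2} & \rho\,\sigma_{Se}\sigma_{Sp} \\
\rho\,\sigma_{Se}\sigma_{Sp} & \sigma_{Sp}^{2}
\end{pmatrix}
\right).
\end{aligned}
```

Under this formulation, the summary sensitivity and specificity are the inverse-logit transforms of $\mu_{Se}$ and $\mu_{Sp}$, and the 2 × 2 counts enter the likelihood directly rather than through a normal approximation.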

As per the Cochrane DTA handbook, we investigated heterogeneity by plotting the observed study results and SROC curve in the ROC space alongside the 95% confidence region [24] .

We conducted a meta-regression to investigate differences in sensitivity and specificity between questionnaires, including the type of questionnaire as a covariate. Meta-regression was conducted in R version 4.0.1 using the lme4 package [28].

To assess the robustness of the meta-analysis, sensitivity analyses were conducted by excluding studies based on their QUADAS-2 assessment score [23] . Those identified as high risk in any QUADAS-2 domain or as unclear in four domains were excluded. Different AASM (American Academy of Sleep Medicine) scoring criteria and desaturation (and arousal) thresholds were applied to the included studies. We conducted additional sensitivity analyses by analysing studies that applied the ≥ 3% desaturation scoring criteria together and those that applied the ≥ 4% desaturation scoring criteria (summarised in Table 1) .
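The exclusion rule for the risk-of-bias sensitivity analysis can be expressed as a simple filter. The sketch below assumes each study carries one rating per QUADAS-2 risk-of-bias domain; the domain keys, rating labels, and example studies are hypothetical.

```python
RISK_OF_BIAS_DOMAINS = (
    "patient_selection",
    "index_test",
    "reference_standard",
    "flow_and_timing",
)


def exclude_from_sensitivity_analysis(ratings):
    """Apply the stated rule: exclude a study rated 'high' in any domain
    or 'unclear' in four domains.

    ratings maps each risk-of-bias domain to 'low', 'high' or 'unclear'.
    """
    values = [ratings[domain] for domain in RISK_OF_BIAS_DOMAINS]
    any_high = "high" in values
    unclear_in_four = values.count("unclear") >= 4
    return any_high or unclear_in_four


# Hypothetical studies, for illustration only.
studies = {
    "study_A": dict(patient_selection="low", index_test="unclear",
                    reference_standard="unclear", flow_and_timing="unclear"),
    "study_B": dict(patient_selection="high", index_test="low",
                    reference_standard="low", flow_and_timing="low"),
    "study_C": dict(patient_selection="unclear", index_test="unclear",
                    reference_standard="unclear", flow_and_timing="unclear"),
}

retained = [name for name, r in studies.items()
            if not exclude_from_sensitivity_analysis(r)]
print(retained)
```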

We neither explored reporting bias, nor assessed publication bias due to the uncertainty about the determinants of publication bias for diagnostic accuracy studies, and the inadequacy of tests for detecting funnel plot asymmetry [74] .

Search results are summarised in Fig. 1 .

Of 45 studies, 29 were included for meta-analysis in the sleep clinic population (n = 10,951), 7 were included for meta-analysis in the surgical population (n = 2275), and 2 were included in the resistant hypertension population (n = 541). The remaining 7 studies were excluded from the meta-analysis due to heterogeneity of included populations. Study characteristics and demographic data of the included studies are summarised in Tables 1 and 2 . Overall, 10 clinical settings were identified, of which the sleep clinic, surgical, and resistant hypertension cohorts had sufficient studies for inclusion in the meta-analysis.

Table abbreviations: OSA, obstructive sleep apnoea; AHI, apnoea-hypopnoea index; RDI, respiratory disturbance index; Lab, laboratory; PSG, polysomnography; AASM, American Academy of Sleep Medicine; SD, standard deviation; kg, kilogramme; m, metre; cm, centimetre; NC, neck circumference; WC, waist circumference; n/a, not applicable.

Results of the QUADAS-2 assessment are summarised in Fig. 2 and Online Resource 4 .

In the patient selection domain, 3 studies were rated as high risk of bias due to the case-control study design. For both the index test and reference standard domains, 18 studies were rated as unclear risk of bias due to inadequate information related to blinding; it was unclear if the index test and reference standard findings were interpreted without the knowledge of the other. Thirty-four studies were rated as unclear risk of bias in the flow and timing domain due to lack of reporting on the time interval between the index test and the reference standard. Applicability was rated as low risk in all 45 studies.

In the sleep clinic population (N = 10,951) (Fig. 3) , the Berlin (score cut-off ≥ 2) (Online Resource 5), STOP (score cut-off ≥ 2), and STOP-Bang (score cut-off ≥ 3) (Online Resource 6) questionnaires were included in the meta-analysis [58, 75] . The ASA checklist, SA-SDQ, and STOP-Bang (cut-off ≥ 5) questionnaires were excluded due to insufficient studies.

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 84%, 64%, and 50% respectively. The pooled sensitivity of the Berlin questionnaire to predict all OSA, moderate-severe, and severe OSA was 85% (95% confidence interval (CI): 79%, 89%), 84% (95% CI: 79%, 89%), and 89% (95% CI: 80%, 94%) respectively. Pooled sensitivity remained consistent across OSA severity. Pooled specificity was 43% (95% CI: 30%, 58%), 30% (95% CI: 20%, 41%), and 33% (95% CI: 21%, 46%) respectively. The corresponding diagnostic odds ratios (DOR) were 4.3 (95% CI: 0.7, 7.8), 2.3 (95% CI: 1.3, 3.3), and 3.9 (95% CI: 2.1, 5.7) (Fig. 4, Table 3).

Predictive parameters of the STOP questionnaire (score cut-off ≥ 2)

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 67%, 58%, and 46% respectively. The pooled sensitivity of the STOP questionnaire to predict all OSA, moderate-severe, and severe OSA was 90% (95% CI: 82%, 95%), 90% (95% CI: 75%, 97%), and 95% (95% CI: 88%, 98%) respectively. The pooled specificity was 31% (95% CI: 15%, 53%), 29% (95% CI: 10%, 61%), and 21% (95% CI: 10%, 39%) respectively. The corresponding DOR were 4.2 (95% CI: 0.8, 7.6), 3.8 (95% CI: 1.7, 5.9), and 4.7 (95% CI: 2.6, 6.8) respectively (Fig. 5 , Table 3 ). Greater uncertainty and variability in specificity were noted in the CI width and scatter of individual study estimates.

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 80%, 59%, and 39%, respectively. The pooled sensitivity of the STOP-Bang questionnaire to predict all OSA, moderate-severe, and severe OSA was 92% (95% CI: 87%, 95%), 95% (95% CI: 92%, 96%), and 96% (95% CI: 93%, 98%) respectively. The pooled specificity was 35% (95% CI: 25%, 46%), 27% (95% CI: 18%, 34%), and 28% (95% CI: 20%, 38%) respectively. The corresponding DOR were 6.0 (95% CI: 4.4, 7.6), 6.4 (95% CI: 3.3, 9.5), and 9.2 (95% CI: 5.9, 12.4) respectively (Fig. 6, Table 3). Greater uncertainty and variability in specificity were noted in the CI width and scatter of individual trial estimates, particularly for AHI ≥ 5.

SROC plots were used to display the results of individual questionnaires in the ROC space, plotting each questionnaire as a single sensitivity-specificity point [24]. When we plotted the SROC for all three questionnaires on the same axes, the confidence regions of the Berlin, STOP, and STOP-Bang questionnaires, for all OSA (AHI ≥ 5) (Fig. 7) and severe OSA (AHI ≥ 30) (Fig. 9), overlapped, suggesting that there was no statistically significant difference in sensitivity among the 3 questionnaires. Figure 8 shows no overlap of the confidence regions for the Berlin and STOP-Bang questionnaires, suggesting a possible difference in sensitivity between the two questionnaires. A meta-regression model assuming equal variances for logit sensitivity and logit specificity suggested that the expected sensitivity or specificity differed between the two tests (chi-square = 14.1, 2 df, p = 0.0008) (Fig. 9).
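The diagnostic odds ratio can be related to sensitivity and specificity with the standard identity DOR = [sens / (1 − sens)] / [(1 − spec) / spec]. The Python sketch below applies it to the rounded pooled estimates for AHI ≥ 5 in the sleep clinic cohort. For the Berlin questionnaire (85% and 43%) this gives about 4.3, in line with the reported value; for the other questionnaires the result can differ slightly from the published DOR, presumably because those were derived from unrounded model estimates.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR = odds of a positive test in diseased patients divided by
    the odds of a positive test in non-diseased patients."""
    positive_odds_in_diseased = sensitivity / (1 - sensitivity)
    positive_odds_in_healthy = (1 - specificity) / specificity
    return positive_odds_in_diseased / positive_odds_in_healthy


# Rounded pooled (sensitivity, specificity) for AHI >= 5, sleep clinic cohort.
pooled = {
    "Berlin": (0.85, 0.43),
    "STOP": (0.90, 0.31),
    "STOP-Bang": (0.92, 0.35),
}

for name, (se, sp) in pooled.items():
    print(f"{name}: DOR approx. {diagnostic_odds_ratio(se, sp):.1f}")
```

A DOR of 1 corresponds to a test that does not discriminate at all, which is why a confidence interval that includes values below 1 signals weak evidence of discrimination.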

In the surgical population (n = 2710) (Fig. 10) , we identified the Berlin, STOP, and STOP-Bang questionnaires for inclusion in the meta-analysis. The ASA checklist and OSA50 questionnaires were excluded from meta-analysis due to an insufficient number of studies. Nunes included two surgical cohorts, abdominal and coronary artery bypass grafting, which were entered as separate cohorts [63] .

Two studies were included in the meta-analysis of the Berlin Questionnaire for moderate to severe OSA (AHI ≥ 15) (Fig. 11 ). Due to insufficient data, we were unable to conduct a meta-analysis for all (AHI > 5) and severe OSA (AHI > 30). The prevalence of moderate to severe OSA or AHI of ≥ 15 was 42%. The pooled sensitivity of the Berlin questionnaire to predict moderate to severe OSA (AHI ≥ 15) was 76% (95% CI: 66%, 84%), and the pooled specificity was 47% (95% CI: 32%, 62%). The DOR was 2.9 (95% CI: 0.2, 5.5) ( Table 4 ).

Two studies were eligible for inclusion in the STOP questionnaire meta-analysis for moderate to severe OSA (AHI ≥ 15). However, due to insufficient studies and large heterogeneity around the specificity, the STOP questionnaire was excluded from the meta-analysis (Fig. 12) .

We included 6 studies in the meta-analysis of the STOP-Bang questionnaire for moderate to severe OSA (AHI ≥ 15) (Fig. 13) .

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 72%, 33%, and 21%, respectively. The pooled sensitivity of the STOP-Bang questionnaire to predict all OSA, moderate-severe, and severe OSA was 85% (95% CI: 81%, 88%), 90% (95% CI: 87%, 93%), and 96% (95% CI: 92%, 98%) respectively. The pooled specificity was 40% (95% CI: 30%, 50%), 27% (95% CI: 19%, 37%), and 26% (95% CI: 21%, 46%). The corresponding DOR were 3.6 (95% CI: 2.3, 4.8), 3.4 (95% CI: 1.9, 4.9), and 8.4 (95% CI: 2.7, 14.2), respectively (Table 4 ). Compared to the Berlin and STOP questionnaires, individual trial estimates of sensitivity appeared to be more homogeneous for the STOP-Bang questionnaire (Figs. 11, 12, and 13) .

In the surgical population, two of six studies reported data at multiple cut-off points for the STOP-Bang questionnaire for moderate-to-severe OSA (AHI ≥ 15) [62, 63]. Increasing the threshold from 4 to 7 increased specificity from 31% (95% CI: 20%, 40%) to 96% (95% CI: 89%, 99%), with the highest specificity at cut-off values ≥ 6 and ≥ 7 (Table 5). However, this increase in specificity came at the expense of reduced sensitivity.
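The trade-off between sensitivity and specificity as the cut-off rises can be illustrated with a small Python sketch. The scores and PSG outcomes below are invented for illustration and are not data from the included studies.

```python
def sens_spec_at_cutoff(scores, has_osa, cutoff):
    """Sensitivity and specificity when 'score >= cutoff' is screen-positive."""
    pairs = list(zip(scores, has_osa))
    tp = sum(1 for s, d in pairs if s >= cutoff and d)
    fn = sum(1 for s, d in pairs if s < cutoff and d)
    tn = sum(1 for s, d in pairs if s < cutoff and not d)
    fp = sum(1 for s, d in pairs if s >= cutoff and not d)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical questionnaire scores and PSG outcomes (True = AHI >= 15).
scores = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 3, 4, 5, 2, 1, 6]
has_osa = [False, False, True, False, True, True, False, True,
           True, True, False, True, True, False, False, False]

for cutoff in range(3, 8):
    se, sp = sens_spec_at_cutoff(scores, has_osa, cutoff)
    print(f"cut-off >= {cutoff}: sensitivity {se:.2f}, specificity {sp:.2f}")
```

Raising the cut-off can only move patients from screen-positive to screen-negative, so sensitivity cannot increase and specificity cannot decrease as the threshold goes up.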

We included 2 studies (n = 517) in the meta-analysis of the Berlin questionnaire (cut-off ≥ 2) for all OSA (AHI of ≥ 5) [65, 66] . Due to insufficient study data, we were unable to conduct a meta-analysis for moderate-severe (AHI > 15) and severe OSA (AHI > 30).

The prevalence of all OSA or an AHI of ≥ 5 was 80%. The Berlin questionnaire's pooled sensitivity to predict all OSA or AHI of ≥ 5 was 80% (95% CI: 60%, 92%), and the pooled specificity was 36% (95% CI: 21%, 55%). The DOR was 2.2 (95% CI: 0.7, 3.8).

Asthma, community clinic, highway bus drivers, neurology clinic, primary care, respiratory and snoring clinic cohorts were identified but were excluded from the meta-analysis due to having only one study per cohort (Online Resource 7) [67] [68] [69] [70] [71] [72] [73] .

No studies were evaluated as high risk in the surgical and resistant hypertension populations; therefore, no sensitivity analyses were conducted.

In the sleep clinic population, sensitivity analyses were conducted for the Berlin (Online Resource 8), STOP-Bang (Online Resource 9), and the STOP questionnaires for AHI > 5, AHI ≥ 15, and AHI ≥ 30 (Online Resource 10) excluding studies identified as high risk in any QUADAS-2 domain, unclear in four domains or outliers. We excluded one study for the STOP questionnaire for AHI > 5 [49] , AHI ≥ 15 [44] , and AHI ≥ 30 [44] . For the STOP-Bang questionnaire, we excluded five studies for AHI > 5 [29-31, 34, 46] and four studies for AHI ≥ 15 [30, 35, 38, 44] and AHI ≥ 30 [30, 35, 38, 44] . For the Berlin questionnaire AHI > 5 [45, 46, 53, 55] and AHI ≥ 15 [45, 46, 53, 55] , we excluded four studies, and for an AHI ≥ 30 [44, 45, 55] , we excluded three studies.

Across all three questionnaires, exclusion of studies was associated with stable or slightly increased sensitivity. In contrast, sensitivity analysis was associated with reduced specificity (Online Resources 8-10). The STOP-Bang questionnaire remained the most effective questionnaire with the highest sensitivity compared to the Berlin and STOP questionnaires. Specificity among all three questionnaires remained low.

Due to an insufficient number of studies, no sensitivity analysis was conducted in the resistant hypertension population.

In the surgical population, the Berlin and STOP questionnaire studies utilised the ≥ 3% desaturation scoring criteria; therefore, no sensitivity analyses were conducted. For the STOP-Bang questionnaire, studies applied either ≥ 3% or ≥ 4% desaturation criteria. When we applied the ≥ 3% desaturation criteria to the STOP-Bang questionnaire, we excluded one study for AHI > 5 [60], two studies for AHI ≥ 15 [60, 64], and one study for AHI ≥ 30 [60]. In turn, when we applied the ≥ 4% desaturation criteria, we excluded four studies for AHI ≥ 15 [59, 61-63]. Across the three AHI thresholds, sensitivity remained stable, compared to a stable or slightly decreased sensitivity with application of the ≥ 3% desaturation criteria. For AHI ≥ 15, application of the ≥ 4% desaturation criterion was associated with a slight reduction in sensitivity and an increase in specificity (Online Resource 11).

We conducted a sensitivity analysis in the sleep clinic population for the Berlin, STOP, and STOP-Bang questionnaires, applying both the ≥ 3% and ≥ 4% desaturation criteria respectively. Studies were excluded on the basis of high risk of bias, scoring criteria not specified, and desaturation criteria (≥ 3% or ≥ 4%) (Online Resource 12) .

Across all three questionnaires in the sleep clinic population, exclusion of studies was associated with stable sensitivity and reduced specificity, particularly when applying the ≥ 4% desaturation criterion (Online Resources 13, 14, 15) . Overall, the STOP-Bang questionnaire remained the most effective questionnaire with the highest sensitivity compared to the Berlin and STOP questionnaires. Specificity among all three questionnaires remained low.

This systematic review and meta-analysis investigated questionnaires' accuracy and clinical utility as screening tools for OSA in adults in different clinical cohorts.

Consistent with previous studies, our findings showed that the STOP-Bang questionnaire (score cut-off ≥ 3) had the highest sensitivity to detect OSA and the highest diagnostic odds ratio in both the sleep clinic and surgical populations [12, 18, 76]. However, the STOP-Bang questionnaire was limited by consistently low specificity across all AHI thresholds, resulting in high false positive rates. The Berlin questionnaire (score cut-off ≥ 2) appeared to be the least useful, demonstrating overall low sensitivity and low specificity across all three cohorts [12, 18, 77]. Although there was no comparison with other questionnaires in the resistant hypertension cohort, findings were comparable with the sleep clinic and surgical cohorts.

OSA screening questionnaires are intended to provide the information required to identify patients most likely to benefit from downstream management decisions, such as onward referral for objective sleep testing and possible treatment following a positive full diagnostic test. The potential utility of OSA screening questionnaires in risk stratification of patients has been demonstrated in several cohorts. Not only has OSA been associated with risk of peri-operative complications and consequent longer length of hospital stay, but it has also been linked to poor clinical outcomes including higher rates of post-CABG atrial fibrillation [78-80]. In the context of the ongoing coronavirus disease 2019 (Covid-19) pandemic, a recent study reported worse clinical outcomes in patients with Covid-19 classified by the Berlin questionnaire as high risk, compared to those at low risk, of OSA [81]. The study also highlighted the challenges with objective assessment of OSA with PSG during the Covid-19 pandemic, emphasising the need for alternative approaches beyond PSG, such as validated screening questionnaires. In this context, we would encourage the assessment and validation of OSA screening questionnaires, in particular STOP-Bang, as screening tools for risk stratification in appropriate clinical settings, with the aim of improving outcomes for patients.

Although sensitivity and specificity provide us with the necessary information to discern between the available screening questionnaires, the clinical value and application of the screening questionnaires are demonstrated by means of the positive and negative predictive values which are dependent on the prevalence of the disease in the given clinical population. Although we were unable to pool the predictive values of individual questionnaires due to variation in prevalence across studies, the point estimates of PPV and At the same time, the low specificity of the STOP-Bang questionnaire (and therefore its relative inability to correctly identify patients without OSA) leads to a high rate of false positive findings; this may have emotional and cognitive implications for individual patients with added consequences for clinical services, not least cost [80, 82] .
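The dependence of predictive values on prevalence follows from Bayes' theorem and can be illustrated with a short Python sketch. It uses the pooled STOP-Bang sensitivity and specificity for AHI ≥ 15 in the sleep clinic cohort (95% and 27%), evaluated at the prevalence reported for that analysis (59%) and at a lower prevalence of 20% that is purely hypothetical. The resulting values are illustrative calculations, not results reported by the review.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence              # true positives per patient screened
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# 0.59 is the reported prevalence; 0.20 is a hypothetical lower-risk setting.
for prevalence in (0.59, 0.20):
    ppv, npv = predictive_values(0.95, 0.27, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With sensitivity and specificity held fixed, lowering the prevalence reduces the PPV and raises the NPV, which is why predictive values from studies with different prevalences are not directly poolable.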

This systematic review's main strength lies in our comprehensive literature search with stringent eligibility criteria to identify all relevant studies reporting on the accuracy and clinical utility of existing OSA screening questionnaires that were validated against the gold standard PSG. Our inclusion of the LILACS database expanded our search to include Latin America and the Caribbean studies. Of previous reports, the review by Ramachandran [17] was limited to a search of two databases, English publications only, and omitted any grey literature sources in their search strategy. Additionally, it was unclear if Ross [16] and Abrishami [12] included any grey literature sources in their searches.

Two independent reviewers completed data extraction, and we used the QUADAS-2 tool to rigorously assess all included studies for risk of bias. To evaluate the robustness of the meta-analysis, we conducted sensitivity analyses to investigate the potential influence on our findings from studies at high, or unclear, risk of bias. Although our study did not explore source differences from an ethnicity or geographical perspective, we conducted a further sensitivity analysis to evaluate the impact of varying scoring criteria on our study findings. The utilisation of different AASM scoring criteria and desaturation (and arousal) thresholds across studies created a source of variability [83-85]. Although the definition for apnoeas remained stable, there has been much controversy about the definition of hypopnoeas, specific to flow reduction, oxygen desaturation, and the presence or absence of arousal [86]. Varying definitions of hypopnoea not only affect prevalence estimates but are also likely to lead to underestimation of OSA in patients who may benefit from treatment [86]. A study by Guilleminault et al. (2009) showed that using the 30% flow reduction and 4% desaturation without arousal criteria would have missed 40% of patients who were identified using the criteria with arousal and who were responsive to CPAP therapy with reduction in AHI and symptomatic improvement [87].

Against this background, our review is based on a larger number of studies than prior analyses [12, 16, 17]. Although the review by Chiu [18] encompassed a larger dataset, that report carried a greater risk of bias due to the inclusion of retrospective studies and studies that used PSG and portable monitoring as the reference standard.

This review considered all existing OSA screening questionnaires for inclusion. In contrast, Chiu [18] pre-selected four questionnaires, including the ESS, which was not developed as a screening questionnaire, but as a measure of daytime sleepiness. Similar to Abrishami [12] and Chiu [18], our review focused on questionnaires only, in contrast to Ross [16] and Ramachandran [17], who also included portable monitoring and clinical prediction tools, respectively.

There are a number of limitations to this work. Our findings are influenced by the limitations of the included studies. In several studies, the true risk of bias was unclear in multiple QUADAS-2 domains due to underreporting in the index test, reference standard, and flow and timing domains. Similarly, it was often unclear if the results of the index test and the reference standard were interpreted independently. Very few studies provided adequate information to determine if the time interval between the index test and the reference standard was appropriate.

Our decision to exclude seven additional clinical cohorts may be considered a limitation; however, in the context of unclear, and possibly substantial, differences among these studies in the patient spectrum and disease prevalence, we felt it appropriate not to include these in the meta-analysis. Because the accuracy of screening tools varies according to the spectrum of disease, this further reiterates the need for validation studies in similar clinical cohorts.

There was a high degree of heterogeneity among included studies with the possibility of selection bias, especially in the sleep clinic population. Consequently, reported sensitivity estimates are likely to be higher than those in lower-risk populations, making it difficult to extrapolate the true utility of the questionnaire in clinical practice.

In conclusion, our review investigated the accuracy and clinical utility of existing OSA screening questionnaires in different clinical cohorts. While the STOP-Bang questionnaire had a high sensitivity to detect OSA in both the sleep clinic and surgical cohorts, it lacked adequate specificity. This review highlights the issue of low specificity across OSA screening questionnaires. Research is required to explore reasons for low specificity and strategies for improvement, ideally without reducing sensitivity. The validation of screening questionnaires in sleep clinic populations is limited by possible selection and spectrum bias, reiterating the need for diagnostic validation studies in clinically similar cohorts. Additionally, further research is needed in resistant hypertension and other at-risk populations that we could not include in the meta-analysis. Improvement in the conduct and reporting of diagnostic validation studies must ensure quality and low risk of bias. Finally, to enable the extrapolation of the true accuracy and clinical utility of screening questionnaires, validation studies of high methodological quality in comparable, clinically relevant cohorts are required.
