key: cord-0299227-dao2t890 authors: Wallace, W.; Chan, C.; Chidambaram, S.; Hanna, L.; Iqbal, F.; Acharya, A.; Normahani, P.; Ashrafian, H.; Markar, S. R.; Sounderajah, V.; Darzi, A. title: The diagnostic and triage accuracy of digital and online symptom checker tools: a systematic review date: 2021-12-21 journal: nan DOI: 10.1101/2021.12.21.21268167 sha: adc7debf52671cfa21082dfe95c390a933e43454 doc_id: 299227 cord_uid: dao2t890 Objective To evaluate the accuracy of digital and online symptom checkers in providing diagnoses and appropriate triage advice. Design Systematic review. Data sources Medline and Web of Science were searched up to 15 February 2021. Eligibility criteria for study selection Prospective and retrospective cohort, vignette, or audit studies that utilised an online or application-based service designed to input symptoms and biodata in order to generate diagnoses, health advice and direct patients to appropriate services were included. Main outcome measures The primary outcomes were (1) the accuracy of symptom checkers for providing the correct diagnosis and (2) the accuracy of subsequent triage advice given. Data extraction and synthesis Data extraction and quality assessment (using the QUADAS-2 tool) were performed by two independent reviewers. Owing to heterogeneity of the studies, meta-analysis was not possible. A narrative synthesis of the included studies and pre-specified outcomes was completed. Results Of the 177 studies retrieved, nine cohort studies and one cross-sectional study met the inclusion criteria. Symptom checkers evaluated a variety of medical conditions including ophthalmological conditions, inflammatory arthritides and HIV. 50% of the studies recruited real patients, while the remainder used simulated cases. The diagnostic accuracy of the primary diagnosis was low (range: 19% to 36%) and varied between individual symptom checkers, despite consistent symptom data input. 
Triage accuracy (range: 48.8% to 90.1%) was typically higher than diagnostic accuracy. Of note, one study found that 78.6% of emergency ophthalmic cases were under-triaged. Conclusions The diagnostic and triage accuracy of symptom checkers is variable and generally low. Given the increasing push towards population-wide adoption of digital health technology, reliance upon symptom checkers in lieu of traditional assessment models poses a potential clinical risk. Further primary studies, utilising improved study reporting, core outcome sets and subgroup analyses, are warranted to demonstrate equitable and non-inferior performance of these technologies relative to current best practice. PROSPERO registration number CRD42021271022. What is already known on this topic Chambers et al. (2019) previously examined the evidence underpinning digital and online symptom checkers, including the accuracy of their diagnostic and triage information for urgent health problems, and found that diagnostic accuracy was generally low and varied depending on the symptom checker used. Given health systems' increased reliance upon digital health technologies in light of the ongoing COVID-19 pandemic, in addition to the marked increase in availability of similarly themed digital health products since the last systematic review, a contemporary and comprehensive reassessment of the diagnostic and triage accuracy of this class of technologies is warranted. What this study adds Our systematic review demonstrates that the diagnostic accuracy of symptom checkers remains low and varies significantly depending on the pathology or symptom checker used. These findings suggest that this class of technologies, in its current state, poses a significant risk to patient safety, particularly if utilised in isolation. It is made available under a CC-BY 4.0 International license. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
This version was posted December 21, 2021; doi: https://doi.org/10.1101/2021.12.21.21268167. Digital and online symptom checkers (SCs) are application- or software-based tools that enable patients to input their symptoms and biodata in order to produce a set of differential diagnoses and clinical triage advice. The diagnostic function of SCs is to provide a list of differential diagnoses, ranked by likelihood. [1] The triage function highlights to end-users the most appropriate course of action regarding their potential diagnosis, which typically includes seeking urgent care, contacting their general practitioner, or self-care. SCs have become an increasingly prominent feature of the modern healthcare landscape owing to widening internet access and the capacity to obtain personalised self-care advice. In 2020, 96% of UK households had internet access, and over one-third of adults used the internet to self-diagnose health-related issues. [2, 3] Governments have also incorporated SCs to alleviate the increasing burden placed upon both primary care and emergency services, particularly in light of the COVID-19 pandemic. [4] [5] [6] It has previously been estimated that 12% of Emergency Department (ED) attendances would be more appropriately managed by other services. [7, 8] Hence, SCs could reduce the financial and resource burden on the NHS and redirect resources towards those truly in need. Public and private companies have advertised SCs as a cost-effective solution, serving as a first port of call for patients and signposting them to the most appropriate healthcare service. When used appropriately, SCs can advise patients with serious conditions to seek urgent attention and, conversely, prevent those with problems best resolved through self-care from unnecessarily seeking medical attention.
[1] However, all of the aforementioned health, organisational and financial benefits of SCs depend heavily on the accuracy of the diagnostic and triage advice provided. Over-triaging those with non-urgent ailments will exacerbate unnecessary use of healthcare services. Conversely, and more seriously, inaccuracies in diagnosing and triaging patients with life-threatening conditions could result in preventable morbidity and mortality. [1, 9] Indeed, SCs have previously received heavy media criticism for failing to correctly diagnose cancer and cardiac conditions, and for providing differing advice to patients with the same symptomatology but different demographic characteristics. [10] [11] [12] These reports raise concerns that these systems may deliver inequitable clinical performance across differing gender and sociodemographic groups. In a previous systematic review, Chambers et al. (2019) assessed SCs on their safety and their ability to correctly diagnose and to distinguish between high- and low-acuity conditions. [13] Diagnostic accuracy was found to vary between platforms and was generally low. Given the rapid expansion of commercially available digital and online SCs, an updated review is warranted to determine whether this remains the case. Thus, this review aims to systematically evaluate the currently available literature regarding (1) the accuracy of digital and online SCs in providing diagnoses and appropriate triage advice, as well as (2) the variation in recommendations provided by differing systems given homogeneous clinical input data.
This systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and was registered in the PROSPERO registry (ID: CRD42021271022). [14] Prospective and retrospective cohort, vignette, or audit studies were included. Studies that utilised an online or application-based service designed to take symptoms and biodata (i.e., age and gender) as input in order to generate diagnoses and health advice and to direct patients to appropriate services were included. All study populations, including patients, patient cases and simulated vignettes, were eligible. Studies were included regardless of the condition(s) being assessed or the SC used. Included studies had to quantitatively evaluate the accuracy of the SC service. Descriptive studies, abstracts, commentaries, and study protocols were excluded. Only articles written in the English language were included. Following PRISMA recommendations, an electronic database search was conducted using MEDLINE and Web of Science to include articles up to 15 February 2021 (search strategy detailed in the supplementary text). Reference lists of the included studies were examined for additional articles. Search results were then imported into Mendeley (RELX, UK) for duplicate removal and study selection. Screening of articles was performed independently by two investigators (W.W. and C.C.). Uncertainties were resolved through discussion with a third and fourth author (S.C. and V.S.). Key data were extracted and tabulated from the included studies, including details of study design, participants, interventions, SCs used, comparators and reported study outcomes. Data extraction was performed independently by two investigators (W.W. and C.C.).
The primary outcomes of this systematic review were (1) the accuracy of SCs for providing the correct diagnosis and (2) the accuracy of subsequent triage advice given (i.e., whether the acuity of the medical issue was correctly identified and patients were signposted to appropriate services). The secondary outcome, variation in recommendations between SCs given consistent clinical input data, was derived from these extracted outcomes. Owing to heterogeneity in the included studies' design, methodology and reported outcomes, a meta-analysis was not performed. A narrative synthesis of the included studies and pre-specified outcomes was instead carried out. Study bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. [15] The risk of bias was assessed across the four domains by two investigators (W.W. and S.C.); any disagreements were discussed and resolved by a third author (V.S.). The risk of bias for each domain was categorised as low, unclear, or high. No patients were involved in the design or conduct of this study, or in the writing and editing of the manuscript. The literature search yielded nine cohort studies and one cross-sectional study that met the inclusion criteria. Figure 1 presents the flow of studies through the screening process. An overview of the risk of bias assessment using QUADAS-2 can be found in Figure 2. All studies bar one had domains with "unclear" or "high" risk of bias or applicability concerns.
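Both primary outcome measures reduce to simple proportions over case-level results: top-1 diagnostic accuracy (correct diagnosis listed first) and triage accuracy (advised acuity matches the reference). A minimal illustrative sketch of these calculations — the data, field names and values below are hypothetical, not drawn from the included studies:

```python
# Illustrative only: computing the two primary outcome measures from
# hypothetical vignette-level symptom-checker results.

def diagnostic_accuracy(cases):
    """Proportion of cases where the correct diagnosis is listed first (top-1)."""
    hits = sum(1 for c in cases if c["ddx"] and c["ddx"][0] == c["gold_dx"])
    return hits / len(cases)

def triage_accuracy(cases):
    """Proportion of cases where the advised acuity matches the reference triage."""
    hits = sum(1 for c in cases if c["advised_triage"] == c["gold_triage"])
    return hits / len(cases)

# Hypothetical vignette results (field names are assumptions for illustration):
# "gold_dx"/"gold_triage" are the reference answers, "ddx" is the ranked
# differential list returned by the checker.
cases = [
    {"gold_dx": "migraine", "ddx": ["migraine", "tension headache"],
     "gold_triage": "self-care", "advised_triage": "self-care"},
    {"gold_dx": "appendicitis", "ddx": ["gastroenteritis", "appendicitis"],
     "gold_triage": "emergency", "advised_triage": "GP"},
]

print(diagnostic_accuracy(cases))  # 0.5 (correct diagnosis first in 1 of 2 cases)
print(triage_accuracy(cases))      # 0.5 (advised acuity matched in 1 of 2 cases)
```

The second case also illustrates under-triage (an emergency case routed to the GP), the error mode this review flags as clinically most serious.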
Six studies had one or more domains at "high" risk of bias. [16-] Characteristics of the included studies can be found in Table 1. Nine studies evaluated SC diagnostic accuracy (Table 2). Overall primary diagnostic accuracy (i.e., listing the correct diagnosis first) was low in all studies, ranging from 19% to 38% (Figure 3). Six studies examined the accuracy of SCs in providing correct triage advice (Table 2). Overall triage accuracy tended to be higher than diagnostic accuracy, ranging from 49% to 90% (Figure 4). [1, 19] However, another study demonstrated that the accuracy of triage advice for ophthalmic emergencies was markedly lower, with 78.6% of emergency cases under-triaged. This systematic review evaluated the diagnostic and triage accuracy of symptom checkers for a variety of medical conditions using both simulated and real-life patient vignettes. Our review highlighted that both diagnostic and triage accuracy were generally low. Moreover, there is considerable variation in performance despite consistent input parameters. We also note that the diagnostic and triage accuracies of SCs, as well as the variation in performance, were greatly dependent on the acuity of the condition assessed. As a whole, these issues raise multiple concerns about the use of SCs as patient-facing tools, especially given their increasing role as triage services that direct patients towards appropriate treatment pathways.
Both SCs and telephone triage have been promoted as a means of reducing unnecessary GP and ED attendances. However, an inaccurate SC (one that does not suggest a correct set of diagnoses or provide safe triage advice) could expose patients to considerable preventable harm. When unsafe triage advice is paired with an incorrect set of differential diagnoses, this alignment of errors increases the likelihood of clinical harm to patients, not unlike the Swiss cheese model cited in aviation safety reports. [25] For example, Babylon, an NHS-backed SC, has been alleged to suggest that a breast lump may not necessarily represent cancer, and it has also been reported to have misdiagnosed myocardial infarctions as panic attacks. [10, 11] While there will be instances where probability-based clinical decision-making tools are incorrect, a safety-first approach needs to be employed for specific high-risk conditions, with necessary adjustments for low-risk symptoms that may mask or mimic more life-threatening problems. Variability in accuracy is a concerning recurrent theme in the included studies and indicates that patients are provided with heterogeneous advice depending on the SC used and the condition assessed. Variability combined with poor diagnostic and triage accuracy presents a multidimensional system of potential patient harm. Although under-triaging has clearly appreciable deleterious effects on patient wellbeing, it is worth noting that over-triaging manifests as inappropriate health resource utilisation through unnecessary presentation to emergency services. Although this does not impact the health of the primary SC user, it does confer a knock-on opportunity cost that is shouldered by those who are truly in need of emergency services and are left waiting for medical attention.
Although the impact of variable triaging advice from SCs has yet to be robustly researched, the highly varied accuracy between SCs noted in this review suggests that there is considerable scope for discrepancies in quality and health outcomes. This raises further questions regarding the safety of SCs. Many commentators note that poor transparency and reporting of SC development and clinical validation limit the extent to which these tools can be reliably endorsed for population-wide use across health systems. Minimal evidence is provided regarding the context, patient demographics and clinical information used to create SCs. This is reflected in the high risk of bias evident from the quality assessment of the included studies, with little elaboration of patient selection, comparator groups or index tests used. This can be improved by clearly stating the intended use case and coverage of each SC. Coverage (i.e., what conditions and patient populations are accounted for by the software) must be explained, especially since SCs may not account for geographical or country-specific variations in disease prevalence, which affects applicability and potential utility. This could be further complemented by the open publication of algorithms, study protocols and datasets pertaining to SCs. Moreover, SCs currently do not display suitable explainability metrics indicating how they arrive at their recommendations. Providing such metrics would significantly increase the ability to effectively audit these devices, as well as increase trust from both patients and healthcare professionals in the outputs they provide. Overall trust is also hindered by the claims of several SCs to house purported 'AI algorithms' as part of their diagnostic process, despite not providing any convincing evidence that this is indeed the case.
Lastly, there was a noticeable absence of studies from low- and middle-income countries, which are likely to exhibit different health-seeking behaviours, digital literacy rates, and disease burdens. Increased regulation of SCs via health technology assessments is essential given the wide impact of such potent technologies, and has been advocated for previously. [9, 29] In the UK, although the MHRA has recently outlined concerns regarding SCs, the long-awaited expansion of the current regulations for software-based medical products to adequately cover SCs has not yet been realised. [30] Increased regulatory scrutiny is greatly needed in the near future given the rapidly progressing nature of this field. Robust clinical validation and testing are warranted to improve current software trustworthiness and reliability. Digital health studies that form the basis for SCs need to be carried out with greater methodological rigour and transparency. This can be achieved by incorporating core outcome sets; real-world patient data encompassing a greater demographic spread, particularly ethnicity, which is not captured in the shortlisted studies despite often being a key risk factor in many pathologies; and more refined comparator groups that include biochemical, clinical, and radiological diagnoses when feasible. While SCs fulfil the need for telemedicine, further work should also evaluate whether SCs truly are better than more traditional telephone triage lines, especially in terms of cost-effectiveness, as that service also provides 'socially distanced' and personalised health information.
More importantly, there is an unmet need for educating patients in using these tools and appreciating their limitations. While variation in digital health literacy has previously been established, more effort is required to address and correct its socio-economic drivers. In our review, SC diagnostic and triage accuracy varied substantially and was generally low. Variation exists between different SCs and the conditions being assessed; this raises safety and regulatory concerns. Given the increasing trend of telemedicine use, and even the endorsement of certain applications by the NHS, further work should seek to introduce regulation and establish datasets to support their development and improve patient safety. The Corresponding Author has the right to grant on behalf of all authors, and does grant on behalf of all authors, a worldwide licence to the Publishers and its licensees in perpetuity, in all forms, formats and media (whether known now or created in the future), to i) publish, reproduce, distribute, display and store the Contribution; ii) translate the Contribution into other languages, create adaptations, reprints, include within collections and create summaries, extracts and/or abstracts of the Contribution; iii) create any other derivative work(s) based on the Contribution; iv) exploit all subsidiary rights in the Contribution; v) include electronic links from the Contribution to third-party material wherever it may be located; and vi) licence any third party to do any or all of the above.
All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/disclosure-of-interest/ and declare: all authors had infrastructure support from the NIHR Imperial Biomedical Research Centre for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work. Not required. The lead author (the manuscript's guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained. The search strategy is available in the supplementary material; any additional data are available on request. 16a Describe the results of the search and selection process, from the number of records identified in the search to the number of studies included in the review, ideally using a flow diagram. 16b Cite studies that might appear to meet the inclusion criteria but which were excluded, and explain why they were excluded.
Study characteristics 17: Cite each included study and present its characteristics. — Table 1
Risk of bias in studies 18: Present assessments of risk of bias for each included study. — Figure 2
Results of individual studies 19: For all outcomes, present, for each study: (a) summary statistics for each group (where appropriate) and (b) an effect estimate and its precision (e.g. confidence/credible interval), ideally using structured tables or plots. — Table 2
Results of syntheses 20a: For each synthesis, briefly summarise the characteristics and risk of bias among contributing studies. — n/a
20b: Present results of all statistical syntheses conducted. If meta-analysis was done, present for each the summary estimate and its precision (e.g. confidence/credible interval) and measures of statistical heterogeneity. If comparing groups, describe the direction of the effect. — n/a
20c: Present results of all investigations of possible causes of heterogeneity among study results. — 7
20d: Present results of all sensitivity analyses conducted to assess the robustness of the synthesized results. — n/a
Reporting biases 21: Present assessments of risk of bias due to missing results (arising from reporting biases) for each synthesis assessed. — 7
Certainty of evidence 22: Present assessments of certainty (or confidence) in the body of evidence for each outcome assessed. — 7
23a: Provide a general interpretation of the results in the context of other evidence. — 9
23b: Discuss any limitations of the evidence included in the review. — 10
23c: Discuss any limitations of the review processes used. — 10
23d: Discuss implications of the results for practice, policy, and future research. — 10
Registration and protocol 24a: Provide registration information for the review, including register name and registration number, or state that the review was not registered. — 6
24b: Indicate where the review protocol can be accessed, or state that a protocol was not prepared. — 6
24c: Describe and explain any amendments to information provided at registration or in the protocol. — n/a
Support 25: Describe sources of financial or non-financial support for the review, and the role of the funders or sponsors in the review. — 12
Competing interests 26: Declare any competing interests of review authors. — 12
Availability of data, code and other materials 27: Report which of the following are publicly available and where they can be found: template data collection forms; data extracted from included studies; data used for all analyses; analytic code; any other materials used in the review.
Evaluation of symptom checkers for self diagnosis and triage: audit study
Internet access - households and individuals
Web use for symptom appraisal of physical health conditions: a systematic review
Online symptom checker applications: syndromic surveillance for international health
Waiting time as an indicator for health services under strain: a narrative review
Limited evidence of benefits of patient operated intelligent primary care triage tools: findings of a literature review
1 in 4 GP appointments potentially avoidable
Who uses emergency departments inappropriately and when - a national cross-sectional study using a monitoring data system
Safety of patient-facing digital symptom checkers
NHS-backed GP chatbot is branded a 'public health danger'. Daily Mail Online
It's hysteria, not a heart attack, GP app Babylon tells women. The Sunday Times
'Calm down dear, it's only an aneurysm' - why doctors need to take women's pain seriously. The Guardian
Digital and online symptom checkers and health assessment/triage services for urgent health problems: systematic review