key: cord-1008467-tmuvbq1k authors: Munsch, N.; Martin, A.; Gruarin, S.; Nateqi, J.; Abdarahmane, I.; Weingartner-Ortner, R.; Knapp, B. title: A benchmark of online COVID-19 symptom checkers date: 2020-05-26 journal: nan DOI: 10.1101/2020.05.22.20109777 sha: 0a9ed84b2d6679a8348ec48b48d493b2931998d5 doc_id: 1008467 cord_uid: tmuvbq1k Background A large number of online COVID-19 symptom checkers and chatbots have been developed, but anecdotal evidence suggests that their conclusions are highly variable. To our knowledge, no study has evaluated the accuracy of COVID-19 symptom checkers in a statistically rigorous manner. Methods In this paper, we evaluate 10 different COVID-19 symptom checkers screening 50 COVID-19 case reports alongside 410 non-COVID-19 control cases. Results We find that the number of correctly assessed cases varies considerably between different symptom checkers, with Symptoma (F1=0.92, MCC=0.85) showing the overall best performance followed by Infermedica (F1=0.80, MCC=0.61). In the modern world, large numbers of patients initially turn to various online sources for self-diagnosis before seeking a diagnosis from a trained medical professional. But web sources have inherent problems such as misinformation, misunderstandings, misleading advertisements and varying quality [1]. Interactive web sources developed to meet the need for online diagnoses are sometimes referred to as symptom checkers or chatbots [2] [3]. Based on a list of entered symptoms and other factors, symptom checkers return a list of potential diseases. Online symptom checkers have become popular in the context of the novel coronavirus disease 2019 (COVID-19) pandemic as access to doctors is reduced, worry in the population is high, and a lot of misinformation is circulating on the web [1].
On COVID-19 symptom checker web pages, users are asked a series of COVID-19-specific questions and, upon completion, an association between the answers and COVID-19 is given alongside behavioural recommendations, e.g., self-isolate. COVID-19 symptom checkers are valuable tools for pre-assessment and screening during this pandemic, both taking pressure off clinicians and reducing footfall within hospitals. A large number of symptom checkers specific to COVID-19 have been developed. Anecdotal evidence (e.g. a newspaper article [4]) suggests that their conclusions differ, with possible implications for the quality of the symptom assessment. To our knowledge, there exist no studies comparing and evaluating COVID-19 symptom checkers. In the following, we present a study evaluating 10 different COVID-19 online symptom checkers using 50 COVID-19 cases extracted from the literature and 410 non-COVID-19 control cases of patients with other diseases. We find that the COVID-19 symptom checkers' classifications of many patient cases differ, as do their accuracies. Symptoma (F1=0.92, MCC=0.85) shows the overall best performance followed by Infermedica (F1=0.80, MCC=0.61). Ten COVID-19 symptom checkers that were freely available online between the 3rd and 9th of April 2020 were selected for this study (Table 1). These symptom checkers were used in the versions available in this date range; updates after this date were not considered for analysis. As a baseline for the performance evaluation of the 10 online COVID-19 symptom checkers, we developed two additional simplistic symptom checkers. These two checkers evaluate and weigh the presence of COVID-19 symptoms using the symptom frequencies provided by the WHO [5] (see S1 Table), based on vector distance (SF-DIST) and cosine similarity (SF-COS). These approaches can be implemented in a few lines of code (see S2 Text). (Table 1: name and URL of each symptom checker.) Control cases: COVID-19 cases allow us to evaluate the sensitivity of symptom checkers.
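The two symptom-frequency baselines described above (SF-DIST and SF-COS) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' S2 Text: the frequency values are made-up placeholders rather than the S1 Table values, the patient is encoded as a binary symptom vector, and the names `SYMPTOM_FREQ`, `to_vector`, `sf_cos` and `sf_dist` are hypothetical. How the resulting score is turned into a risk level is not specified here.

```python
import math

# Placeholder symptom frequencies for COVID-19 (ASSUMPTION: illustrative
# values only, not the WHO-derived numbers listed in S1 Table).
SYMPTOM_FREQ = {
    "fever": 0.9,
    "dry cough": 0.7,
    "fatigue": 0.4,
    "shortness of breath": 0.2,
    "sore throat": 0.1,
}
VOCAB = sorted(SYMPTOM_FREQ)
FREQ_VECTOR = [SYMPTOM_FREQ[s] for s in VOCAB]


def to_vector(symptoms):
    """Encode a patient's reported symptoms as a binary vector over VOCAB."""
    present = set(symptoms)
    return [1.0 if s in present else 0.0 for s in VOCAB]


def sf_cos(symptoms):
    """Cosine similarity between patient vector and frequency vector.

    Higher values mean the symptom pattern is closer to COVID-19.
    Returns 0.0 when no known symptom is reported.
    """
    p = to_vector(symptoms)
    dot = sum(a * b for a, b in zip(p, FREQ_VECTOR))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_f = math.sqrt(sum(b * b for b in FREQ_VECTOR))
    if norm_p == 0.0:
        return 0.0
    return dot / (norm_p * norm_f)


def sf_dist(symptoms):
    """Euclidean distance between patient vector and frequency vector.

    Lower values mean the symptom pattern is closer to COVID-19.
    """
    p = to_vector(symptoms)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, FREQ_VECTOR)))
```

With these placeholder values, a patient reporting fever and dry cough scores a higher cosine similarity and a smaller distance than a patient reporting only a sore throat. Any cut-off used to call a case COVID-19 positive would have to be chosen separately.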
To also evaluate the specificity, 410 control cases from the British Medical Journal (BMJ) were sourced [6, 7]. To allow a fair assessment, we only used cases containing at least one of the COVID-19 symptoms (see S4 Table) reported by the WHO [5]. Classifying non-relevant cases (e.g., a fracture) would overestimate the symptom checkers' specificity. Furthermore, such patients would not consult an online COVID-19 symptom checker. None of these 410 BMJ cases has COVID-19 listed as the diagnosis, as the cases were collected before the COVID-19 outbreak. (This preprint, which was not certified by peer review, was posted on medRxiv on May 26, 2020. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.) Most COVID-19 symptom checkers return a human-readable text which contains an association between entered symptoms and COVID-19. We classified these associations into three different categories: high risk, medium risk and low risk. Examples of high, medium and low risk classifications are "There is a high risk that COVID-19 is causing your symptoms", "Your symptoms are worrisome and may be related to COVID-19" and "There's nothing at present to suggest that you have coronavirus (COVID-19). Please practice physical/social distancing" respectively. Our full mapping from text output to risk level for all symptom checkers and all text outputs is given in S5 Table. Some symptom checkers only have two possible outputs: COVID-19 risk or no COVID-19 risk. In order to make symptom checkers with three and two risk levels comparable, we performed
two analysis versions: (a) medium risk and high risk are treated as COVID-19 positive (and low risk as COVID-19 negative) and (b) only high risk is treated as COVID-19 positive (and low risk and medium risk as COVID-19 negative). To evaluate the robustness of our statistical measures and account for the unbalanced dataset, we performed bootstrapping across our cases. A total of 3000 random samples, each consisting of 50 COVID-19 cases and 50 control cases, were created by sampling with replacement from the original set of 50 COVID-19 cases and the 410 control cases. In order to analyse the performance of the 10 online symptom checkers, we calculated the sensitivity and the specificity of each symptom checker based on the cases described in the methods section. A scatterplot of sensitivity against specificity for COVID-19 of the different symptom checkers is given in Fig 1, in which the symptom checkers fall into four groups. Further analysis of true and false case classifications of these groups shows that the group in the upper left corner is composed of symptom checkers that require one (or few) highly specific symptoms to be present in order to classify a case as COVID-19 positive (e.g. "intensive contact with a COVID-19 positive person"). In this way, these symptom checkers miss many COVID-19 positive patients that did not report exactly this highly specific symptom. Conversely, such highly specific symptoms are hardly ever present in non-COVID-19 cases. This results in low sensitivity and high specificity. The group in the lower right corner is composed of symptom checkers which predict a case as COVID-19 positive based on the presence of one or few COVID-19-associated symptoms, e.g. the presence of fever or cough is enough to predict a patient as COVID-19 positive. These checkers classify nearly every patient that has a respiratory disorder or viral infection as COVID-19 positive.
As such, they do not miss many COVID-19 patients but wrongly predict many non-COVID-19 patients as COVID-19 positive. This results in low specificity and high sensitivity. The group in the more central region is composed of symptom checkers which use a more balanced prediction but exhibit limited success in correctly classifying COVID-19 and non-COVID-19 patients. The group in the upper right corner is composed of symptom checkers which also use a more balanced "symptoms to COVID-19 association model" but, in this case, the classification between COVID-19 and non-COVID-19 patients is more successful. (Figure caption fragment: ... is defined by either a medium risk or high risk returned by a symptom checker.) As Symptoma exhibits the best combination of sensitivity and specificity, we focused our analysis on Symptoma's performance. Symptoma allows free-text input of one's symptoms and thereby a more precise representation of the clinical test cases. The other symptom checkers do not allow free-text input, which limits the number of possible symptoms considerably (Fig 2A). In order to investigate how Symptoma would perform if constrained, we performed pairwise comparisons where Symptoma is only allowed to use the symptoms of another symptom checker. In this setup, Symptoma is massively disadvantaged as it cannot use its full abilities.
For example, in the pairwise comparison with "Your.MD", Symptoma considers only "fever", "dry cough", "shortness of breath", and "contact with a confirmed COVID-19 case" for the classification of cases. The results of this analysis are summarised in Fig 2B, Table and S10 Table. Under these constraints and when COVID-19 positive is defined by high risk only, Symptoma still significantly outperforms Apple and Cleveland Clinic, while performing statistically similarly to six of the remaining symptom checkers (upper panel of Fig 2B). When COVID-19 positive is defined by high and medium risk (lower panel of Fig 2B), Symptoma's constrained performance is similar to seven of the other checkers, while outperforming Ada and Docyet. For Apple, Babylon, CDC, Cleveland Clinic, Providence and "Your.MD" the performance is about the same. When Symptoma is allowed to use all symptoms of the case descriptions, it clearly outperforms all other checkers (dashed blue line in Fig 2B). This suggests that performance is directly related to the number of symptoms any given checker considers as input, and as such, free-text (non-constrained) input will normally lead to a higher likelihood of correct diagnosis.
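The evaluation steps described in the methods and results (merging risk levels into a binary call, computing sensitivity, specificity, F1 and MCC, and balanced bootstrapping) might be sketched as follows. This is a minimal sketch under stated assumptions and not the authors' code: the function names are hypothetical, the metric formulas are the standard confusion-matrix definitions, and bootstrapping the F1 score (rather than some other statistic) is an assumption, since the text does not say which measures were resampled.

```python
import math
import random


def binarize(risk, version):
    """Map 'low'/'medium'/'high' to a COVID-19 positive (True) or negative call.

    version 'a': medium and high risk count as positive.
    version 'b': only high risk counts as positive.
    """
    if version == "a":
        return risk in ("medium", "high")
    if version == "b":
        return risk == "high"
    raise ValueError("version must be 'a' or 'b'")


def metrics(truth, pred):
    """Sensitivity, specificity, F1 and MCC from boolean truth and prediction lists."""
    tp = sum(t and p for t, p in zip(truth, pred))
    fn = sum(t and not p for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def balanced_bootstrap(covid_preds, control_preds,
                       n_samples=3000, n_per_class=50, seed=0):
    """Draw balanced samples with replacement and return one F1 score per sample.

    covid_preds / control_preds are the checker's boolean calls for the
    COVID-19 cases and the control cases, respectively.
    """
    rng = random.Random(seed)
    truth = [True] * n_per_class + [False] * n_per_class
    scores = []
    for _ in range(n_samples):
        pos = rng.choices(covid_preds, k=n_per_class)    # with replacement
        neg = rng.choices(control_preds, k=n_per_class)  # with replacement
        scores.append(metrics(truth, pos + neg)["f1"])
    return scores
```

Sampling 50 cases from each class in every draw keeps the resampled sets balanced even though the original data contain 50 COVID-19 cases and 410 controls, which is the stated purpose of the bootstrapping step.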
We classified 50 COVID-19 case descriptions from the recent literature as well as 410 non-COVID-19 control cases with ten different online COVID-19 symptom checkers. Only two out of ten symptom checkers showed a reasonably good balance between sensitivity and specificity, namely Infermedica (F1=0.80) and Symptoma (F1=0.92). Most other checkers are either too sensitive, classifying almost all patients as COVID-19 positive, or too specific, classifying many COVID-19 patients as COVID-19 negative (see Fig 1). For example, our BMJ control cases contain a patient suffering from a pulmonary disease who presents with various symptoms, including fever, cough and shortness of breath, the three most frequent symptoms associated with COVID-19. Symptoma uses the additional symptoms and risk factors not considered by the other checkers, namely loss of appetite, green sputum, and a history of smoking, to discern the correct diagnosis of COVID-19 negative. Five of the other checkers consider this case as high risk. Furthermore, most of the symptom checkers are even outperformed by our simplistic symptom frequency vector approaches (SF-DIST (F1=0.57) and SF-COS (F1=0.79)). Notably, the cosine version shows surprisingly good results, outperforming 8 out of 10 symptom checkers based on the F1 score. To our knowledge, this is the first scientific evaluation of online COVID-19 symptom checkers; however, there are a number of related studies evaluating symptom checkers. These include a study that evaluated 23 general-purpose symptom checkers based on 45 clinical case descriptions across a wide range of medical conditions and found that the correct diagnosis was on average listed among the top 20 results of the checkers in 58% of all cases [2]. This study design was extended for five additional symptom checkers using ear, nose and throat (ENT)
cases showing similar results [8]. Other evaluations include an evaluation of symptom checkers used for knee pain cases, which found, based on 527 patients and 26 knee problems, that the physician's diagnosis was present within the prediction list in 89% of the cases while the specificity was only 27% [9]. In another study, an analysis of an automated self-assessment triage system for university students, used prior to an in-person consultation with a medical doctor, found that the system's urgency rating agreed perfectly in only 39% of cases, while for the remaining cases the system tended to be more risk averse than the doctor [10]. Also, the applicability of online symptom checkers for 79 persons aged ≥50 years based on "think-aloud" protocols [11], deep learning algorithms for medical imaging [12], and services for urgent care [3] were evaluated. Whether the performance of any (COVID-19) online symptom checker is acceptable depends on the perspective and use of the results. In the case of COVID-19, an online assessment cannot fully replace a PCR test as some people are asymptomatic, while others presenting with very specific COVID-19 symptoms might, in fact, have a very similar but different disease. Regardless, online COVID-19 symptom checkers can act as a first triage shield to take pressure off in-person physician visits or hospitals. Symptom checkers could even replace telephone triage lines in which non-medically trained personnel read a predefined sequence of questions. Even though this was not part of this study, the authors believe that COVID-19 symptom checkers (if appropriately maintained and tested) might also be more reliable than the direct use of search engines such as Google or information via social media.
The strength of this study lies in the fact that it is based on a large number (n=460) of real patients' case descriptions from the literature and a detailed evaluation of the best performing symptom checker (Fig 2). Conversely, a potential weakness of this study lies in using real literature-based cases, which might have biased the test set towards rather severe cases of COVID-19, as mild and uninteresting cases are usually not found in the literature. We countered this bias by not including extreme edge cases from the literature in our 50 COVID-19 cases. Another bias might be that our control case descriptions do not report a "COVID-19 contact", even though a person with, for example, a cold might have had a COVID-19 contact (and did not get infected). Another limitation of this study is the non-straightforward mapping of the symptom checker outputs to risk levels (S5 Table). The interpretation of the textual output is debatable in some cases. We countered this by allowing three different risk levels and merging them in two different ways (see Fig 1A and Fig 1B). We also had every symptom checker output classified by multiple persons until consensus was reached. Symptom checkers are being widely used in response to the COVID-19 global pandemic. As such, quality assessment of these tools is critical. We show that online COVID-19 symptom checkers vary widely in their predictive capabilities, with some performing equivalently to random guessing, while others, namely Symptoma (F1 = 0.92) and Infermedica (F1 = 0.80), exhibit high accuracy. All authors are employees of Symptoma GmbH. JN holds shares of Symptoma.
This study has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 830017.
Supporting information S1
Output: return disease with maximum similarity
"Call an ambulance immediately. Please tell them you have symptoms that may be caused by coronavirus (COVID-19)." (risk level: High)
S10 Table. Full table of sensitivity, specificity, accuracy, F1 score and MCC for Symptoma constrained by each symptom checker (COVID-19 positive defined by "medium risk" or "high risk" for non-binary symptom checkers).
Impact of rumors or misinformation on coronavirus disease (COVID-19) in social media
Evaluation of symptom checkers for self diagnosis and triage: audit study
Digital and online symptom checkers and assessment services for urgent care to inform a new digital platform: a systematic review
I asked eight chatbots whether I had Covid-19. The answers ranged from 'low' risk to 'start home isolation'
Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19). 2020
From symptom to diagnosis - symptom checkers re-evaluated: Are symptom checkers finally sufficient and accurate to use?
Accuracy of a Computer-Based Diagnostic Program for Ambulatory Patients With Knee Pain
A study of automated self-assessment in a primary care student health centre setting
Older adult experience of online diagnosis: results from a scenario-based think-aloud protocol
Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies