title: COVID-19 in differential diagnosis of online symptom assessments
authors: Kannan, Anitha; Chen, Richard; Venkataraman, Vignesh; Tso, Geoffrey J.; Amatriain, Xavier
date: 2020-08-07

The COVID-19 pandemic has magnified an already existing trend of people looking for healthcare solutions online. One class of solutions are symptom checkers, which have become very popular in the context of COVID-19. Traditional symptom checkers, however, are based on manually curated expert systems that are inflexible and hard to modify, especially in a quickly changing situation like the one we are facing today. That is why all existing COVID-19 solutions are manual symptom checkers that can only estimate the probability of this disease and cannot contemplate alternative hypotheses or come up with a differential diagnosis. While machine learning offers an alternative, the lack of reliable data does not make it easy to apply to COVID-19 either. In this paper we present an approach that combines the strengths of traditional AI expert systems and novel deep learning models. In doing so we can leverage prior knowledge as well as any amount of existing data to quickly derive models that best adapt to the current state of the world and the latest scientific knowledge. We use the approach to train a COVID-19 aware differential diagnosis model that can be used for medical decision support both for doctors and patients. We show that our approach is able to accurately model new incoming data about COVID-19 while still preserving accuracy on conditions that had been modeled in the past. While our approach shows evident and clear advantages for an extreme situation like the one we are currently facing, we also show that its flexibility generalizes beyond this concrete, but very important, example.

In a world where many people don't have access to essential healthcare services, and doctors have an average of 15 minutes per patient, it does not come as a surprise that a global pandemic like COVID-19 would place unprecedented stress on the global healthcare system. In this situation many have turned to telemedicine as a way to scale healthcare Hollander and Carr [2020]. However, telemedicine on its own is just a different format for old workflows and processes. In order to scale telemedicine itself, we need to increase both the efficiency and the accuracy of its outcomes by using AI and automation. In fact, AI has been connected to medicine since the very beginning. Early AI approaches like expert systems have been used for decades as medical decision support tools. These expert systems are designed by domain experts (i.e. doctors) who build knowledge bases that are then used to reason about real-world situations. This approach is very similar to the more modern user-facing symptom checkers that have been popularized in the internet age. Online symptom checkers are assessment tools where users enter their symptoms and expect to get some guidance on their possible condition. These tools have become even more prevalent in the COVID-19 age, where many healthcare providers have tried to automate a response to the important question of "Do I have COVID-19?". It is important to note that online COVID-19 symptom checkers are only able to give a response to the likelihood of a person having COVID-19, but they cannot give a holistic assessment of the patient.
In other words, they cannot tell the patient that they are unlikely to have COVID-19 but should instead worry about strep throat. This highlights one of the main shortcomings of expert systems: they are hard to scale and lack flexibility. Adding a new condition to a well-tuned expert system requires re-tuning the system by re-adjusting, mostly manually, all the existing probabilities. In the early stages of the COVID-19 pandemic, when knowledge about the disease evolved on a daily basis, this methodology was not effective. A more novel and very different approach to building diagnosis models is, of course, to use data and machine learning. Indeed, the recent availability of digital resources such as electronic health records holds promise as a major avenue for timely access to highly granular patient-level data. Large data repositories with detailed medical information can be mined through machine learning (ML) techniques to automatically learn fine-grained diagnosis models. These machine-learned models can capture patterns at different levels of granularity, and they can be easily extended and updated as new data becomes available. On the downside, directly incorporating prior medical knowledge gathered through clinical research into machine-learned models is difficult. Therefore, these ML models will only be as good as the data on which they are trained. This becomes particularly limiting in a situation like the one we are facing with COVID-19, in which data is hardly available and of limited quality. In this paper we address the very timely and highly important question of whether we can quickly learn a generalized diagnosis model when a new condition like COVID-19 appears, even when the data around it is still questionable. In order to answer this question, we extend previous work combining expert systems and machine learning approaches (Ravuri et al. [2018]). More concretely, the contributions of this work are the following:
1. We present a machine learning approach and method to quickly enhance an existing AI diagnosis model to incorporate a novel disease like COVID-19.
2. We show that the resulting model is accurate in including COVID-19 in the differential diagnosis, without losing accuracy in diagnosing previously existing conditions.
3. We show that the approach is easily extensible as new evidence or findings about the new condition are surfaced.

2 Related work

Differential diagnosis as inference: Early models for diagnosis were AI expert system models used as medical decision support systems for physicians (cf. Mycin (Buchanan and Shortliffe [1985]), Internist-1 (Miller et al. [1982]), DXplain (Barnett et al. [1987]) and QMR (Rassinoux et al. [1996])). The goal of these systems was to emulate physicians' medical diagnostic ability to provide an "independent expert opinion" that could be leveraged by physicians when making a final decision. These systems have two components: an expert-curated knowledge base and an inference engine that is manually optimized. A fundamental limitation of these approaches and their probabilistic counterparts (cf. Shawe and Cooper [1990], Morris [2001]) is the knowledge acquisition problem (Gaines [2013], Miller et al. [1986]): the knowledge base construction is time-consuming. Adding an extra disease requires weeks of work from expert physicians who need to corroborate evidence from multiple peer-reviewed publications and other sources.
For a rapidly evolving condition such as COVID-19, it may take years before the disease becomes part of the expert system. Note that not only does the scientific knowledge related to the specific disease need to stabilize; in order to add it to the knowledge base, we also need to be able to model it in the presence of other diseases (e.g. how does the probability of someone having the flu given a high fever change once that patient might also have COVID-19?).

Machine learned models: Machine learning provides a viable, scalable path to quickly learn (and revise) models of differential diagnosis, since such models can be learned from patient-level data available from sources such as electronic health records. While the initial work in this space focused on predicting diagnostic codes (ICD) using deep neural networks, either instantaneously or through time (cf. Miotto et al. [2016], Ling et al. [2017], Shickel et al. [2017], Rajkomar et al. [2018], Liang et al. [2019] and references therein), more recently these approaches have been applied to directly modeling the task of producing a differential diagnosis in a manner that is useful in patient-facing settings such as online symptom checkers (Ravuri et al. [2018], Kannan et al. [2020]). Ravuri et al. [2018] also introduce the idea of using expert systems as a data prior, which forms the basis of our current work. We compare to that work in the next section.

We are interested in learning a user-facing machine-learned model for differential diagnosis that considers COVID-19 as a potential disease in its label space of diagnoses, such that the model can be applied in online symptom checkers. We learn this model (§ 4) by combining data from two distinct sources. The first source is the AI medical expert system, from which clinical cases are simulated to capture all diseases except COVID-19 (§ 3.1). The second source is the data collected from an online COVID assessment flow, wherein flows that ended with low to medium risk are treated as positive examples of COVID-19 (§ 3.2).

Novelty of the approach: The proposed model is trained with an objective that takes into account the differential diagnosis of the clinical case, and not just a single disease. With our proposed model, we combine the best of both worlds: a very focused COVID-19 assessment tool that captures the changing guidelines and known medical evidence for COVID-19 (e.g. incorporating anosmia as a new finding), and the generality of a symptom checker that understands many other conditions besides COVID-19. The duality of the approach allows the consideration of COVID-19 in a differential diagnosis whenever appropriate while also modeling the probabilities of competing hypotheses. This paper extends the work of Ravuri et al. [2018] in multiple directions. To the best of our knowledge, this paper is the first to incorporate COVID-19 as part of the differential diagnosis for an automated diagnosis assessment. Another difference from pre-existing work, one that is particularly relevant for COVID-19, is that in this work we combine the data from the expert system with a snapshot of a dataset collected from an online COVID-19 assessment tool, as opposed to e.g. data from EHRs. In a rapidly evolving situation like the one we face today, data from EHRs is still noisy and incomplete.
Using data from a specific online assessment tool is a novel approach that not only quickly generates data, but also allows incrementally improving it over time as new information about the condition is uncovered and added to the assessment. Finally, the model proposed in this paper directly operates on the findings/symptoms observed to be either positive or negative, while previous work uses tokens in findings as the input vocabulary.

Simulator: We use a simulation algorithm to create a large number of clinical vignettes from an extended version of the QMR knowledge base (Miller and Masarie Jr [1990]) to use as our dataset. This knowledge base contains 830 diseases, 2052 findings (covering symptoms, signs, and demographic variables), and their relationships. Relationships between finding-disease pairs are encoded as evoking strength (ES) and term frequency (TF), with the former indicating the strength of association between the constituent finding-disease pair and the latter representing the frequency of the finding in patients with the given disease. The simulation algorithm (Parker and Miller [1989], Ravuri et al. [2018]) makes a closed-world assumption, with the universe of diseases (denoted Y) and findings (F) being those in the knowledge base. Algorithm 1 outlines the procedure. The simulator first samples a disease d ∈ Y and demographic variables, and then samples findings in proportion to their frequency for the chosen disease. Each sampled finding is assigned to be present (f_pos) or absent (f_neg) based on its frequency. If assigned present, findings that are impossible to co-occur with it are removed from consideration (e.g. a person cannot have both a productive and a dry cough). The simulation for a case ends once a randomly chosen number (5-20) of findings has been recorded. For our experiments, we limit the findings to demographic variables and symptoms, as these are the findings most likely to be available when first diagnosing a patient in a telehealth setting. After simulating the findings, there is still uncertainty in the final diagnosis for two main reasons: (1) the user-facing findings may not sufficiently narrow down to a single diagnosis, and/or (2) the random number of findings chosen for the case may not be sufficient to arrive at a diagnosis. Therefore, we use the inference algorithm of the expert system to obtain a differential diagnosis along with its scores. These scores are then normalized to represent a probability distribution. We ignore cases for which the score distribution has high entropy. Fig. 2 provides example cases obtained using this algorithm. The differential diagnoses in each of these cases are peaked around a small number of diseases. The resulting dataset consists of 65,000 distinct clinical cases with 437 diseases and 1418 findings. Each disease is supported by at least 50 clinical cases.

Algorithm 1 (Clinical case simulation). Input: a medical knowledge base of diseases (D) and findings (F), with relationships encoded by FREQ(d, f), where FREQ(d, f) is transformed to represent p(f = 1 | d); the number of cases T; and an expert inference engine ExpertInference that takes as input a set of findings and provides the differential diagnosis. Output: pairs (f, ddx), where f ⊂ F is the set of findings needed to arrive at ddx, and ddx is a differential diagnosis consisting of K pairs (y ∈ D, s ∈ [0, 1]) with Σ_k s_k = 1. For each case, the algorithm samples a disease y and demographic variables, samples the number of findings for the case, and then repeatedly visits candidate findings, marking each as present or absent according to a comparison of rand() against FREQ(f, y) and removing all findings that cannot co-manifest with a finding marked present; once the target number of findings is recorded, ExpertInference is run on them to obtain ddx.
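To make the simulation procedure concrete, the sketch below gives one plausible Python rendering of Algorithm 1. The knowledge-base accessors (sample_disease, sample_demographics, findings_for, frequency, conflicting) and the expert_inference engine are hypothetical stand-ins for the proprietary components described above, and the entropy threshold is an illustrative value rather than the one used in the paper.

```python
import math
import random

def simulate_case(kb, expert_inference, min_findings=5, max_findings=20,
                  entropy_threshold=2.0):
    """One plausible rendering of Algorithm 1: simulate a single clinical case.

    Assumed (hypothetical) knowledge-base interface:
      kb.sample_disease()        -> disease label d
      kb.sample_demographics(d)  -> dict of demographic findings consistent with d
      kb.findings_for(d)         -> findings associated with d
      kb.frequency(f, d)         -> p(f = 1 | d), derived from the TF values
      kb.conflicting(f)          -> findings that cannot co-manifest with f
    expert_inference(demographics, positives, negatives) -> list of (disease, score).
    """
    d = kb.sample_disease()
    demographics = kb.sample_demographics(d)

    n_findings = random.randint(min_findings, max_findings)  # 5-20 findings per case
    positives, negatives, excluded = [], [], set()

    candidates = list(kb.findings_for(d))
    random.shuffle(candidates)
    for f in candidates:
        if len(positives) + len(negatives) >= n_findings:
            break
        if f in excluded:
            continue
        if random.random() < kb.frequency(f, d):
            positives.append(f)
            # e.g. a productive cough rules out a dry cough
            excluded.update(kb.conflicting(f))
        else:
            negatives.append(f)

    # Ask the expert system for a differential diagnosis and normalize its scores.
    ddx = expert_inference(demographics, positives, negatives)
    total = sum(score for _, score in ddx)
    ddx = [(disease, score / total) for disease, score in ddx]

    # Drop cases whose differential is too flat (high entropy).
    entropy = -sum(s * math.log(s + 1e-12) for _, s in ddx)
    if entropy > entropy_threshold:
        return None
    return {"demographics": demographics, "positives": positives,
            "negatives": negatives, "ddx": ddx}
```

Repeating simulate_case T times and discarding the rejected (None) cases yields the kind of (findings, ddx) pairs described in the output of Algorithm 1.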
The COVID-19 dataset used in our models was generated from a virtual diagnostic assessment that guides users through a comprehensive set of clinical questions asked by physicians to determine the likelihood of COVID-19 infection and associated complications from the disease (Figure 1). Questions in the assessment were based on guidance provided by the United States Centers for Disease Control and Prevention (CDC) at the time and elicited information regarding clinical factors including:
• Demographic information

For our approach, we used two variants of the data gathered by the assessment. In one variant, we restricted the inputs to only findings that are also part of the expert system. This removed factors (Table 3) that were more reflective of the timing of the guidelines at the time of data collection and less reflective of inherent diagnostic criteria, such as recent travel history and epidemiological data based on location or living with an individual with known COVID-19 infection. In the second variant, we included all the available findings.

We are interested in building a machine-learned model that takes findings as input and outputs a ranked list of possible conditions (a.k.a. differential diagnosis). We formulate this as a classification task and use the output scores from the model to rank the conditions in the differential diagnosis. A clinical case consists of (a) a set of findings x_pos ⊂ F that are observed in the patient; (b) a disjoint set of findings x_neg ⊂ F that are explicitly observed to be not present in the patient; and (c) the corresponding differential diagnosis ddx = {(y_j ∈ Y, s_j)}_{j=1}^{L}, where s_j (with Σ_j s_j = 1) is the probability of y_j being the true underlying diagnosis. We also assume that demographic variables such as gender and age are part of x_pos. Clearly, x_pos ∩ x_neg = ∅. While one may argue that all findings not reported as present in the patient are therefore absent, in reality doctors gather information pertinent to the observed findings so as to rule out possibly related diseases. In order to mimic this, we make the assumption that |x_pos ∪ x_neg| << |F|. We assume access to a labeled dataset of N clinical cases {(x^(n), ddx^(n))}_{n=1}^{N}.

Soft cross-entropy loss: The goal is to learn a function g : X → Y that minimizes the empirical risk

R(g) = (1/N) Σ_{n=1}^{N} loss(ddx^(n), g(x^(n)))   (1)

where loss measures the discrepancy between the true label distribution ddx^(n) and the predicted distribution g(x^(n)). During evaluation we use the 0/1 loss, and during training we use the surrogate soft cross-entropy loss between the encoding of y^(n) as a probability distribution and the output softmax probability vector provided by the function g. In a typical classification cross-entropy loss, the target is a vector with one non-zero entry corresponding to the ground-truth label. However, in a clinical diagnosis setting there is often some amount of uncertainty in the final diagnosis. As an example, for a patient with 'cough', there are many possible diseases to be considered, with some diseases such as 'common cold' or 'flu' having a higher probability than diseases such as 'pneumonia'. In order to capture this, the target vector is a distribution over the plausible diagnoses, where the probability is based on the scores associated with the differential diagnosis.
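Under this reading, the training objective is the cross-entropy between the normalized differential-diagnosis scores and the model's log-softmax output. The following minimal PyTorch sketch shows the loss; the tensor shapes and the example target row are illustrative, not the authors' implementation.

```python
import torch

def soft_cross_entropy(log_probs: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft target distribution.

    log_probs:   (batch, L) log-softmax output of the model g(x)
    target_dist: (batch, L) differential-diagnosis probabilities s_j (each row sums to 1)
    """
    return -(target_dist * log_probs).sum(dim=1).mean()

# Illustration: a case whose differential puts 0.6 on flu, 0.3 on common cold and
# 0.1 on pneumonia contributes the target row [0.6, 0.3, 0.1, 0, ..., 0] instead of
# the usual one-hot vector.
```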
In Mahajan et al. [2018], a similar construct was used in the context of image classification, but it assumed a uniform distribution over the target labels and hence set the target to 1/k for the k ≥ 1 labels of an image. We acknowledge that even though the loss function models the entire differential diagnosis, we would ideally have a loss function that is cost-sensitive: misdiagnosing a disease that needs urgent care can lead to worse outcomes than misdiagnosing a common cold, and this risk should ideally be taken into account.

(Table: cases from Semigran et al. [2015] and the corresponding transformation to the findings in the expert system; column 3 gives the ground-truth label, and columns 4-5 give the top-5 diseases predicted by two configurations of the model.)

Model: Functional form of g(x): Fig. 2 provides an overview of the model for computing g(x) in eqn. 1. The model takes x_pos and x_neg as input and predicts a distribution over the diagnoses. The model has separate input streams for the demographic variables and the findings. Demographic variables include gender and age. They impose an implicit prior over diseases that are impossible. For example, it is impossible for a biologically male patient to be diagnosed with 'pregnancy'. Similarly, it is unlikely for an infant to be diagnosed with 'dementia'. Incorporating these medically grounded priors into the model can facilitate faster model training, as they serve as a bottleneck over possible diagnoses. To reuse the 'pregnancy' example, in the early iterations of training a male patient with nausea and vomiting can be forced to have zero probability mass over women-related health issues, thereby greatly reducing the parameter search space. To model this, the embedding layers of the demographic variables are assumed to be independent, and each has its own separate L-dimensional space. The L-dimensional embedding vector captures the prior probabilities of these variables over the set of L diagnoses. We assume an equal prior over all diseases that are plausible for a demographic variable. During training, the embeddings of the demographic variables are updated to learn better representations. We use separate embeddings for the presence and absence of findings; the presence of a finding plays a distinct role from its absence. We use an embedding space of high dimensionality, followed by dropout for regularization; the embeddings are then averaged, projected through a fully connected layer, and passed through a log-softmax activation function. As can be seen from the model architecture (Fig. 2), the demographic variable encodings are combined additively with the averaged embedding of the findings after a log-softmax transformation. This enables the bottleneck property described above.
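The sketch below is one plausible PyTorch reading of this architecture: separate embedding tables for present and absent findings that are averaged and projected to the label space, plus per-demographic-value embeddings of dimension L (initializable from the expert-system priors) that are added to the finding scores before a final log-softmax. The exact combination order, dimension names, and handling of variable-length finding lists (padding is omitted here) are assumptions on our part, not the published specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagnosisModel(nn.Module):
    def __init__(self, n_findings, n_demographic_values, n_diseases,
                 emb_dim=1024, dropout=0.7):
        super().__init__()
        # Separate embeddings for findings observed present vs. absent.
        self.pos_emb = nn.Embedding(n_findings, emb_dim)
        self.neg_emb = nn.Embedding(n_findings, emb_dim)
        # Each demographic value (e.g. "male", "age 0-1") gets an L-dimensional
        # embedding acting as a (log-)prior over the n_diseases labels.
        self.demo_emb = nn.Embedding(n_demographic_values, n_diseases)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(emb_dim, n_diseases)

    def forward(self, pos_idx, neg_idx, demo_idx):
        # pos_idx, neg_idx, demo_idx: (batch, k) index tensors of findings/demographics
        findings = torch.cat([self.pos_emb(pos_idx), self.neg_emb(neg_idx)], dim=1)
        findings = self.dropout(findings).mean(dim=1)              # averaged embedding
        finding_scores = F.log_softmax(self.out(findings), dim=-1)
        demo_prior = F.log_softmax(self.demo_emb(demo_idx).sum(dim=1), dim=-1)
        # Additive combination lets implausible demographics act as a bottleneck.
        return F.log_softmax(finding_scores + demo_prior, dim=-1)
```

Initializing the rows of demo_emb with log-priors that place near-zero mass on implausible diagnoses would reproduce the bottleneck behaviour described above, e.g. suppressing 'pregnancy' for male patients during the early iterations of training.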
Evaluation dataset: We evaluate the model performance using two datasets:
• Semigran: This is a public dataset made available as part of the study in Semigran et al. [2015], in which over 50 online symptom checkers were evaluated. The dataset consists of 45 standardized patient clinical vignettes corresponding to 39 unique diseases. We used the simplified inputs provided along with the clinical vignettes, as previously used in other studies (Razzaki et al. [2018], Kannan et al. [2020]).
• COVID-Assessment: In order to quantify whether COVID-19 will be included in the differential for cases observed to be at risk for COVID-19, we used a subset of data from the assessment tool described in § 3.2. We construct two test sets: one that uses all the findings in the assessment, and another that restricts to findings that are in the universe of the clinical case simulation. This lets us verify that the model can infer relationships beyond those involving the findings tabulated in Table 3.

We are interested in a metric that is valuable in deployment contexts; in particular, one that aids doctors and patients in deriving the differential diagnosis so that the relevant diagnoses are considered within a small range of false positives. For this purpose, we report top-k accuracy, also known as recall@k (k ∈ {1, 3, 5}), or sensitivity in the medical literature:

recall@k = (1/T) Σ_{t=1}^{T} Σ_{j=1}^{k} [ŷ^(t)[j] = y^(t)]

where [a = b] is the Iverson bracket that evaluates to one only if a = b and to zero otherwise, and ŷ^(t)[j] is the j-th top class predicted by a model when evaluating test case t. In order to explicitly capture the sensitivity of the model to COVID-19, we use the same metric with COVID-19 as the target label when measuring model performance on cases from the COVID-19 assessment data.

Model variants: We consider three model variants, based on the training dataset:
• Ours-BASE: This is our base model, which does not use any COVID-related data. The only training data is the data simulated from the expert system. This model is used to compare performance on publicly available test sets and to re-establish that expert systems can be modeled as a data prior through simulation.
• Ours-BASE-COVID: This model combines the simulated data from the expert system with the COVID assessment data. In this setting, while COVID-19 is added as an additional label, the universe of findings F is kept the same as in Ours-BASE.
• Ours-BASE-COVID-FULL: This is the same as Ours-BASE-COVID except that all the findings, including travel history and exposure risks, are used as inputs. The full list of additional findings used is provided in Table 3.

Model parameters and training details: Embedding vectors for the demographic variables are initialized by constructing their prior distribution using the expert system. For the label corresponding to COVID-19, we assume a uniform prior over both demographic variables. The 1024-dimensional embedding vectors for the findings are initialized randomly in the range [-0.05, 0.05]. A dropout rate of 0.7 regularizes the model sufficiently to learn better representations. The model is trained with minibatches of size 512 using Adam with an initial learning rate of 0.01. All parameters of the model are updated at each step of the optimization.
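Before turning to the results, note that the recall@k metric defined above reduces to a few lines of NumPy. In this standalone sketch, predictions is a hypothetical (cases × diseases) score matrix produced by one of the model variants and labels holds the index of the ground-truth diagnosis for each case (for the COVID-Assessment sets, the COVID-19 label).

```python
import numpy as np

def recall_at_k(predictions: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of cases whose true diagnosis appears among the top-k predictions.

    predictions: (T, L) model scores for each test case over L diseases
    labels:      (T,)   index of the ground-truth diagnosis per case
    """
    # Indices of the k highest-scoring diseases for each case.
    top_k = np.argsort(-predictions, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# e.g. recall_at_k(scores, labels, k=3) gives the top-3 accuracy for a score matrix.
```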
The goal of this first set of experiments is to re-establish that we can learn from the simulated data of the expert systems. In particular, Fraser et al. [2018] is a human evaluation in which twenty medical experts studied each case in its entirety (some cases include information, such as labs, that is not available in patient-facing applications) and came to a consensus. We can see that the top-3 accuracy is still not at 100%, showcasing the difficulty of agreeing on diagnoses even among human experts. We also want to call out that at the time of publication of Semigran et al. [2015], the average performance of online symptom checkers on the Semigran dataset was at 50% in top-20. In Razzaki et al. [2018], results were provided for only 30 clinical cases; we extrapolated assuming the remaining 15 cases were wrongly diagnosed, so that top-1 accuracy is at 46.6% and top-3 at 64.67%. In Kannan et al. [2020], the training set consists of the same number of diseases as the test set, while we consider a much larger label space of diseases. We make the following observations based on Table 4:
• Ours-BASE performs best across all models, closing the gap with the AI expert system that was used to simulate the dataset. This re-establishes that the approach of using expert systems as a data prior continues to hold in new settings, with different datasets and a different machine learning model (§ 3).
• Adding an extra disease label (COVID-19) does not deteriorate performance, as evidenced by Ours-BASE-COVID, which has no additional findings as input. The drop in top-k accuracy can be attributed to the fact that there are overlapping findings that need to be reasoned about. As an example, consider the first example in Table 5: Ours-BASE-COVID includes COVID-19 in the differential because of the overlapping findings between viral respiratory infections and COVID-19. In contrast, in the second example, with symptoms such as neck stiffness, severe headache and photophobia, the model keeps its prediction close to the differential diagnosis of Ours-BASE.
• Ours-BASE-COVID-FULL, with the additional COVID-19 findings, does not change the prediction accuracy much relative to Ours-BASE-COVID, which is to be expected.

Here, we are interested in understanding the performance of the model on the COVID-Assessment dataset. The goal is to measure the extent to which the learned models Ours-BASE-COVID and Ours-BASE-COVID-FULL capture COVID-19 in the differential diagnosis. Table 7 compares the performance between models on the COVID assessment data. While Ours-BASE-COVID is learned by constraining the input findings to those in Ours-BASE, the model is able to include COVID-19 in the differential diagnosis in 73% of the cases. This shows that the model is able to capture the overlapping findings between COVID-19 and other diseases. When we examined cases where COVID-19 was not part of the differential diagnosis, we found that these mainly corresponded to cases where the input observations (findings) predominantly come from Table 3 and relate to travel and social-distancing factors of which the model is unaware. In contrast, since Ours-BASE-COVID-FULL encodes all these findings, it includes COVID-19 in the differential for 100% of the assessment data, indicating its ability to discriminate COVID-19 from the rest of the diseases. Table 6 provides qualitative examples comparing the three models on the data from the COVID-19 assessments. These are cases with low to moderate risk of contracting COVID-19. Ours-BASE and Ours-BASE-COVID consider only the findings in column 1 as inputs, while Ours-BASE-COVID-FULL uses the union of the first two columns as input. In the final example, as observed in the Semigran cases, the Ours-BASE-COVID differential diagnosis is consistent with Ours-BASE. However, once the model gets the added input of the patient being a healthcare worker, Ours-BASE-COVID-FULL includes COVID-19 in one of its top-5 positions.

In this paper we have presented a novel approach to quickly enhance a diagnosis model that is effective even in an extreme situation when a new, previously unknown condition appears and compromises prior medical knowledge. Our approach combines the strengths of two very different AI formulations: "old-school" traditional expert systems and state-of-the-art deep learning models.
We leverage expert systems as a way to input prior knowledge into the learned model as synthetic data, and use deep learning to learn a generalizable model on the combination of old and new data. Our model is able to capture the nuances of a new condition like COVID-19 without losing the pre-existing medical knowledge accumulated in the expert system. Our paper also demonstrates the efficiency of the approach even in a situation where there is little data to train on for the new disease. In order to do so, we leverage synthetic data created by the expert system used as a simulator, and we add real usage data from an online COVID-19 assessment tool.

References
DXplain: An Evolving Diagnostic Decision-Support System.
Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project.
Safety of patient-facing digital symptom checkers. The Lancet.
Knowledge acquisition: Past, present and future.
Virtually Perfect? Telemedicine for Covid-19.
The accuracy vs. coverage trade-off in patient-facing diagnosis models.
Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence.
Diagnostic inferencing via improving clinical concept extraction with deep reinforcement learning: A preliminary study.
Exploring the limits of weakly supervised pretraining. CoRR.
Internist-1, an experimental computer-based diagnostic consultant for general internal medicine.
The INTERNIST-1/QUICK MEDICAL REFERENCE project: status report. The Western Journal of Medicine.
Quick Medical Reference (QMR): A microcomputer-based diagnostic decision-support system for general internal medicine.
Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records.
Recognition networks for approximate inference in BN20 networks.
Creation of realistic appearing simulated patient cases using the INTERNIST-1/QMR knowledge base and interrelationship properties of manifestations.
Scalable and accurate deep learning for electronic health records.
Modeling principles for QMR medical findings.
Learning from the experts: From expert systems to machine learned diagnosis models.
A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis.
Evaluation of symptom checkers for self diagnosis and triage: audit study.
An empirical analysis of likelihood-weighting simulation on a large.
Deep EHR: A Survey of Recent Advances on Deep Learning Techniques for Electronic Health Record (EHR) Analysis.