Title: Asked and Answered: Building a Chatbot to Address Covid-19-Related Concerns
Authors: Herriman, Maguire; Meer, Elana; Rosin, Roy; Lee, Vivian; Washington, Vindell; Volpp, Kevin G.
Date: 2020-06-18
Journal: NEJM Catal Innov Care Deliv
DOI: 10.1056/cat.20.0230

Through a collaboration with Verily and Google, Penn Medicine was able to leverage machine learning and natural language processing to develop and launch an interactive tool that helps patients get answers to their questions and an assessment of symptoms related to coronavirus. The tool is now publicly available worldwide through the Google Contact Center AI initiative.

Across a wide range of industries, chatbots have been an efficiency-enhancing way for business teams to interact with their customers. Chatbots are conversational agents that leverage machine learning and natural language processing to understand intent and reply with appropriate answers, and they have advantages relevant to the present crisis.1-3 First, they are accessible at any time, allowing patients to obtain answers 24/7 and to avoid wait times on hold before reaching a human. Second, chatbots have a higher handling capacity than any human.4 A single chatbot can simultaneously hold conversations with thousands of people at any time of day, and regardless of call volume every question can be answered immediately. A less obvious benefit is that chatbots offer the potential to standardize answers, a consistency that can be difficult to achieve when significant numbers of people rotate through phone line staffing assignments. Users seek quick, accurate responses when searching for information or assistance, and this productivity has been shown to be one of the most important reasons organizations deploy chatbots.5 While chatbots require ongoing training to achieve accuracy, their ability to learn as they go, incorporating shifts in conversation structure and flow, can help address the fluidity and variability of the language used to describe health-related needs, in both normal and dynamic circumstances.6 Bots can also interact in many different languages. Of course, bots are still machines whose AI is not yet equal to human cognitive abilities, and thus work in this arena requires deliberate and transparent validation methods, finesse with disclaimers, and institutional tolerance for imperfection and post-release improvement. Chatbot diversity derives from frameworks that vary by type, direction, guidance, interaction style, and communication channel.7 Despite this diversity of function, most chatbots perform mundane tasks in a predictable fashion, and there is less evidence on how they might handle the variation and complexity of medical questions and responses arising in the time of a pandemic.8 In this article, we outline the creation of a Penn Medicine chatbot built collaboratively with Verily, Google Cloud, and Quantiphi, a Google Cloud strategic partner. (Penn Medicine includes the Perelman School of Medicine and six hospitals and hundreds of outpatient centers throughout Southeastern Pennsylvania and Southern New Jersey.) When we chose this direction, a number of organizations had already deployed off-the-shelf chatbots. We recognized the potential value of these tools, as many questions - like "Should I wear a mask?" - should produce consistent answers.
Our analysis of incoming patient questions, though, revealed the need to address institution-specific queries, such as "Where should I go to get tested?" or "How do I get my test results?" Even interactions that might logically be standardized, such as symptom checkers, must be crafted in a manner consistent with the capacity, capabilities, and pathways of the health system deploying them, translating information into desired patient actions while responsibly managing constrained resources. This need for unique response mapping, complex contextualization, and dynamic, clinician-guided validation of content required us to develop a specialized chatbot (Figure 1). The process of creating the FAQ bot began with determining the most frequently asked questions, to assess which content would best offload volume from care team members. Using a database of Covid-19-related telephone encounters with the Penn Medicine phone lines and patient-submitted secure messages to our patient portal, a team of medical student volunteers analyzed a sample of 800 telephone encounters and 450 secure messages from the early stage of the epidemic in the Philadelphia region (3/19/20-3/22/20) and categorized them by topic, sub-topic, and main question. These questions were then re-ordered by frequency and further sorted into general categories of Covid-19-related/Infectious Diseases-specific questions (e.g., "How long am I infectious for?" "What do I do if I have a fever?"), Oncology/Immunosuppressed-specific questions (e.g., "Am I at higher risk on immunosuppressants?" "Should I discontinue my immunosuppressants?"), Occupational Health-specific questions (e.g., "I had an exposure; should I go into work?"), and Logistics/Operations-specific questions (e.g., "How do I get tested?"). By this process, we generated an initial list of 97 high-frequency questions. Beyond the frequency of questions, it was important to leverage patient-submitted messages to understand the phrasing patients used in making inquiries. This language became a critical element in training the bot, providing a foundation for the technology to begin learning how to link various phrasings of a question to the underlying intent and, therefore, an appropriate answer. The next step in the content process was formulating answers to these questions. Working with a team of medical students and drawing on PubMed-indexed studies, recommendations from the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO), reputable news outlets, and Penn clinicians, we crafted answers to the extracted questions. Two of the key challenges to overcome were variation in information across a large, complex health system and rapidly changing guidance due to the reality of both continually evolving evidence and changing conditions, such as where tests or services were available. This content then required validation and the creation of a system for ongoing answer verification. As new information became available and was assessed and contextualized for our system, our team had to insert itself into existing information distribution channels to avoid creating failure points where content could fall out of sync.
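As a concrete companion to the categorization and frequency-ranking step described above, the short sketch below shows one way tagged encounters could be aggregated to surface a high-frequency question list. The record fields and sample data are hypothetical illustrations, not the actual Penn Medicine dataset or tooling.

```python
# Illustrative only: aggregate categorized patient inquiries by their main
# question and rank by frequency to build a candidate FAQ list.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Inquiry:                 # one categorized telephone encounter or portal message
    topic: str                 # e.g., "Infectious Diseases"
    sub_topic: str             # e.g., "Testing"
    main_question: str         # normalized question the inquiry maps to

def rank_questions(inquiries, top_n=97):
    """Return the top_n most frequent main questions with their counts."""
    counts = Counter(i.main_question for i in inquiries)
    return counts.most_common(top_n)

sample = [
    Inquiry("Infectious Diseases", "Symptoms", "What do I do if I have a fever?"),
    Inquiry("Logistics", "Testing", "How do I get tested?"),
    Inquiry("Logistics", "Testing", "How do I get tested?"),
]
print(rank_questions(sample, top_n=2))
# [('How do I get tested?', 2), ('What do I do if I have a fever?', 1)]
```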
Translating the information into clear answers for a consumer audience - and applying what was known to a broad range of contexts of interest to patients, such as whether aerosols staying suspended in the air mean it is not safe to go to the grocery store or to stand where an infected person might have coughed hours ago - was also important. We created a content team with fellows and attendings from infectious disease, oncology, occupational health, primary care, and operations leaders, along with a deliberate process for drafting and approving answers. The attempt to ensure consistency in providing rules-based, algorithmic answers revealed opportunities for policy alignment across our hospitals, from whether a spouse could go in with a patient having surgery to occupational health rules for our employees. Close engagement by our operations and clinical leadership teams enabled rapid resolution of discrepancies. Where differences were appropriate, we had to identify questions in the bot requiring divergent answers, triggering follow-up questions for clarity and branching once the person provided more information, such as whether they were in Lancaster or Philadelphia or whether they were a Penn Medicine employee. We identified two critical contexts and audiences who needed help securing fast, accurate answers to coronavirus-related questions: nurses on our phone lines seeking support when addressing patient questions, and patients interested in self-serve answers via the chatbot. The needs of these groups clearly differ. Nurses preferred to quickly access all the information on a topic, consider alternative answers, and select the response they found most appropriate. Thus, we created an internal search tool that queries the FAQ database and displays all relevant question-answer pairs ranked by relevance. This search interface eliminated the time and effort previously required to scroll through many pages hunting for relevant content. We incorporated input from the Penn Medicine Patient and Family Advisory Council in the bot testing and feedback process pre-launch and strove to provide a single best response to reduce ambiguity for patients. A central challenge in enabling the desired patient experience was addressing the reality that people phrase and frame questions in seemingly endless ways. Myriad word choices can represent the same intent. For instance, "Should I keep taking my Humira?"; "Are immunosuppressant medications safe?"; and "Does my lupus drug increase my risk of Covid-19?" all express a patient's need to know whether to alter their regimen of immunomodulating medications during the pandemic. Partnering with Google presented the opportunity to leverage natural language processing and machine learning to map diverse inquiries to common underlying intents. The Google team offered technical acumen, resources, and their Dialogflow technology, in which we could establish validated answers for high-volume questions and map the varying observed language representing each intended meaning to initiate training the bot, a process that could then be amplified and accelerated by Google's machine learning capability. For each question-answer pair, our teams collaborated to provide 10-15 training phrases consisting of alternate ways a patient might phrase a question driving at the same intent. After training the bot on these phrases, it was then able to recognize many more phrasings, driving selection of the appropriate response.
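For readers less familiar with Dialogflow, the sketch below shows, in simplified form, how a single FAQ intent with its training phrases and vetted answer can be registered through the Dialogflow ES Python client. It is a minimal sketch based on Google's published client library, not the team's actual pipeline; the function name, project ID, and example phrasings are illustrative.

```python
# pip install google-cloud-dialogflow
# Minimal sketch: register one FAQ as a Dialogflow ES intent with training
# phrases (alternate patient wordings) and a single vetted answer.
from google.cloud import dialogflow

def create_faq_intent(project_id, display_name, phrasings, answer):
    intents_client = dialogflow.IntentsClient()
    parent = dialogflow.AgentsClient.agent_path(project_id)

    training_phrases = [
        dialogflow.Intent.TrainingPhrase(
            parts=[dialogflow.Intent.TrainingPhrase.Part(text=p)]
        )
        for p in phrasings
    ]
    message = dialogflow.Intent.Message(
        text=dialogflow.Intent.Message.Text(text=[answer])
    )
    intent = dialogflow.Intent(
        display_name=display_name,
        training_phrases=training_phrases,
        messages=[message],
    )
    return intents_client.create_intent(request={"parent": parent, "intent": intent})

# Hypothetical usage with the immunosuppressant example from the text:
# create_faq_intent(
#     "my-gcp-project",
#     "faq.immunosuppressant_safety",
#     ["Should I keep taking my Humira?",
#      "Are immunosuppressant medications safe?",
#      "Does my lupus drug increase my risk of Covid-19?"],
#     "Vetted, clinician-approved answer goes here.",
# )
```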
Testing and Going Live: The urgency and importance of the crisis attracted support from across the Penn Medicine community, enabling a crowdsourced testing effort. In several rounds of internal testing leveraging medical students, faculty, non-clinical staff, and the Patient and Family Advisory Council, we collected more than 500 feedback submissions in three days, ranging from missing answers to clarifying edits and new ways people worded questions that the bot would need to learn. This input allowed us to quickly make improvements by altering and adding responses and by teaching the bot with training phrases based on how testers structured inquiries. Prior to launch, we established clear and measurable criteria for going live, based on industry standards for accuracy of answers, and then tested the bot on 150 patient-submitted questions from our patient portal. In the first round of testing the bot did not perform well enough, reaching an accuracy of 61%. On review, we discovered that this was due to a small subset of questions that were not yet included in the bot's training data, so we added the relevant content and training phrases. In the subsequent testing phase, the bot produced a sufficient response 75% of the time ("sufficient" defined as correctly answering questions to which it knew the answer and recognizing questions to which it did not, providing an appropriate fallback response), meeting our launch criteria. While we still had to set clear expectations with health system leaders that people would not receive their desired outcome in one-quarter of interactions, the combination of ongoing manual training based on patients' and clinicians' incoming questions and the machine learning algorithms supporting the bot would enable it to improve with use. Based on this, we decided to proceed with launching on our institution's Covid webpage on April 9, 2020, two weeks after our initial planning meeting with our industry partners. In addition to building a chatbot to provide accurate information to patients, we also equipped it with the ability to triage symptomatic patients to the appropriate level of care. Staying focused on the goal of offloading volume from phone lines and reducing clinician burden, we found in our analyses of incoming questions that it was quite common for people to report symptoms and seek direction on what to do. Patients would be offered the triage tool when first opening the bot and whenever they submitted a question suggesting they were experiencing symptoms. Similar symptom checkers already existed on several platforms, provided by startups and larger technology vendors. However, we found we needed both to customize our approach to appropriately triage patients to the avenues for receiving care at our institution, mirroring system protocols, and to have back-end control of the content and algorithm for rapid adjustments as the pandemic evolved (Figure 2). Attaining buy-in on the topic of automated triaging of patients required extensive dialogue with clinical leaders. Several key stakeholders initially expressed reluctance to triage algorithmically without an individualized clinical assessment performed by a trained provider.
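The pre-launch accuracy check described above lends itself to a simple batch evaluation. The sketch below is one way such a test could be run against a Dialogflow agent, sending each labeled patient question through detect_intent and counting matches; the label format, session ID, and fallback intent name are assumptions for illustration, not the team's actual evaluation harness.

```python
# Illustrative batch accuracy test against a Dialogflow ES agent.
from google.cloud import dialogflow

def measure_accuracy(project_id, session_id, labeled_questions,
                     fallback_intent="Default Fallback Intent"):
    """labeled_questions: list of (question_text, expected_intent_display_name).
    For questions the bot is not expected to know, use the fallback intent name
    as the expected label, mirroring the article's 'sufficient response' definition."""
    sessions_client = dialogflow.SessionsClient()
    session = sessions_client.session_path(project_id, session_id)
    correct = 0
    for question, expected in labeled_questions:
        query_input = dialogflow.QueryInput(
            text=dialogflow.TextInput(text=question, language_code="en-US")
        )
        result = sessions_client.detect_intent(
            request={"session": session, "query_input": query_input}
        ).query_result
        if result.intent.display_name == expected:
            correct += 1
    return correct / len(labeled_questions)

# e.g., measure_accuracy("my-gcp-project", "accuracy-test",
#                        [("how do i get my test results", "faq.test_results")])
```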
Valid concerns were raised, most notably about whether a bot could understand the nuances of conversation and communicate as well as a seasoned clinician, and about liability in the event of inappropriately triaged patients. The concern was also raised that many patients would likely prefer to speak directly to a clinician, so that option was always preserved. Still, most patients, about two-thirds of a sample of 2,400, agreed to proceed with the bot after the disclaimer message, and about 90% of those finished to an endpoint. Rather than train a bot to understand all the vagaries of a patient describing symptoms, we constructed an algorithm that mirrors what seasoned clinicians do: distill complicated situations down to their most crucial elements and make decisions based on those factors. While any number of additional data points can be useful to a clinical provider, we chose to focus on only those questions that would influence the disposition decision, and we gave users binary answer options. Our work started with a helpful meeting with Tim Judson, MD, MPH, from UCSF's Clinical Innovation Center. He, Ralph Gonzales, MD, MSPH (Associate Dean for Clinical Innovation and Chief Innovation Officer for UCSF Health), and their team had implemented an algorithm for triaging patients with Covid and influenza-like illness, deployed through their EHR-linked patient portal. Using their questions and disposition endpoints as a foundation, we convened a team of clinicians at our institution to edit the questions asked, the triage algorithm, and the four disposition endpoints to match our patient population, priorities, and capabilities. Additionally, we restructured the order in which questions were asked to prioritize appropriately triaging patients in as few questions as possible (algorithm shown in Appendix). For instance, take a 50-year-old male with unrelenting chest pain. Eventually a provider team will want to know whether this patient has a history of hypertension, but that information will not change the fact that he needs an emergent evaluation, so he can be triaged straight into the emergent evaluation category, no further questions asked. A bot cannot yet replicate the art of medicine, so we positioned the triage tool's role as expediting the process for patients to access the level of care they needed and prioritizing our human resources for where they could have the greatest impact. The context and magnitude of the Covid-19 outbreak provided a catalyst for exploring alternate methods to triage patients, particularly those who are lower risk or the worried well. As noted above, bots are not limited in how many patients they can serve simultaneously, nor are they susceptible to fatigue in the case of high volume. Regarding the risk of inappropriately triaging patients, involving clinical leaders in the algorithm's refinement was critical to assuaging concerns. The nurse leaders from our Covid hotline, who have a wealth of experience from live phone conversations, also provided valuable input on clarifying and simplifying language. We intentionally made the algorithm conservative, erring toward recommending either remote or in-person evaluation as opposed to self-care at home. We also included anticipatory guidance at each disposition endpoint, encouraging patients to seek care if they remained concerned and to repeat the tool or contact their providers if their symptoms change, worsen, or fail to improve.
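To make the structure of such a short-circuiting, binary-question triage flow concrete, here is a minimal sketch with four disposition endpoints. The specific questions, their order, and the thresholds are hypothetical placeholders, not Penn Medicine's published algorithm (which appears in the Appendix).

```python
# Illustrative sketch: ask the highest-acuity questions first so severe cases
# exit early, mirroring the reordering described in the text.
EMERGENT, URGENT, NON_URGENT, SELF_CARE = (
    "emergent", "urgent", "non-urgent", "self-care at home"
)

def triage(answers):
    """answers: dict of question_id -> bool (True = yes). Questions and order are hypothetical."""
    if answers.get("severe_chest_pain_or_trouble_breathing"):
        return EMERGENT
    if answers.get("fever") and answers.get("high_risk_condition"):
        return URGENT
    if answers.get("fever") or answers.get("new_cough"):
        return NON_URGENT
    # Conservative default; anticipatory guidance is delivered in the bot's copy.
    return SELF_CARE

# Example: a patient reporting unrelenting chest pain is routed to emergent
# evaluation without being asked anything further.
print(triage({"severe_chest_pain_or_trouble_breathing": True}))  # -> emergent
```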
While all stakeholders were comfortable with the triage algorithm for those placed into the highest and lowest severity groups (emergent and low risk for complications, respectively), achieving convergence about how best to handle the two middle groups (those requiring an urgent or non-urgent evaluation) was more challenging. For these groups, initial deployment included validation of the bot's triage decision with an intentionally redundant clinical evaluation by a triage nurse, enabling comparison of the decisions made by the bot and the triage staff. We agree with our colleagues who have called for methodical validation of facilitated self-service tools to assess safety and effectiveness.9 We've structured the hand-off from the bot to our clinicians to enable a follow-up analysis comparing the bot's determination to that of clinical judgment. Providing continued accurate, consistent information regarding Covid-19 requires frequent updating, given the steady evolution of information and policies. We developed two processes for managing updates: one for new information and one for ongoing review of previous content. For the first process, ongoing content generation occurred similarly to the initial content generation. New frequently asked questions were extracted from ongoing communications through our patient portal, news articles, CDC and WHO recommendations, and daily reports of frequently asked questions coming into Penn Medicine phone lines. Post-launch, we also used user-submitted questions to guide topics for new content generation. Medical students drafted answers, which were then reviewed by fellows and attendings. This system was organized into a twice-weekly review: fellows and attendings would update the project managers after review, and the project managers migrated verified FAQs to the production version of the Google bot. In addition to this process, there was a more immediate process for adding new operational updates. CDC and Penn Medicine-specific updates continue to be tracked in real time. The project manager then drafts and forwards a response for immediate review by the relevant reviewer, especially for expedited execution of logistics or operations content changes, such as when testing or safety protocols change. The chatbot automatically detects user phrasings that go unanswered and flags them for analysis. High-frequency unanswered questions can be analyzed in a rapidly evolving clinical situation to produce more question-answer pairs on a near real-time basis, and newly identified local, colloquial phrasings can be added to an existing intent (a group of phrasings with the same meaning) after deployment, allowing for better answers next time. The key questions to answer in assessing newly deployed technology typically ask whether people use it, whether they like or recommend it, and whether it moves the needle on the targeted outcome. In this case, we added metrics related to accuracy to ensure consistently high-quality responses. We also chart accuracy weekly to enable us to measure the degree of ongoing improvement. Usage can be assessed based on a number of within-bot metrics, including the number of users, the average number of questions asked per conversation, the most frequent responses given by the bot, and the percentage of fallback responses, where the bot did not recognize or know the answer to a question. Patients' experience with the bot can be measured by directly asking for feedback.
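As a concrete companion to the usage metrics just listed, the sketch below shows one way they could be computed from a simple conversation log. The log schema (user_id, session_id, matched_intent) and the fallback intent name are assumptions made for illustration, not the bot's actual analytics pipeline.

```python
# Illustrative computation of within-bot usage metrics from a conversation log.
from collections import Counter, defaultdict

FALLBACK = "Default Fallback Intent"  # assumed name of the fallback intent

def usage_metrics(log_rows):
    """log_rows: iterable of dicts with keys 'user_id', 'session_id', 'matched_intent'."""
    users = set()
    per_session = defaultdict(int)
    responses = Counter()
    fallbacks = total = 0
    for row in log_rows:
        users.add(row["user_id"])
        per_session[row["session_id"]] += 1
        responses[row["matched_intent"]] += 1
        total += 1
        if row["matched_intent"] == FALLBACK:
            fallbacks += 1
    return {
        "users": len(users),
        "avg_questions_per_conversation": total / max(len(per_session), 1),
        "top_responses": responses.most_common(5),
        "fallback_rate": fallbacks / max(total, 1),
    }
```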
We integrated a permanent feedback link into the user interface, visible to users at all times. We've also contemplated ending sessions by directly asking for feedback, which we have not yet deployed due to design considerations, including not wanting to bother users at a potentially stressful time. The call volume reaching our clinicians and the wait times affecting our patients were key outcome metrics motivating this project. As the pandemic unfolded, the health system was able to re-deploy clinicians to phone lines in a manner that handled incoming volume effectively, such that wait times never became an issue of concern. That being said, as we return staff to their original clinical roles, and in preparation for any forthcoming surge or crisis, we will continue to examine the bot's potential to offload volume. Accuracy remains of utmost importance in establishing credibility and earning leadership confidence for broader deployment of bot technology. We're able to automatically track the percentage of questions that do not result in an answer, and this improved steadily from 33.1% during initial pre-launch testing to 12.7% in the most recent week for which we have data (5/24-5/30/20), but that's only part of the equation. We also need to assess how often the answer produced does not address the question intended, which requires a qualitative assessment of logs recording questions asked and answers given. To round out the evaluation, the final critical piece will be the degree of concordance between the triage decisions made by the bot and those made by clinical personnel. Other important measures include the number and percentage of patients triaged to each disposition (emergent, urgent, non-urgent, self-care at home) and the outcomes of patients who call into our Covid hotline from the triage tool and can then be followed in our EHR. These are important not only for research and monitoring, but also for altering the triage algorithm as needed. The images of overwhelmed health care systems in Italy, Spain, and New York City have highlighted the importance of leveraging novel tools to reduce the demands on front-line clinical personnel. Bots have made substantial inroads in many industries due to the efficiencies that can be achieved but have not been as widely adopted within health care settings. Much of this relates to the culture of medicine - the notion that an in-person evaluation with a clinician is the best way to diagnose and treat a patient. However, as we are seeing with as much as 90% of outpatient visits becoming virtual in some systems, there are substantial shifts underway in how we collectively view what is ideal in terms of diagnosing and treating patients. Tools such as bots can also be adopted with low incremental effort. To facilitate wider-scale adoption, Verily and Google Cloud have collaborated to create an open-source Pathfinder virtual agent template, which is freely available to health systems and hospitals. This facilitates creation of chat or voice bots that answer questions about Covid-19 symptoms and provide the latest guidance from public health authorities like the Centers for Disease Control and Prevention and the World Health Organization.
Contact Center AI's Rapid Response Virtual Agent program is available around the world in any of the 23 languages supported by Dialogflow. Part of the appeal of a bot is that it is instantly available 24/7 to meet patient demands for support, and it poses a compelling, convenient option if in fact it gives clear, accurate answers. We have already seen the potential to standardize rigorously researched answers to eliminate concerns about variability in responses. Some of the questions bots can answer are mundane logistical questions, but many of these are highly important to patients, such as the location of parking or how to get test results. Others are clinical questions where self-service customization is possible due to available technology capabilities. Deployment of bots during this crisis may be done to prevent clinical and clerical staff from being overwhelmed by inbound call volume; in time, however, health systems may also find it helpful either to serve larger populations of patients without adding staff or to reallocate staff to roles of providing clinical care best done by humans while letting bots handle the algorithmic tasks. Some of this may be uncomfortable. Medicine has always prided itself on being a people business where direct contact between humans was paramount. Perhaps because of this, the technology-enabled savings in costs that have characterized most other industries have largely evaded medicine, and health care costs in the U.S. have risen well above the level of affordability for most American families. Bots can make information available at a scale well beyond telemedicine approaches due to automation, and this can provide support for people who cannot afford care, who can't communicate well in English, or who prefer anonymity. These approaches can be developed to tie to institution-specific resources, and that will enhance customer service by providing the best answer to patients whenever they want it, 24/7, without having to wait and without variability based on who answers the phone.

References:
1. The Return of the Chatbots
2. How Software Developers Mitigate Collaboration Friction with Chatbots
3. Survey on Chatbot Design Techniques in Speech Conversation Systems
4. Chatbot: efficient and utility-based platform
5. Why people use chatbots
6. Evaluating quality of chatbots and intelligent conversational agents
7. A framework for understanding chatbots and their future
8. Real conversations with artificial intelligence: A comparison between human-human online conversations and human-chatbot conversations
9. Toward Facilitated Self-Service in Health Care

The authors wish to acknowledge the following for their assistance: