key: cord-0732631-o3bsyjvx
title: Automating the Classification of Complexity of Medical Decision-Making in Patient-Provider Messaging in a Patient Portal
authors: Sulieman, Lina; Robinson, Jamie R.; Jackson, Gretchen P.
date: 2020-06-19
journal: J Surg Res
DOI: 10.1016/j.jss.2020.05.039
sha: 361c2382ba1b412373db36a9fa0ffbe25c74031c
doc_id: 732631
cord_uid: o3bsyjvx

BACKGROUND: Patient portals are consumer health applications that allow patients to view their health information. Portals facilitate interactions between patients and their caregivers by offering secure messaging. Patients communicate different needs through portal messages. Medical needs are requests for the delivery of care (e.g., reporting new symptoms). Automating the classification of the complexity of medical decision-making in portal messages has not been investigated.

MATERIALS AND METHODS: We trained two multiclass classifiers, multinomial Naïve Bayes and random forest, on 500 message threads to quantify and label the complexity of decision-making in four classes: no decision, straightforward, low, and moderate. We compared the performance of these models to a baseline that used only the number of medical terms, without training a machine learning model.

RESULTS: Our analysis demonstrated that the machine learning models performed better than the model that did not use machine learning. Moreover, the machine learning models could quantify the complexity of decision-making in the messages with 0.59, 0.45, and 0.58 for macro, micro, and weighted precision and 0.63, 0.41, and 0.63 for macro, micro, and weighted recall.

CONCLUSIONS: This study is one of the first to attempt to classify patient portal messages by whether they involve medical decision-making and the complexity of that decision-making. Machine learning classifiers trained on message content resulted in better message thread classification than classifiers that employed the medical terms in the messages alone.

Patient portals are secure online applications that allow healthcare organizations to provide patients and their caregivers access to health information including medications, immunizations, and appointments. 1-3 Many patient portals offer a secure messaging function that enables patients to interact with providers through messages. 3,4 Secure messaging is one of the most popular features of patient portals, messaging volumes are growing exponentially, and one study showed that surgery was second only to medicine in the number of messages exchanged. 5-7 As the use of messaging increases, techniques to automate the analysis of messages may be critical to assist with triage, message answering, or quantifying the care delivered through patient portals.

Research on portal messages has mainly focused on qualitative analyses of content, with few studies investigating automated classification. 8,9 North et al. analyzed messages exchanged between providers and patients at a primary care clinic in an academic medical center and found that 3.5% of messages included potentially high-risk symptoms. Jackson et al. have developed and validated a taxonomy of consumer health communications and have applied it to questions from patients and caregivers in research, inpatient, and patient portal messages. 9-15 The taxonomy comprehensively describes the semantic types of consumer health communications: informational, medical, logistical, social, and other.
It has been employed to characterize the content of consumer health questions (i.e., needs) as well as the answers to those questions. Informational needs are questions that require clinical knowledge, such as information about the side effects of a drug. Medical needs are requests for delivery of medical care, such as the reporting of new symptoms that require management. Logistical needs are questions involving pragmatic issues, such as the phone number of a clinic. Social needs are interpersonal communications such as emotional concerns, expressions of gratitude, or complaints. The other category covers content that does not fit into these four categories (e.g., questions that transcend categories like "how do I be a good father" or error messages). Cronin et al. investigated the use of machine learning to classify the content of portal messages. 15 Sulieman et al. trained convolutional neural networks and standard machine learning algorithms (e.g., random forest) to identify the types of needs in portal messages. 16

Portal messages that include medical needs (e.g., time-sensitive clinical questions or information reflecting changes in patient status) are of particular importance. 17 One analysis of the content of 3253 patient portal messages from a large academic medical center showed that 72% included medical needs. 15 Answering those types of messages might involve medical decision-making, such as changing a drug or ordering a test. As messaging volumes grow, the identification of messages that require medical decision-making may be essential. 6,7,18

In this study, we trained machine learning classifiers to identify patient portal message threads that involved clinical decision-making and to classify the complexity of the decision as no decision, straightforward, low, or moderate. This study specifically focused on portal messages exchanged between surgical patients and surgeons because the majority of research on patient portals has been done in medicine and primary care. We investigated the effectiveness of machine learning by comparing the performance of our models to that of a medical term extraction tool, Clamp. If effective, such automated message analysis might quantify the care delivered online or support billing for online care.

We conducted the study at Vanderbilt University Medical Center, a private nonprofit institution with 137 outpatient locations and over two million patient visits annually. In 2004, Vanderbilt University Medical Center launched My Health At Vanderbilt (MHAV), a patient portal that offers common portal functions such as access to portions of the electronic health record, appointment scheduling, and tailored clinical information. Secure messaging is one of the most commonly used features of MHAV; on average, patients send over 30,000 messages each month. Clinical care teams including administrative assistants, nurses, and physicians typically manage the messages. Clinicians can answer messages directly or delegate message answering to nurses and medical assistants. This study employed MHAV message threads (i.e., sets of messages exchanged between patients and surgical providers) and annotated them with communication categories from the consumer health taxonomy as well as the complexity of medical decision-making in the exchange.
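To make the downstream examples concrete, the annotated corpus can be thought of as records pairing each thread's text with its taxonomy categories and decision-complexity label. The sketch below shows one plausible in-memory representation; the field names and the example thread are hypothetical illustrations, not the study's actual schema or data.

```python
# A minimal, hypothetical representation of an annotated message thread;
# field names and the example content are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class MessageThread:
    thread_id: str
    messages: List[str]            # patient and provider messages, in order
    taxonomy_labels: List[str]     # e.g. ["medical", "logistical"]
    mdm_complexity: str            # "no decision" | "straightforward" | "low" | "moderate"

    @property
    def text(self) -> str:
        """Concatenated thread text used for feature extraction."""
        return " ".join(self.messages)

example = MessageThread(
    thread_id="t-001",
    messages=["My incision is red and I have a fever of 101.",
              "Please start the antibiotic we prescribed and come in tomorrow."],
    taxonomy_labels=["medical"],
    mdm_complexity="low",
)
```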
Two researchers (who were both surgeons) independently labeled 500 message threads with taxonomy categories and complexity of medical decision-making and discussed all disagreements to achieve consensus; details of the data set creation are published elsewhere. 9 The complexity of medical decision-making is one of the outpatient billing elements in the guidelines defined by the Centers for Medicare and Medicaid Services Evaluation and Management coding criteria. 19,20 The complexity of medical decision-making is quantified based on three factors: the amount of data reviewed, diagnoses, and risk, summarized in Table 1. 9 This data set did not contain any message threads with a high level of medical decision complexity. We trained different classifiers to assign one of four labels to each thread: "no decision" for messages that did not involve decision-making, straightforward, low, and moderate.

To assign labels, we extracted three text features from the message threads (a sketch follows this list):

1. Bag of words: We extracted the words from each message thread after removing stop words and nonalphabetical characters and represented each thread as a vector of word counts.
2. TF-IDF: TF-IDF is a scoring system that assigns weights to words based on their frequency in a document relative to all documents in the data set. 21 TF-IDF weighting emphasizes terms that are frequent within a thread while downweighting terms that are common across the corpus. We used Sklearn to transform the bag of words vectors into TF-IDF vectors.
3. Medical terms: We used Clamp (version 1.5.0) to extract medical terms from the message threads and used them as features. 22 We represented each message by the medical terms included in that thread.

We applied three different algorithms to predict the complexity of medical decision-making in the message:

1. Using medical terms only: We established a baseline by using only the number of medical terms in the message threads to assign the complexity label, as follows:
   i. "No decision" threads included one or no medical terms.
   ii. "Straightforward" decision threads included two or three medical terms.
   iii. "Low" decision threads included four medical terms.
   iv. "Moderate" decision threads included more than five medical terms.
2. Multinomial Naïve Bayes: We trained a multiclass multinomial Naïve Bayes model to predict the decision class for each thread using the three text features: bag of words, TF-IDF, and medical terms.
3. Random forest: We trained a multiclass random forest model to predict the decision label of each thread.

We split the data set into two sets: 90% for training and validation and 10% for testing. We used 10-fold cross-validation on the training-validation data set to tune the parameters and select the parameter set of the optimal model. Table 2 lists the parameter space that we searched to identify the parameters of the optimal model. We defined the optimal model as the model that had the highest evaluation metric on the validation set. Because this problem involved multiclass labeling (i.e., more than two classes), we selected two evaluation metrics to identify the optimal model: micro precision and micro recall. Micro-averaged metrics aggregate the true positives, false positives, and false negatives across all classes, so every sample counts equally, which accounts for class imbalance. In our analysis, we focused on precision and recall (rather than area under the curve) because we wanted to evaluate the model for clinical implementation and thus wanted to quantify true and false positives and negatives precisely.
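As a concrete illustration of the feature extraction and the medical-term-count baseline described above, the following sketch uses scikit-learn for the bag of words and TF-IDF vectors. The example threads are hypothetical, and the baseline consumes a precomputed term count standing in for Clamp output; note that the paper's cutoffs assign "low" to exactly four terms and "moderate" to more than five, so five-term threads are folded into "moderate" here as an assumption.

```python
# A sketch of the three text features and the rule-based baseline, assuming
# thread texts are available as strings; example threads are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def baseline_label(n_medical_terms: int) -> str:
    """Map a thread's medical-term count (e.g. from Clamp) to a complexity
    label using the paper's cutoffs; five-term threads are folded into
    'moderate' here, an assumption not stated in the paper."""
    if n_medical_terms <= 1:
        return "no decision"
    if n_medical_terms <= 3:
        return "straightforward"
    if n_medical_terms == 4:
        return "low"
    return "moderate"

threads = [
    "my incision is red and draining should I come in",   # hypothetical
    "thank you so much for the quick reply",
]

# Feature 1 - bag of words: stop words removed, alphabetical tokens only.
bow = CountVectorizer(stop_words="english", token_pattern=r"[a-zA-Z]+")
X_bow = bow.fit_transform(threads)

# Feature 2 - TF-IDF: the same tokenization, with counts reweighted so
# corpus-wide common words are downweighted and distinctive words emphasized.
tfidf = TfidfVectorizer(stop_words="english", token_pattern=r"[a-zA-Z]+")
X_tfidf = tfidf.fit_transform(threads)

print(baseline_label(4))   # -> "low"
```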
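The 90/10 split and the 10-fold cross-validated tuning might then look like the sketch below. The hyperparameter grids are illustrative stand-ins for the search space in Table 2, and `X` and `y` are assumed to be the feature matrix and complexity labels from the feature-extraction step.

```python
# A sketch of the evaluation protocol: 90% train/validation, 10% test, with
# 10-fold grid search tuned on micro precision (or micro recall). The grids
# below are illustrative assumptions, not the paper's actual Table 2.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)   # X, y assumed from feature extraction

candidates = {
    "multinomial_nb": (MultinomialNB(), {"alpha": [0.01, 0.1, 0.5, 1.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
    ),
}

best_models = {}
for name, (estimator, grid) in candidates.items():
    # swap scoring="recall_micro" to tune on micro recall instead
    search = GridSearchCV(estimator, grid, cv=10, scoring="precision_micro")
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
```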
False negatives represent message threads that were assigned to a different decision label, in some cases no decision. Mislabeling a thread as no decision can carry a higher penalty than mislabeling its complexity as a different level of complexity. Hence, we analyzed the percentage of threads that each classifier mislabeled as "no decision" along with the other mislabeling percentages. We calculated the precision for each individual label; the macro precision; the micro precision, computed from the total true positives, false negatives, and false positives; and the weighted precision, computed as the average of the per-class precisions weighted by their prevalence in the evaluated data set. We calculated the same metrics for recall. Finally, we created a confusion matrix for each model to identify the percentage of mislabeled message threads for each class.

Tables 3 and 4 show the precision and recall performance metrics for the optimal machine learning models, and Table 5 lists the parameters of the optimal models. Table 3 summarizes the precision values for the optimal models that we trained on the different text features and tuned using precision. Table 4 summarizes the recall values for the optimal models trained on bag of words, TF-IDF, and medical terms.

Using medical terms only, without machine learning, demonstrated generally poor performance. Predicting the type of decision-making using only the number of medical terms had a precision of 0.67 for low complexity, a precision of 0.04 for moderate complexity, and a recall of 1.0 for moderate complexity. Moreover, using the number of medical terms did not identify any of the threads that did not involve a decision. Applying any machine learning model to the medical terms yielded higher precision for messages that did not require decision-making (0.50 to 0.67) and for straightforward decision-making (0.49 to 0.67) than the baseline without machine learning, which had precision values of 0 and 0.4 for "no decision" and straightforward, respectively. We observed similar results for recall. Machine learning models had higher recall values than the medical-term-count baseline, by 0.5 to 0.62 for threads that did not involve any decision-making and by 0.31 to 0.71 for threads that involved straightforward decision-making. Moreover, applying machine learning classification models to the medical terms yielded higher micro, macro, and weighted values for both precision and recall. The micro, macro, and weighted average precision values for the method using only the number of medical terms were 0.14, 0.28, and 0.31, respectively. The micro, macro, and weighted recall values for the same model were 0.14, 0.35, and 0.14.

Applying machine learning models improved the identification of threads that involved decision-making. Among the models that we tuned using precision for classifying decision-making complexity, the optimal multinomial Naïve Bayes model had precision values higher than the optimal random forest model by 0.12, 0.14, and 0.62 when the two models were trained on TF-IDF, bag of words, and medical terms, respectively. As shown in Table 3, the precision for detecting messages that did not require decision-making varied across the models. Neither the multinomial Naïve Bayes nor the random forest model identified the one thread that contained moderate decision-making.
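Computing the per-class, macro, micro, and weighted metrics and the confusion matrix described above is direct with scikit-learn; the sketch below continues the earlier one and assumes `best_models`, `X_test`, and `y_test` exist.

```python
# Per-class, macro, micro, and weighted precision/recall plus the confusion
# matrix for a tuned model; continues the earlier sketch.
from sklearn.metrics import precision_score, recall_score, confusion_matrix

LABELS = ["no decision", "straightforward", "low", "moderate"]
y_pred = best_models["multinomial_nb"].predict(X_test)

for avg in ("macro", "micro", "weighted"):
    p = precision_score(y_test, y_pred, average=avg, zero_division=0)
    r = recall_score(y_test, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8} precision={p:.2f} recall={r:.2f}")

# Per-class values: average=None returns one score per label, in LABELS order.
per_class_precision = precision_score(
    y_test, y_pred, average=None, labels=LABELS, zero_division=0)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=LABELS)
```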
For the multinomial Naïve Bayes model, training on medical terms yielded macro average and weighted average precision of 0.45 and 0.58, respectively, which were higher than the corresponding metrics for the same model trained on bag of words and TF-IDF. Training multinomial Naïve Bayes on TF-IDF yielded a micro precision of 0.58, which was the highest overall and higher than the values for the multinomial Naïve Bayes models trained on the other two features.

Tuning the models on recall produced slightly different results. Multinomial Naïve Bayes had the highest recall overall for identifying threads that did not require decision-making, with a recall of 0.62, which was 0.12 higher than the recall yielded by random forest. Training the multinomial Naïve Bayes model to identify threads that required low complexity decision-making yielded the highest recall value, 0.60. Random forest trained on bag of words and TF-IDF had higher recall for threads with straightforward decision complexity than multinomial Naïve Bayes models trained on the same text features. Moreover, the random forest model trained on bag of words had the highest micro, macro, and weighted recall values.

The ability to identify the complexity of decision-making in a thread also depended on the text features used to train the models. Training the models on bag of words yielded the highest precision, 0.67, and the highest recall values, 0.62 and 0.88, for identifying the "no decision" and straightforward classes. Using medical terms to identify threads that contained low and moderate complexity decision-making yielded the highest precision values (0.67 and 0.04) and the highest recall values (0.6 and 1). For the aggregated metrics, training the models on TF-IDF yielded the highest micro precision, whereas training the models on medical terms yielded the highest macro and weighted precision. Training the models on bag of words yielded the highest macro, micro, and weighted recall.

Figures 1 and 2 show the confusion matrices for the models tuned on precision and recall, respectively. The confusion matrices specify the rates of correct classifications and of misclassifications with respect to the other classes. Each row represents the true positives and false negatives for the corresponding class. For example, the second row details the classification of threads that involved straightforward decision-making; its first, third, and fourth columns represent the straightforward threads that the model classified as no decision, low complexity, and moderate complexity. The cells on the diagonal represent the messages that were classified correctly, also known as the true positive rate or recall. Darker colors correspond to higher values.

The confusion matrices for the models tuned on precision are depicted in Figure 1A-F. When we tuned the models using precision, training multinomial Naïve Bayes on bag of words or TF-IDF yielded 0.62, which was the highest recall. The recall for message threads with straightforward decision-making was highest when we trained random forest, ranging between 0.79 and 1.0; the random forest model trained on medical terms achieved complete identification of straightforward complexity decision-making. The multinomial Naïve Bayes classifier trained on medical terms correctly identified 0.38 of the threads requiring low complexity decision-making, which was the highest recall among all models. All models misclassified the message threads that contained moderate complexity decision-making.
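A row-normalized confusion matrix of the kind described above (diagonal cells equal to class recall, darker cells for higher rates) can be produced as in the sketch below; `cm` and `LABELS` continue the earlier sketch, and the rendering choices are assumptions, not the paper's actual plotting code.

```python
# Row-normalize the confusion matrix so each diagonal cell equals that
# class's recall, then plot with darker cells for higher rates.
import numpy as np
import matplotlib.pyplot as plt

row_sums = cm.sum(axis=1, keepdims=True)
cm_norm = cm / np.maximum(row_sums, 1)       # guard against empty classes

fig, ax = plt.subplots()
ax.imshow(cm_norm, cmap="Blues")             # darker = higher rate
ax.set_xticks(range(len(LABELS)))
ax.set_xticklabels(LABELS, rotation=45, ha="right")
ax.set_yticks(range(len(LABELS)))
ax.set_yticklabels(LABELS)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
for i in range(len(LABELS)):
    for j in range(len(LABELS)):
        ax.text(j, i, f"{cm_norm[i, j]:.2f}", ha="center", va="center")
plt.tight_layout()
plt.show()

# Reading off misclassification patterns: for each true class, report the
# most common wrong prediction and its rate.
for i, true_label in enumerate(LABELS):
    off_diag = cm_norm[i].copy()
    off_diag[i] = 0.0
    j = int(np.argmax(off_diag))
    print(f"{true_label}: most often mislabeled as {LABELS[j]} ({off_diag[j]:.2f})")
```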
For all the models trained on the three text features, the threads that did not contain any decision-making were most commonly mislabeled as straightforward, at rates between 0.19 and 0.62; random forest trained on medical terms mislabeled all of them as straightforward (rate of labeling "no decision" as straightforward = 1; see the confusion matrix in Fig. 1F, second cell in the first row). The straightforward message threads were typically mislabeled as no decision, with misclassification rates ranging between 0.12 and 0.17 in all models except random forest trained on medical terms. The threads in the low complexity class were mainly mislabeled as straightforward, at rates of 0.5 to 0.75, except for random forest trained on medical terms, which mislabeled all low complexity threads as straightforward (confusion matrix in Fig. 1F, third cell in the third row).

Figure 2A-C depicts the confusion matrices for multinomial Naïve Bayes trained on bag of words, TF-IDF, and medical terms, and Figure 2D-F depicts the confusion matrices for random forest trained on bag of words, TF-IDF, and medical terms, respectively. The correct classifications and misclassifications were slightly different for the models tuned on recall (Table 4). The highest recall values among all models were 0.62 for the "no decision" class, 1 for straightforward, and 0.38 for low complexity, achieved when we trained multinomial Naïve Bayes on bag of words or TF-IDF, random forest on medical terms, and multinomial Naïve Bayes on medical terms, respectively. The no decision message threads were mainly misclassified as straightforward, with 0.19 as the lowest misclassification rate (multinomial Naïve Bayes trained on bag of words) and 1.0 as the highest (random forest trained on medical terms). The message threads that contained straightforward decisions had higher misclassification rates when we trained random forest and multinomial Naïve Bayes on bag of words, with rates ranging between 0.08 and 0.21, whereas training the models on TF-IDF resulted in misclassifying straightforward message threads as "no decision" at rates of 0.17 and 0.21 for multinomial Naïve Bayes and random forest, respectively. The threads in the low complexity class were misclassified as straightforward at a rate of 0.50 for most of the models.

This study is one of the first attempts to automatically classify patient portal message threads exchanged between surgeons and patients based on the complexity of medical decision-making within the message exchanges. To our knowledge, it is the first study to implement machine learning models to identify message threads that involved medical decision-making by a healthcare provider and to classify the complexity of those decisions. It is well established that medical coding criteria are applied inconsistently in practice, 23,24 and the annotation of the data set for this study required careful analysis and discussion to achieve consensus and create a high-quality gold standard. Our analysis shows that using tools that only extract medical terms, such as KnowledgeMap, cTAKES, or Clamp, was not effective for quantifying medical decision-making. 22,25,26 Machine learning models improved the classification of patient portal message threads based on the complexity of medical decision-making.
Automating the classification of individual patient messages may aid in triaging those messages that need the attention of a healthcare provider who can respond and deliver the appropriate care. Analyzing the content of message threads also has the potential to support automated billing for online encounters. However, to realize these applications, the performance of these classifiers will need to be improved. This manuscript provides some initial evidence on which approaches may be most effective.

To evaluate the effectiveness of implementing the proposed classifier in clinical settings, we focused on precision and recall metrics. Obtaining a precise model to identify message threads that do not involve medical needs or decision-making could aid in triage to administrative assistants or allied health professionals. Our analysis demonstrated that machine learning models could accurately identify the message threads that do not involve medical decision-making or that contain straightforward decisions of minimal complexity. Such messages might be triaged to administrative assistants, nurses, or allied health professionals, allowing physicians to focus more time on messages requiring more complex medical decisions. Developing a machine learning model with high recall could support the identification of threads with higher complexity medical decision-making, which could be valuable to healthcare administrators in quantifying the care being delivered by providers online and potentially supporting automated coding of online outpatient encounters, should reimbursement be supported.

Message threads that involve straightforward to moderate decision-making can include new symptoms or clinical problems that are then managed. For instance, one message thread with low complexity involved a patient reporting a lack of sleep because of muscle and joint aches despite taking trazodone. Another message thread with straightforward complexity included a request from a patient for a referral to a dietitian after experiencing digestive problems. Our analysis demonstrated that machine learning classifiers could identify the message threads that do not contain decision-making, which could facilitate appropriate triage. Moreover, our machine learning models were able to identify threads that involved straightforward and low complexity decision-making with recall higher than 0.60 and weighted recall for all classes higher than 0.55.

Although payers do not yet reimburse for delivering care through patient portals, there are various benefits to identifying message threads that involve care delivery. One important use might be quantifying the care delivered online by various types of clinical providers to plan for appropriate staffing. Documenting the volumes of care delivered online might also support the case for reimbursement of such care. Managing low complexity issues through patient portal messaging can benefit patients, providers, and healthcare organizations. Providing online care can save patients time and money by reducing the number of unnecessary visits to clinics or hospitals, which can be a burden if the patient lives far from a medical center, if appointments are canceled, or if healthcare systems transition to telehealth because of a pandemic such as COVID-19. For surgeons, online postoperative care is particularly advantageous for procedures that typically have an uncomplicated course.
Further, managing low complexity care online can make clinic appointments available for higher complexity medical needs, and this availability benefits the medical center by allowing it to use its resources most effectively.

Our study has limitations. First, we used portal message threads from a single academic medical center that used a locally developed patient portal, and our results may not translate to other settings. Our center has transitioned to a popular commercial patient portal, so future analyses may provide more generalizable results. Second, our data set is small and only contains message threads initiated by patients and sent to surgical providers. Although the message threads were sent to a wide variety of surgical specialties, the language used in portal messages about surgical disease might be significantly different from the content of portal messages involving other specialties. The manual annotation process is laborious, making the creation of large labeled data sets challenging. We are implementing a semisupervised machine learning model that can leverage this data set to quantify the care delivered in other threads that have not been annotated or labeled. In our future work, when we have a larger annotated data set, we can expand our features, for example by combining TF-IDF and bag of words or by using word embeddings and deep learning classifiers (e.g., a convolutional neural network); feature expansion with the existing data would risk overfitting. Third, we did not extract the lay terms that patients or caregivers might use in portal messages. In the message threads we analyzed, lay language was often restated in medical terms in the provider responses. Identifying lay or slang language in consumer health messages is an active topic in medical informatics and natural language processing and is an area of future research for our team. Finally, we tried only three methods because of the data set size. Training and evaluating other machine learning models, such as convolutional neural networks, is another possible approach to improve classification performance when we have a larger data set.

Patient portals are popular consumer health applications that allow patients and their caregivers to interact with providers using secure messaging. The adoption of secure messaging is increasing, and studies have shown that medical care of varying complexity is delivered through patient portal message threads exchanged with surgical providers. This study is one of the first to attempt to classify patient portal messages by whether they involve medical decision-making and the complexity of that decision-making. Machine learning models that analyzed message content resulted in better message thread classification than classifiers that employed the medical terms in the messages alone. Further research is needed to improve the performance of these classifiers to support triage of portal messages or quantification of online care, to inform staffing needs, or to support reimbursement for online care.

The main author (third author) works at IBM Watson Health. Authors' contributions: the first author performed the analysis and wrote the paper; the second author prepared the data set and edited the paper; the third author supervised the analysis and edited the paper. The first and second authors have nothing to disclose.
References

1. What is a patient portal?
2. Characteristics of patient portals developed in the context of health information exchanges: early policy effects of incentives in the meaningful use program in the United States.
3. MyHealthAtVanderbilt: policies and procedures governing patient portal functionality.
4. What is a patient portal?
5. The effect of patient portals on quality outcomes and its implications to meaningful use: a systematic review.
6. Growth of secure messaging through a patient portal as a form of outpatient interaction across clinical specialties.
7. Rapid growth in surgeons' use of secure messaging in a patient portal.
8. Tulledge-Scheitel SM. Patient-generated secure messages and eVisits on a patient portal: are patients at risk?
9. Complexity of medical decision-making in care provided by surgeons through patient portals.
10. A comparison of rule-based and machine learning approaches for classifying patient portal messages.
11. Application of a consumer health information needs taxonomy to questions in maternal-fetal care.
12. Common consumer health-related needs in the pediatric hospital setting: lessons from an engagement consultation service.
13. Consumer health-related needs of pregnant women and their caregivers.
14. A technology-based patient and family engagement consult service for the pediatric hospital setting.
15. Automated classification of consumer health information needs in patient portal messages.
16. Classifying patient portal messages using Convolutional Neural Networks.
17. Tulledge-Scheitel SM. Patient-generated secure messages and eVisits on a patient portal: are patients at risk?
18. Adoption of secure messaging in a patient portal across pediatric specialties.
19. Fee-for-Service-Payment/PhysicianFeeSched/Evaluationand-Management-Visits.html
20. A case-based approach to outpatient evaluation and management service coding.
21. Using tf-idf to determine word relevance in document queries.
22. Clamp - a toolkit for efficiently building customized clinical natural language processing pipelines.
23. Achieving coding consistency. http://library.ahima.org/doc?oid=101092#.Xo4eO4hKg2w
24. Coding and reimbursement for weight loss surgery: best practice recommendations.
25. The KnowledgeMap project: development of a concept-based medical school curriculum database.
26. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES).