key: cord-0451884-ya56t1zt authors: Ozyegen, Ozan; Kabe, Devika; Cevik, Mucahit title: Word-level Text Highlighting of Medical Texts for Telehealth Services date: 2021-05-21 journal: nan DOI: nan sha: caf5c4e1fd3d038f0d3b18ec5c5e630f4108491d doc_id: 451884 cord_uid: ya56t1zt The medical domain is often subject to information overload. The digitization of healthcare, constant updates to online medical repositories, and the increasing availability of biomedical datasets make it challenging to analyze the data effectively. This creates additional work for medical professionals, who are heavily dependent on medical data to complete their research and consult their patients. This paper aims to show how different text highlighting techniques can capture relevant medical context. This would reduce doctors' cognitive load and response time to patients by helping them make faster decisions, thus improving the overall quality of online medical services. Three word-level text highlighting methodologies are implemented and evaluated. The first method uses TF-IDF scores directly to highlight important parts of the text. The second method combines TF-IDF scores with Local Interpretable Model-Agnostic Explanations applied to classification models. The third method uses neural networks directly to predict whether or not a word should be highlighted. The results of our experiments show that the neural network approach successfully highlights medically relevant terms, and that its performance improves as the size of the input segment increases. Since the start of the 20th century, people have been using telecommunication technologies to access health care services. This preference for remote communication over in-person visits has grown gradually in popularity over the years. More recently, as a result of the COVID-19 pandemic, the demand for these services has increased significantly.
This is especially the case with the current, ongoing lockdowns, as many people find themselves staying at home and using telehealth services to seek medical advice. The increasing demand for telehealth services presents opportunities for innovation in this area. Information overload in the health care sector is a growing problem. Medical professionals and researchers rely on information from many resources, such as clinical notes and scientific literature, to accomplish their work. It is crucial for them to have quick and efficient access to up-to-date information. The information overload that these professionals face is mainly due to the enormous amount of available unfiltered information. As a result, more time is spent searching for medical data and filtering out what is not important. This problem manifests itself in telehealth services as well. Medical professionals need to read a great deal of information about the patients who are seeking medical advice. Furthermore, when doctors interact with multiple patients, they must keep track of each conversation. Automatic text highlighting of medically relevant information has the potential to significantly improve the effectiveness and efficiency of healthcare practitioners, since physicians would spend less time examining the full text and focus only on key words and segments [8]. Many other domains have benefited from automated text highlighting. Ramírez et al. [22] found that text highlighting can reduce decision time by almost half in crowdsourcing tasks. Researchers in information management and psychology have shown that text highlighting can improve reading time [34].
Previous studies have also examined the benefits of text highlighting in other areas, such as supporting workers in the digitization of tasks by highlighting key fields [3], requesting highlights as evidence and reasoning to support judgements [29], recommending text passages to facilitate the job of text annotators [33], explaining the output of different machine learning models [21], and highlighting key terms in medical documents [8]. Biomedical informatics presents certain challenges for natural language processing techniques. For instance, the text categorization literature on electronic health records is limited. This is most likely due to the lack of training and test data consisting of high-quality, labeled, full-length documents [4]. Moreover, the contents of electronic health record systems are largely composed of clinical notes in the form of unstructured, unclassified texts [17]. To address these challenges, large biomedical datasets and other resources have been made available to researchers. These resources contain a wide range of texts from medical databases, such as clinical reports and full papers published in medical journals. With the increased amount of available data, researchers can now train large machine learning models for summarization in the medical domain [17]. With this amount of data at the disposal of medical researchers and physicians, text highlighting and document summarization have become increasingly popular in the medical domain. One can utilize this information to study diagnoses and common symptoms in patients without having to examine full documents and transcripts of conversations between patients and doctors. Knowledge repositories such as the Unified Medical Language System (UMLS) [6] contain numerous medical terms, and can be exploited in several ways by summarization engines.
The Metathesaurus forms the base of the UMLS and comprises over 1 million biomedical concepts and 5 million concept names, all of which stem from over 100 incorporated controlled vocabularies and classification systems. In this study, we mainly focus on text highlighting and extractive summarization for an online chat service connecting patients and doctors. This service allows patients to contact board-certified medical doctors and communicate their symptoms and/or any health concerns. It is a novel way to provide access to quality healthcare without the need to go to a hospital or walk-in clinic, similar to other telehealth services. It also reduces the barriers that people face when trying to receive healthcare, including taking time off work and incurring high medical costs. In the context of the COVID-19 pandemic, increased access to medical services without in-person visits might save important resources, and has the potential to reduce infections considerably. In an online medical chat service, doctors receive online messages from patients experiencing health issues and give recommendations. Since the realm of medicine is so extensive, it would be beneficial for doctors to read the important specifics of the problem rather than the full text. In addition, this ensures that doctors do not miss useful pieces of information hidden inside a large text. The primary goal of this paper is to develop a mechanism that takes the messages from patients and highlights important words and short segments to facilitate doctors in their responses. This will make the overall job easier for doctors, and speed up the time it takes to read and respond to patient queries. We investigate three different approaches for text highlighting of medical texts. The first uses TF-IDF scores directly to find important parts of the text. The second is a combination of TF-IDF scores and Local Interpretable Model-Agnostic Explanations (LIME).
The latter is a method used to interpret models after they make their classifications [19], and the former reflects how important a word is to a given document. The final approach uses neural network models directly to decide whether each word should be highlighted. The main contributions of this study can be summarized as follows:
• To the best of our knowledge, this is the first study on word-level text highlighting of medical texts. We establish a strong baseline and open-source one of our datasets to encourage further work in this area.
• We propose two novel word-level text highlighting methods for medical texts. The methodologies we present show how to leverage three different sources of information: large medical term corpora, existing relevant metadata, and manually labelled samples.
• We evaluate five text highlighting models trained using three different approaches on two medical text datasets to provide a detailed numerical study that shows the effectiveness of the proposed methods. This way, our study also provides evidence on the effectiveness of standard approaches for medical text highlighting.
The rest of this paper is organized as follows. In Section 2, we provide a brief literature review of existing text highlighting methods. In Section 3, we list the different datasets used in our analysis and their specifications. In Section 4, we explain the different text highlighting methodologies we employed. We provide the results of our detailed numerical study in Section 5, which also includes performance comparisons for the text highlighting methodologies. Lastly, in Section 6, we discuss the findings of our study, highlight its weaknesses, and provide future research directions. There are two different approaches to generating summaries from text: extractive and abstractive [1].
Extractive summarization extracts sentences or small segments directly from the source text, whereas abstractive summarization may produce segments that are not present in the source text. The extractive approach can further be divided into sentence-level and word-level summarization. Most research in extractive summarization focuses on sentence-level summarization [16]. The task of sentence selection can be considered an information retrieval task [24], where all sentences within a text are evaluated through a scoring mechanism, and the highest-scoring sentences are selected as being the most relevant to a summary. A common issue in extraction-based summarization is how to determine which sentences must be kept in a summary and which must be excluded [17]. Disagreement among annotators can arise for various reasons, including missing context, imprecise questions, contradictory evidence, and multiple interpretations stemming from diverse levels of annotator expertise [11]. This is why it is important to measure the level of agreement to reach a consensus on what should be highlighted. Cohen's kappa and Krippendorff's alpha are two popular measures of inter-annotator agreement [28]. To achieve consensus among physicians, Van Vleck et al. [31] conducted a series of structured interviews with five Department of Medicine residents at New York-Presbyterian Hospital, in which each physician reviewed medical records to identify sentences relevant for summarizing a patient's history. The physicians were provided with complete, de-identified patient records for three common general medical admissions. For each case, they were asked to acquaint themselves with the patient and then to underline information crucial to describing the patient to a colleague. Training word-level extractive summarization models requires a strategy to identify important segments of text in the dataset.
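To make the agreement measures mentioned above concrete, Cohen's kappa compares the observed agreement between two annotators against the agreement expected by chance. The sketch below is a minimal illustration with hypothetical binary highlight labels, not code from the study:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical annotators marking six words as highlighted (1) or not (0).
kappa = cohens_kappa([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 0])  # ≈ 0.667
```

A kappa of 1 indicates perfect agreement, 0 indicates chance-level agreement; values in between quantify how much of the non-chance agreement the annotators achieve.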
Term frequency-inverse document frequency (TF-IDF) is a widely used numerical statistic that reflects how important a word or segment of text is to a document in a corpus. Laban et al. [15] use TF-IDF to identify important keywords for training a reinforcement-based abstractive summarization model. Moen et al. [17] propose and implement four novel automatic extractive summarization methods using electronic health records from patients in Finland. These methods were developed specifically with the clinical domain in mind. The first approach is called repeated sentences. The underlying hypothesis for this method is that information that is repeated multiple times throughout an instance is the most important information to include in a summary. The second approach is called case-based, which retrieves existing cases with similar content. The underlying hypothesis is that patients with similar instances have similar content in their discharge summaries. The sentences from these discharge summaries are then treated as the central 'topics' for what to include in the summary. The third approach is called translate, which aims to construct a type of translation system that can map sentences in clinical notes to the most probable sentences to be found in an accompanying discharge summary, based on translation statistics learnt from a large clinical corpus. The final approach is called composite, which combines the above three methods, normalizing the scores and selecting the top sentences for the final summary. Using domain concepts to form summaries is another method for sentence extraction. For instance, Dudko et al. [8] present a new approach to data retrieval from documents in the medical domain. This approach generates a summary of the document and highlights the most important words using a document ontology, which includes information about all recognized keywords and concepts in the document.
Document ontology creation also includes a measure of relevance for each concept, which is calculated by assigning points to the concepts [9]. The approach relies on first identifying medical keywords within the document text, and then accumulating them into a document ontology in the form of a graph, which is used to produce the document summary. The keyword recognition in the text of a given document is performed using natural language processing techniques. Every identified keyword receives one point, and if a keyword is repeated multiple times, every occurrence provides an additional point. A voting approach is then used to highlight terms with the highest number of points. This approach does not try to suggest a diagnosis or make any other kind of decision automatically. It demonstrates which concepts are written about in the document and which keywords explain these concepts, whereas our methods form a prediction on the text and then highlight based on the given prediction. Similarly, Reeve et al. [24] proposed using the frequency of domain concepts to identify important sentences within a full text to summarize biomedical texts. This method scores sentences based on a set of features, and extracts the sentences with the highest scores to form a summary. Extractive summarization is also possible using deep learning approaches. Verma and Nidhi [32] propose a text summarization approach for factual reports using a deep learning model that consists of three phases, namely feature extraction, feature enhancement, and summary generation. They use deep learning in the second phase to build complex features out of simpler features extracted in the first phase. Specifically, they employ a Restricted Boltzmann Machine to enhance those features to improve the resulting accuracy without losing any important information. They score the sentences based on those enhanced features, and construct an extractive summary accordingly.
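The methods surveyed above share a common "score then select" pattern: assign each sentence a score, then extract the top-scoring sentences in document order. A minimal sketch of that pattern follows, with a hypothetical keyword-count scorer standing in for the cited scoring schemes:

```python
import re

def extractive_summary(sentences, score_fn, k=2):
    """Select the k highest-scoring sentences, preserving document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_fn(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

# Hypothetical scorer: count occurrences of known medical terms.
MEDICAL_TERMS = {"fever", "cough", "rash"}
def term_score(sentence):
    return sum(w in MEDICAL_TERMS for w in re.findall(r"\w+", sentence.lower()))

sentences = [
    "I woke up early today.",
    "I have a fever and a bad cough.",
    "The weather was nice outside.",
    "There is a rash on my arm.",
]
summary = extractive_summary(sentences, term_score, k=2)
```

Any of the scoring mechanisms discussed above (repetition counts, concept-point voting, learned features) can be dropped in as `score_fn` without changing the selection step.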
Aside from standard extractive summarization methods, another way to perform text highlighting is through LIME. It is a model-agnostic approach that involves training an interpretable model on samples created around a specific data point by perturbing the data [25]. Once a prediction is made, LIME outputs the features that are indicative of the prediction. Nguyen [21] uses LIME to evaluate the interpretability of text classification models. Ribeiro et al. [26] emphasize the importance of being able to trust a model's predictions. They employ LIME to evaluate text classification models by using human subjects to measure the impact of explanations on trust and associated tasks. Moradi and Samwald [20] use LIME in the context of medical text classification, where they produce explanations for predictions on a disease-treatment related information extraction task. In our analysis, we employ LIME as a method of highlighting important words relevant to patients' conditions in their messages to doctors. To highlight the important words via LIME, patients' messages are first classified into a category, and then LIME is utilized to extract words related to the category. We use two different datasets in our analysis. The first one is a proprietary medical chat dataset that consists of chat messages between patients and doctors, while the second one is the open-source MTSamples dataset, which includes medical transcriptions for patients treated in a hospital setting. The medical chat dataset is obtained from a telehealth service company. It contains written conversations between patients and doctors, in which patients explain how they feel and the symptoms they have, according to which doctors provide medical advice. In most cases, the doctor is able to provide recommendations without a physical examination. However, when further tests are needed to make a proper diagnosis, the patient is advised to go to a hospital or walk-in clinic.
The dataset includes a collection of 33,699 doctor-patient conversations between October 6, 2019 and July 15, 2020, comprising 901,939 exchanged messages. A conversation always starts with a patient message automatically generated from a predefined questionnaire that the patient fills out. Based on this form, an important piece of metadata called the issue category is created. This property describes the type of issue the patient has, such as pregnancy or COVID-19, and every conversation is tagged with an issue category. The medical chat dataset is used to derive two datasets, the Medical Chat Classification dataset and the Medical Chat Highlighting dataset, to allow experimentation with different methodologies. The Medical Chat Classification dataset is designed for a multi-class classification task. The goal is to predict the issue category from a doctor-patient conversation. There are eight issue categories in the dataset: cold or flu, COVID-19, gastrointestinal, pregnancy, sexual health, skin, pain management, and other. In our study, the last two categories were omitted, because the pain management category has very few samples and the other category contains many mislabeled samples. It is important to note that we use this dataset in order to leverage the existing labels (issue categories) available in the data. The methodology proposed using this dataset allows us to design a text highlighting model without any additional annotations for text highlighting. The Medical Chat Highlighting dataset contains a manually labelled subset of the medical chat dataset for the text highlighting task. A subset of 350 conversations was randomly selected from the last two months of the available data. The data contains around 8,330 messages, since each chat includes on average 23.8 messages. The first 300 conversations and the last 50 were separated as the training and test sets, respectively.
In the text highlighting task, each word has a binary label: 1 for highlighted and 0 otherwise. The second dataset we use is derived from an open-source dataset called MTSamples [35]. It is a large collection of publicly available transcribed medical reports. It contains sample transcription reports, provided by various transcriptionists, for many specialties and different work types. It is important to note that this dataset was used only for evaluating the text highlighting models; it was not used for training. The models trained on the previously mentioned datasets were also tested on this dataset to evaluate their usability on different medical texts, as shown in
We applied the following preprocessing steps to improve the quality of the inputs provided to the models.
• The OpenNMT tokenizer [14] is used for tokenization. This tokenizer supports reversible tokenization by annotating tokens or injecting modifier characters. We further implemented a wrapper around the OpenNMT tokenizer to easily map the predicted highlights back to the original raw texts. (A GitHub link will be provided in the final version of the paper.)
• We decided to keep stopwords in the datasets, since these words are sometimes part of the highlights. We observed that model performance drops if we remove the stopwords, since the model is then unable to highlight stopwords such as "of" or "my".
• Other preprocessing steps include removing URLs, converting all characters to lowercase, and finally removing non-Unicode characters and line breaks.
In this section, we describe our experimental setup, which is summarized in Figure 1. We train five text highlighting models using three different methodologies. The first one is the baseline model that uses TF-IDF scores to find the important segments in the text. Unlike the other methodologies, TF-IDF only uses the Medical Chat Highlighting dataset for training.
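The URL, casing, and character-cleaning steps in the bullet list above (tokenization aside) can be sketched as a single function. The regular expressions and step order are our assumptions rather than the paper's exact pipeline, and we read "non-Unicode characters" as non-ASCII ones:

```python
import re

def clean_text(text):
    """Remove URLs, lowercase, and drop non-ASCII characters and line breaks."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # strip URLs
    text = text.lower()                                    # lowercase everything
    text = text.encode("ascii", errors="ignore").decode()  # drop non-ASCII chars
    text = re.sub(r"[\r\n]+", " ", text)                   # remove line breaks
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

cleaned = clean_text("My throat hurts.\nSee https://example.com for café info!")
# → "my throat hurts. see for caf info!"
```

Note that stopwords are deliberately left untouched, matching the observation above that removing them hurts highlighting performance.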
In the LIME approach, we leverage the existing labels in the dataset for the text highlighting purpose. Here, we first train our models on a classification task. Then, in the evaluation stage, we use the feature importances produced by the LIME method to decide which segments should be highlighted. Finally, in the neural network based approach, we first train the models on a large medical terms corpus, and then fine-tune the models on the Medical Chat Highlighting dataset. The TF-IDF score reflects the relevance of a given word to a specific document. Therefore, words that are common in a single document or a small group of documents tend to have higher TF-IDF scores than common words such as "i", "me", "to", and "the". There are varying procedures for implementing TF-IDF; we summarize the overall approach as follows. Given a document collection D, an n-gram t and an individual document d ∈ D, the TF-IDF score of each n-gram in a given document is calculated as TFIDF(t, d, D) = f_{t,d} × log(|D| / f_{t,D}), where f_{t,d} is the frequency of n-gram t in document d and f_{t,D} is the number of documents that contain the n-gram t; the logarithmic factor is referred to as the inverse document frequency. An important disadvantage of the TF-IDF method is that no contextual information is provided to the model. However, in some cases, the dataset may contain labels that can be leveraged to solve the text highlighting task. In the LIME approach, we present a strategy for using the existing labels in a dataset for text highlighting. The process for LIME can be seen in Figure 1. The LSTM results were omitted from our analysis with LIME because, for LSTM models, we observed that the feature importances tend to be distributed among more words. A similar observation has been made in [21]. The more distributed representation of feature importances results in a significantly lower text highlighting performance with the LIME method. LR and SVM models, on the other hand, tend to assign much higher importance to a few words that are important for the prediction.
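A deliberately simplified illustration of the perturbation idea behind LIME: remove one word at a time and record how much the classifier's score for the predicted class drops. The lime library itself fits a weighted linear model over many random perturbations; the toy keyword classifier below is purely hypothetical and stands in for a trained LR or SVM model:

```python
def toy_score(text):
    # Hypothetical stand-in for a trained classifier: the score for a
    # "cold or flu" category grows with the fraction of matched keywords.
    keywords = {"fever", "cough", "chills"}
    words = text.lower().split()
    return sum(w in keywords for w in words) / max(len(words), 1)

def word_importances(text, score_fn):
    """Leave-one-out importances: score drop when each word is removed."""
    words = text.split()
    base = score_fn(text)
    return {w: base - score_fn(" ".join(words[:i] + words[i + 1:]))
            for i, w in enumerate(words)}

imp = word_importances("i have a fever and a dry cough", toy_score)
# Medically relevant words receive the largest positive importances.
```

Words whose removal barely changes (or raises) the class score get importances near or below zero, which is why a classifier that concentrates importance on a few words, as LR and SVM do here, yields sharper highlights than one that spreads it thinly.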
The Long Short-Term Memory (LSTM) network [12] is a very popular choice for sequential data processing because of its ability to capture long- and short-term dependencies. Graves and Schmidhuber [10] showed that bidirectional LSTMs, which process the sequence in both directions, outperform unidirectional variants on framewise classification tasks. In this section, we summarize the results of our numerical study. We performed all experiments using a workstation with an i9-9990K 3.6GHz CPU, an RTX 2070 SUPER 8GB GPU, and 128GB of RAM, running Debian Linux. The Scikit-learn library is used for implementing TF-IDF, LR and SVM, and the TensorFlow 2 library is used for the neural network architecture implementations. In this section, we first briefly define the two sets of performance metrics used to evaluate the different algorithms. Then, we report the performance of the five models on the Medical Chat Highlighting and MTSamples datasets. The first set consists of precision and recall, which describe the "accuracy" of the model for a specific threshold (i.e., "single threshold"). The second set consists of the threshold-independent metrics AUC-ROC and PR-AUC; the latter is more informative when there is severe class imbalance between the classes or when the minority class is more important [27]. For text highlighting, one can argue that the positive (minority) class is more important, since it corresponds to medical terms. We computed both AUC-ROC and PR-AUC scores, which resulted in the same ranking between the models. As described in Section 4.2, the LR and SVM models used for the LIME method are first trained on the Medical Chat Classification dataset. The classification performance of these models is shown in Figure 2. The SVM model outperformed the LR model for all categories in terms of the F1-score. After all five models are trained on the corresponding datasets, they are evaluated on the test sets. In Figure 6, we see that the model is unable to capture some of the date-related information and adjectives. We observed similar misclassifications across all the models.
Further analysis of the misclassified terms showed that there is some mislabelling of these terms in the training data, where phrases like "last week" are highlighted in some cases but not in others. As such, we note that further verification of the training set could improve the performance of the models. In our analysis, we first trained the unigram and N-gram LSTM models on the Medical Terms dataset, where the goal is to classify between medical and non-medical terms. This training step enabled the models to capture the characteristics of medical terminology before fine-tuning. By establishing a baseline and exploring various strategies, we investigate the feasibility of word-level text highlighting of medical texts in the telehealth domain. We propose two new word-level text highlighting methodologies for medical texts. The methodologies that we use leverage three different sources of information: large medical term corpora, existing relevant metadata, and manually labelled samples. Furthermore, by evaluating five different text highlighting models on two medical text datasets, we provide a detailed numerical study that shows the effectiveness of the proposed methodologies. Our results show that our best-performing model, the n-gram LSTM, can successfully highlight the important information in medical texts, as evidenced by the high precision and recall values of the generated highlights. A relevant area of future research is to evaluate the benefit of the presented text highlighting models through an evaluation with telehealth users (doctors). This type of evaluation can offer new insights into how to design word-level text highlighting models for telehealth applications. Another area of research is incorporating active learning into the labelling process. By using active learning strategies, the dataset preparation time and the labeling process can be improved, and the text highlighting dataset samples can be selected more carefully for labelling.
Summarization from medical documents: a survey
Natural language processing of clinical notes for identification of critical limb ischemia
Cognitively inspired task design to improve user performance on crowdsourcing platforms
General-purpose text categorization applied to the medical domain
History of the United States
The Unified Medical Language System (UMLS): integrating biomedical terminology
The Adventures of Sherlock Holmes. Braille Writers' Association of Victoria
Natural language processing knowledge network approach for interactive highlighting and summary
An information retrieval approach for text mining of medical records based on graph descriptor. Information Modelling and Knowledge Bases XXX
Framewise phoneme classification with bidirectional LSTM and other neural network architectures
CrowdVerge: Predicting if people will agree on the answer to a visual question
Long short-term memory
Adam: A method for stochastic optimization
OpenNMT: Open-source toolkit for neural machine translation
The summary loop: Learning to write abstractive summaries without examples
From extractive to abstractive summarization: A journey
Comparison of automatic summarisation methods for clinical free text notes
Distributional semantics resources for biomedical text processing
Interpretable machine learning
Explaining black-box text classifiers for disease-treatment information extraction
Comparing automatic and human evaluation of local explanations for text classification
Understanding the impact of text highlighting in crowdsourcing tasks
Using TF-IDF to determine word relevance in document queries
Concept frequency distribution in biomedical text summarization
Model-agnostic interpretability of machine learning
"Why should I trust you?": Explaining the predictions of any classifier
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
Gold standard online debates summaries and first experiments towards automatic summarization of online debate data
Resolvable vs. irresolvable disagreement: A study on worker deliberation in crowd work
Assessing data relevance for automated generation of a clinical summary
Extractive summarization using deep learning
Crowdsourcing annotations for websites' privacy policies: Can it really work?
Improving searching and reading performance: the effect of highlighting and text color coding
Automatic identification of lifestyle and environmental factors from social history in clinical text

The authors would like to thank Your Doctors Online for providing funding and support for this research. This work was funded and supported by Mitacs through the Mitacs Accelerate Program.