key: cord-0288919-wlcju70g authors: Kanwal, Neel; Rizzo, Giuseppe title: Attention-based Clinical Note Summarization date: 2021-04-18 journal: nan DOI: 10.1145/3477314.3507256 sha: 5929f0ae04c680d7cb8840a3af3d6e1da66ed560 doc_id: 288919 cord_uid: wlcju70g

In recent years, the trend of deploying digital systems in numerous industries has accelerated. The health sector has seen an extensive adoption of digital systems and services that generate significant medical records. Electronic health records contain valuable information for prospective and retrospective analysis that often goes unexploited because of dense, complicated information storage. The core purpose of condensing health records is to select the information that best preserves the characteristics of the original documents with respect to a reported disease. Such summaries may boost diagnosis and save a doctor's time in saturated workload situations like the COVID-19 pandemic. In this paper, we apply a multi-head attention-based mechanism to perform extractive summarization of meaningful phrases in clinical notes. Our method finds the major sentences for a summary by correlating the token, segment, and positional embeddings of sentences in a clinical note. The model outputs attention scores that are statistically transformed to extract critical phrases for visualization on a heat-mapping tool and for human use.

Presenting text in a shorter form has been practiced throughout human history, long before the birth of computers. A summary is defined as a document that conveys valuable information with significantly less text than the original [1]. Summarization can be sensitive in the medical domain because of medical abbreviations and technicalities. From a linguistic perspective, the summarization task falls into two categories. Extractive summarization is an indicative approach in which phrases are scored by similarity weights and chosen verbatim to form the summary. Contrarily, abstractive summarization is an informative approach that requires understanding a topic and generating new text through fusion and compression; it relies on novel phrases, lexicon, and parsing for language generation [2]. Natural language processing (NLP) has proven valuable to clinicians in saturated work environments. For instance, health information systems have reduced the workload of doctors during the Coronavirus (COVID-19) pandemic; a clinical note summarizer can therefore be helpful in a similar fashion.

The notion of presenting a condensed version of literature abstracts using computers and algorithms became a significant line of research in the late 1950s [3]. This approach was based on computing word frequencies and scoring their significance. The idea later evolved into finding summaries based on grammatical position in the text [4]. In contrast, other works [5, 6] proposed query-based summarization frameworks similar to information retrieval techniques. These methods resemble the multi-query vector in multi-head attention, mapping a query and key-value pairs to an output. Meanwhile, two fundamental questions arise in summarization: i) which content is essential to select, and ii) how to create a shorter version of it [7]. With the rising popularity of the transformer [8] as a tool for various text analysis tasks [9], our work proposes a transformer-based method that selects meaningful phrases from clinical discharge summaries, extracting them while preserving the sense of the clinical note with respect to the identified disease.
Inspired by [10], we have fine-tuned a Bidirectional Encoder Representations from Transformers (BERT) model on discharge notes from the Medical Information Mart for Intensive Care (MIMIC-III) dataset. These discharge notes are classified by disease and capture important syntactic information based on International Classification of Diseases (ICD-9) labels. We extract a discrete attention distribution from the first head of the last layer. This probabilistic distribution is then translated using power transforms [11] to create a monotonic attention distribution over a bell curve [12]. Finally, the summary comprises the sentences whose attention scores are higher than the mean attention score of all sentences in the original clinical note.

This paper is organized as follows: Section 2 surveys extractive summarization approaches and their implementations on medical documents. Section 3 describes the methodology for the extractive summarization task. Section 4 presents various evaluation methods and identifies a suitable one for this task. Section 5 displays the results. Finally, we conclude the discussion in Section 6, with limitations and possible future directions in Section 7.

Early works on summarization are based on many different surface-level approaches for the intermediate representation of text documents. These methods focus on selecting top sentences with greedy algorithms and aim to maximize coherence and minimize redundancy [13, 14]. The techniques can be generalized as follows:

• Corpus-based Approach: A frequency-driven approach built on the observation that common, often-repeated words do not carry salient information. It relies on an information retrieval paradigm in which common words are treated as query words. SumBasic [15] is a centroid-based method of this kind that uses word probability as a factor of sentence importance; words in each centroid with higher probability values are selected for a summary.

• Cohesion-based Approach: Some techniques fail when extraction is bound to anaphoric expressions or lexical chains that relate two sentences. Brin et al. [16] proposed a co-reference system that uses cohesion in web search. In clinical notes, anaphoric expressions are frequently used, but they refer to the same subject, meaning the only relation is to the patient.

• Rhetoric-based Approach: This approach organizes text in a tree-like representation [17, 18]. Text units are extracted based on how close they lie to the nucleus. Clinical summaries often have multiple nuclei corresponding to different diseases.

• Graph theoretic Approach: A few popular algorithms like HITS [19] and Google's PageRank [20] laid the groundwork for graph-based summarization. It helps visualize intra-topic similarity, where nodes represent topics and edges show their cosine similarity with sentences [21]. It also makes visual representation easy with tools like MDI-GESTS [22].

• Machine Learning (ML) Approach: ML models outperform on nearly all kinds of tasks, including text summarization. Neural networks (NN) can better exploit hidden features of the text, and an attention mechanism coupled with convolution layers helps choose important phrases based on their position in the document. A recent trend of analyzing text with Bayesian models has also gained popularity [23]. Miller et al. [24] used BERT to encode text and applied K-means to find sentences from health informatics lectures.
Their model suffers from a weakness on large documents, since the extraction ratio is fixed at K sentences. Liu et al. [25] trained an extractive BERT model from abstractive summaries, using a greedy method to generate an oracle summary that maximizes the ROUGE score. BERTSUM [25] used a trigram-blocking method to extract candidate sentences based on gold abstractive summaries in the CNN/DailyMail dataset.

Clinicians rely heavily on textual information to analyze a patient's condition. Vleck et al. [26] followed a cognitive walk-through methodology, identifying phrases specifically relevant to medical understanding. Laxmisan et al. [27] built a clinical summary screen to integrate with health systems; the core purpose was to free up more interaction time for the clinician. Feblowitz et al. [28] proposed a five-stage architecture to facilitate clinical summarization tasks. Their framework (AORTIS) described distinct phases, namely Aggregation, Organization, Reduction, Interpretation and Synthesis [29]; it targeted the production of short laboratory reports and was assessed and validated using Cohen's Kappa index. Alsentzer et al. [30] addressed a similar task using Bayesian modeling. Pivovarov et al. [31] employed heterogeneous sampling and topic modeling; their approach used a Concept Unique Identifier (CUI) upper bound to choose phrases with a high probability of being classified as a disease. Thomas et al. [32] proposed a semi-supervised graph-based summarization method using NN and node classification, although the model was evaluated only on datasets outside the clinical domain. Other researchers followed a similar approach for multi-document summarization [33, 34]. Azadani et al. [35] later carried these ideas over to biomedical summarization; their model uses graph clustering that forms a minimum spanning tree using the Unified Medical Language System (UMLS). Another ontology-oriented graphical representation method uses clustering to form data centrality and mutual refinement [36], with the misclassification index (MI) as the primary evaluation metric to verify cluster purity. Overall, graphical methods have yielded better results in most cases reported in the literature.

Our approach uses a base BERT model fine-tuned on ICD-9-labeled MIMIC-III discharge notes. The model was trained primarily to identify ICD-9 labels from the described symptoms and diagnostic information. The model outputs attention scores for all sentences in a discharge note, and we extract the sentences whose attention scores are higher than the mean over all sentences in the original note. The model is compared against three baselines [24, 34, 37] using divergence measures over word probability distributions for quantitative analysis. Our summarization approach works effectively when reference or human-made summaries are not available. Furthermore, Table 2 presents a qualitative analysis against the chosen baseline approaches.

MIMIC-III is an open-access, publicly available database of unstructured, de-identified health records [38]. It is an extensive relational database with 26 tables linked by subject identity. It includes raw notes of 36,998 patients for each hospital stay in the NOTEEVENTS table. Each discharge note is tagged with a unique label for the identified disease. In total, we have 47,724 clinical discharge notes that comprise details from radiology, nursing, and prescriptions.
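As an illustration of this setup, a minimal sketch of loading the discharge notes is shown below. It assumes the standard MIMIC-III CSV export (access requires a credentialed PhysioNet account); the file path is illustrative and not taken from the paper's released code.

```python
import pandas as pd

# Load the MIMIC-III notes table (path is illustrative).
notes = pd.read_csv("mimic-iii/NOTEEVENTS.csv", low_memory=False)

# Keep only the discharge summaries, the note category used in this work.
discharge = notes[notes["CATEGORY"] == "Discharge summary"]

print(len(discharge), "discharge notes loaded")
```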
These notes can be regarded as multi-topic documents, given the multiple labels attached to each medical note. Moreover, MIMIC-III was published only for diagnostic labeling and does not contain reference summaries.

We utilize a neural architecture built on top of the transformer [8]. BERT is a multi-layer neural architecture with two major variants. We use the base variant with 12 layers (transformer blocks), a hidden size of 768, 12 attention heads, and 110 million parameters. The language model has shown significant improvements on various language processing tasks after fine-tuning [39, 40]. The BERT model creates embeddings in both directions for the representation of the inputs [41]. We have found that the attention heads corresponding to delimiter tokens are remarkably effective for semantic understanding. The work presented in this paper employs a fine-grained understanding of the notes to identify sentences relevant to the classification and bring more information into a summary. A reduced sample of 100 randomly selected notes from the MIMIC-III dataset is used to assess the model's performance quantitatively and qualitatively. Finally, we demonstrate the attention scores for sentences using a highlighting tool (see Sec. 3.4) to inspect the output results. Figure 1 shows the model along with the attention flow.

Pre-processing: Clinical documents contain many irregular abbreviations and periods owing to their particular formatting. Some notes are in grammatical order, whereas other parts are written as review keywords. We use the custom tokenizer presented in [42] to format the data as lists of sentences. It removes tokens with no alphabetic characters, such as percentages in drug prescriptions.

Token Representation: A sentence flows downstream as a sequence of tokens accompanied by two additional special tokens. The input representation of a token combines its token, position, and segment embeddings. [CLS] is the classification token and appears first; [SEP] is a separator token that marks the end of the stream. The output [CLS] representation can be fed to a classifier for different tasks.

Fine-Tuning: Fine-tuning has a large effect on performance in supervised tasks [43]. We fine-tuned the BERT model on the entire MIMIC-III dataset using the maximum sequence length, a batch size of 8, a learning rate of 3e-5, and the ADAM optimizer with epsilon 1e-8, keeping the other hyper-parameters the same as in pre-training. The fine-tuned model classifies the [CLS] token to the maximum-likelihood ICD-9 label.

Attention Extraction: Fine-tuning helps encode semantic knowledge in the self-attention patterns [43]. The multi-head attention mechanism embeds an attention score in the tokens of every sentence in the clinical note. Since the last layer of the BERT model is considered vital to the task [44], we capture the first attention head of the last layer for cross-sentence relations, as observed with BertViz [45]. This attention head focuses on the special [CLS] token that stands for the whole sentence. The attention score obtained for a sentence is used as its measure of significance in the clinical note.

Figure 2: Attention visualization shows how every attention head in the BERT architecture finds words useful relative to other words in a sentence. Each head has a different color identifying its position in the last layer. We apply the same idea to choose useful sentences relative to other sentences in the original note. The demonstration is performed using the BertViz tool [45].

Equations 1 and 2, following the transformer formulation of [8], show how the dot-product attention is calculated in the layers, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1)$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \qquad (2)$$
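A minimal sketch of the attention-extraction and selection steps is given below. It assumes the Hugging Face transformers library and a locally saved fine-tuned checkpoint; the checkpoint path and helper names are illustrative, and scoring a sentence by the mean attention paid to the [CLS] position by the first head of the last layer is one plausible reading of the description above, not the paper's released implementation.

```python
import numpy as np
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Path to the ICD-9 fine-tuned checkpoint is hypothetical.
model = BertForSequenceClassification.from_pretrained(
    "checkpoints/bert-mimic-icd9", output_attentions=True
)
model.eval()

def sentence_attention_score(sentence: str) -> float:
    """Mean attention that all tokens pay to [CLS] (position 0),
    taken from the first head of the last layer."""
    inputs = tokenizer(sentence, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
    head = outputs.attentions[-1][0, 0]   # last layer, first head
    return head[:, 0].mean().item()       # column 0 = attention to [CLS]

def extract_summary(sentences: list[str]) -> list[str]:
    """Keep sentences scoring above the mean of all sentence scores."""
    scores = np.array([sentence_attention_score(s) for s in sentences])
    return [s for s, a in zip(sentences, scores) if a > scores.mean()]
```

For the heat-map visualization (Sec. 3.4), the raw scores can be mapped to a Gaussian with scikit-learn's quantile transform, in the spirit of the transformation the paper applies before feeding Neat-Vision:

```python
from sklearn.preprocessing import QuantileTransformer

def to_gaussian(scores: np.ndarray) -> np.ndarray:
    qt = QuantileTransformer(output_distribution="normal",
                             n_quantiles=min(len(scores), 1000))
    return qt.fit_transform(scores.reshape(-1, 1)).ravel()
```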
In a nutshell, a pre-processed clinical note, as a list of sentences, is fed to the BERT encoder, which creates embeddings and attention scores at each layer. The attention score corresponding to [CLS] decides whether a sentence is a good candidate for the summary. Figure 2 illustrates the positional relevance of the [CLS] token. We then select the sentences whose attention scores are above the average attention value of the sentences in the original note. For example, a sentence with an attention score of 0.14 is chosen if 0.14 is greater than the average attention score over all sentences in the document. This extraction enables dynamic selection, unlike a fixed summary-sentence ratio.

The attention distribution over tokens in the last layer has an irregular pattern. To perform heat-mapping, we use the Neat-Vision tool, which requires a fixed input format and outputs a text heat map; it demands that input data be organized in a particular structure for vibrant coloring. We stratify the distribution obtained from the neural architecture to a Gaussian distribution using the quantile transformation [11]. In other words, the transformation makes sentences with higher attention scores appear rosier in the Neat-Vision tool than the other ones. Figure 3 shows the impact of the transformation on the attention series data; the x-axis presents the sentence index, and the y-axis the attention score of the corresponding sentence. This demonstration can considerably impact clinical practice by reducing the time spent reading long health records. Figure 4 exhibits the usefulness of the heat-mapping concept for health systems.

Evaluation in summarization has been a critical issue, mainly because of the absence of a gold standard. Many competitions such as DUC 4, TREC 5, SUMMAC 6 and MUC 7 propose different metrics. The interpretation of these metrics is not simple, mainly because the same summary receives different scores under different measures. Automatically evaluating the quality of a summary is an ambitious task and is usually performed by comparison with a human-generated summary. For this reason, evaluation is normally limited to domain-specific and opinion-oriented areas [7]. Formally, these evaluation methods can be divided into two areas. In extrinsic evaluation, summaries are manually analyzed against the original document; in our case, a clinician could do this, which may yield different opinions depending on their understanding. Miller et al. [24] leveraged such manual clinical evaluation to compare the performance of their model. In intrinsic evaluation, the extracted summary is directly matched with an ideal summary created by humans. The latter can be divided into two classes, primarily because it is hard to establish an ideal human reference summary.

4 http://duc.nist.gov/
5 http://trec.nist.gov/
6 https://www-nlpir.nist.gov/related_projects/tipster_summac/
7 http://www.itl.nist.gov/iad/894.02/relatedprojects/muc/proceedings/muc7toc.html

• Text quality Evaluation: A linguistic check that examines grammatical and referential clarity. This assessment is not comprehensive for our setting, since medical summaries are unstructured documents with many abbreviations and clinical jargon.
• Content-based Evaluation: It rates summaries against provided reference summaries [46]. Common approaches include ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [47], cosine similarity, and the Pyramid method [48]. Liu et al. [25] studied unigram and bigram ROUGE overlaps for different components of BERTSUM on their single-document summaries.

Sripada et al. [49] exemplified in their work that a summary can be considered adequate if it has a probability distribution similar to that of the original document. The hypothesis was compared in other works [50, 51], where this lightweight, less complex method demonstrated more refined results. This criterion is the most practical evaluation for our methodology, since we do not have reference summaries: we compare the word distributions of the original document and the summary to assess effectiveness. We use two tests for evaluating the goodness of our synopsis, namely Kullback-Leibler divergence (KLD) [52] and Jensen-Shannon divergence (JSD) [53].

KL Divergence: A measure of the difference between two distributions. This measure is asymmetric, and a lower KLD value indicates better relative inference between the distributions. For discrete probability distributions P and Q defined on a probability space X, the KLD from Q to P is defined in Equation 3 [54]:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x)\,\log\frac{P(x)}{Q(x)} \qquad (3)$$

JS Divergence: An extension of KL divergence that quantifies the difference in a slightly modified way. It is a smoothed, normalized form that is symmetric in its inputs, as reported in Equation 4 [53]:

$$D_{\mathrm{JS}}(P \parallel Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M), \quad \text{where } M = \tfrac{1}{2}(P + Q) \qquad (4)$$

We present results comparing our model with three baseline extractive summarization methods. The first, part-of-speech-based sentence tagging, builds on an empirical frequency-selection method; the second centers around a graphical method; and the third uses BERT combined with K-means to find the top k sentences in a centroid. The results can be reproduced, and source code for replication is available in a GitHub repository 8.

The proposed architecture shows significant improvement over the baseline approaches. The divergence scores show how estimating differences in distributions helps compare the word distributions of the two documents. Table 1 shows that our extracted summaries are more informative than the others, based on lower average KL and JS divergence scores. The frequency-based approach yields the highest divergence. The JSD and KLD scores of the graph-based method show a relative improvement over the frequency-based method. The values differ only slightly from those of the centroid-based K-means approach, which calculates sentence embeddings in a similar way. Our method chooses the length of the summary dynamically, which overcomes the weakness of fixed K sentences described in [24]. Overall, the attention mechanism provides strong abstraction power for the summarization task.

Method                             KLD↓     JSD↓
Frequency-Based Approach           0.892    0.426
Graph-Based Approach               0.827    0.408
Centroid-Based K-means Approach    0.80     0.41
Our Proposed Architecture          0.795    0.405

Table 1: Experimental results on a reduced sample set of 100 random clinical notes from the MIMIC-III dataset, comparing the Frequency-Based Approach [37], Graph-Based Approach [34], and Centroid-Based K-means Approach [24] using KLD and JSD values. A lower value corresponds to a better-correlated summary.

As noted in Figures 5 and 6, there are some summaries where distributional similarity does not outperform, owing to their shorter length.
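The divergence evaluation of Equations 3 and 4 over bag-of-words distributions can be sketched as follows. This is a minimal illustration: the add-one smoothing that keeps the KL term finite is an assumption made here for the sketch, not a detail specified in the paper, and the example texts are invented.

```python
from collections import Counter

import numpy as np

def word_distribution(text: str, vocab: list[str]) -> np.ndarray:
    """Add-one-smoothed unigram distribution of `text` over a shared vocabulary."""
    counts = Counter(text.lower().split())
    freqs = np.array([counts[w] + 1 for w in vocab], dtype=float)
    return freqs / freqs.sum()

def kld(p: np.ndarray, q: np.ndarray) -> float:
    """Equation 3: KL divergence of P from Q."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """Equation 4: symmetric, smoothed Jensen-Shannon divergence."""
    m = 0.5 * (p + q)
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# Usage: build a shared vocabulary, then compare the two word distributions.
original = "severe mitral regurgitation with dyspnea on exertion when walking"
summary = "severe mitral regurgitation with dyspnea"
vocab = sorted(set(original.lower().split()) | set(summary.lower().split()))
p, q = word_distribution(original, vocab), word_distribution(summary, vocab)
print(f"KLD={kld(p, q):.3f}  JSD={jsd(p, q):.3f}")
```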
8 https://github.com/NeelKanwal/BERTOLOGY-Based-Extractive-Summarization-for-Clinical-Notes

Centroid-Based K-means Summary: daily disp tablet delayed release e.c. lastnametoken on february at 15pm cardiologist dr. lastnametoken on february at 30am wound check on thurs january at am with cardiac surgery on hospitaltoken please call to schedule appointments with your primary care dr. lastnametoken in march weeks please call cardiac surgery office with any questions or concerns telephonenumbertoken answering service will contact on call person during off hours completed by january.

Frequency-Based Summary: disp tablet refills ranitidine hcl mg tablet sig one tablet daily. refills tramadol tablet two tablet q6h hours as needed for pain. tablet senna mg tablet One tablet daily, disp tablet refills furosemide mg tablet for Mitral valve repair coronary artery bypass. graft x left internal mammary artery to left anterior descending history of present illness year old female who was told she had mvp since age currently quite active but has noticed some dyspnea on exertion when walking up hills most recent echo revealed severe mvp and moderate to severe Daily daily disp tablet delayed release e.c. s refills docusate sodium mg capsule sig one capsule po bid times Disp tablet er particles crystals s refills discharge disposition home with service facility hospitaltoken vna discharge diagnosis mitral regurgitation coronary artery disease.

Graph-Based Summary: refills docusate sodium mg capsule one capsule a day magnesium hydroxide suspension thirty ml at bedtime as needed for constipation atorvastatin tablet one tablet daily. disp tablet s refills furosemide tablet once a day for days disp tablet refills ranitidine hcl mg tablet daily. please shower daily including washing incisions gently with mild soap no baths or swimming until cleared by surgeon look at your incisions daily for redness or drainage. please no lotions cream powder or ointments to incisions each morning you should weigh yourself and then in the evening take your temperature these should be written down on the chart no driving for approximately one month and while taking narcotics will be discussed at follow up appointment with surgeon when you will be able to drive no lifting more than pounds for weeks please call with any questions or concerns telephonenumbertoken females please wear bra to reduce pulling on incision avoid rubbing on lower edge. please call cardiac surgery office with any questions or concerns telephonenumbertoken answering service will contact on call person during off hours followup instructions you are scheduled for the following appointments surgeon dr. lastnametoken on february at 15pm cardiologist dr. lastnametoken on february at 30am wound check on thurs january at am with cardiac surgery on hospitaltoken.

Our Proposed Approach: old female who was told she had mvp currently quite active but has noticed some dyspnea on exertion when walking up hills. she presents for surgical consultation past medical history mitral regurgitation copd secondary to asbestos exposure as a child arhtritis cataracts headaches lactose intolerance r wrist and elbow surgery. widowed occupation retired disabled nurse tobacco quit smoking in father died suddenly at cause unknown physical exam.
no spontaneous echo contrast is seen in the left atrial appendage there is a small pfo with left to right flow overall left ventricular systolic function is normal lvef in the face of mr there is normal free wall contractility there are simple atheroma in the descending thoracic aorta the aortic valve leaflets are mildly thickened trace aortic regurgitation is seen the posterior leaflet is very degenerate and there is moderate to severe mitral regurgitation. there is no pericardial effusion the tip of the sgc is seen at the pa bifurcation post cpb the patient is av paced on no inotropes the pfo is closed normal biventricular systolic fxn there is a mitral ring prosthesis which is well seated trace mr residual mean gradient with an area of no ai aorta intact. mrs. lastnametoken was a same day admit after undergoing all pre operative work. she was tolerating a full oral diet her incisions were healing well and she was ambulating in the halls without difficulty it was felt that she was safe for discharge home at this time with vna services all appopriate follow up appointments were arranged.

Table 2: Qualitative evaluation of our approach against the three baseline methods.

The curves in Figures 5 and 6 show that attention-based extraction is more impactful than its counterparts. The JS divergence metric fluctuates less than the other metric because of its symmetric averaging mechanism. Summaries from each method are placed in Table 2 for qualitative analysis. It can be observed that the summaries generated by our proposed architecture are more coherent and make it easier to build a clinical understanding. The baseline approaches, on the other hand, provide short and incoherent sentences for the selected note. Shorter summaries are more likely to lose discriminatory information and degrade understanding; thus, evaluating the usefulness of a summary in terms of sentence counts may not be optimal. It may be hard for a non-specialist to judge the relative usefulness of each summary, as described in Section 4. This method shows the applicative benefits of dynamic summarization in healthcare systems. Furthermore, highlighting tools make it easier for a physician to grasp the essence of a diagnosis, as displayed in Figure 4.

The immense increase in digital text information has undoubtedly emphasized the need for universal summarization frameworks. Abstractive summarization remains debatable in specific scenarios, e.g., medical ones, because of the risk of generating summaries that deliver meanings different from the original notes reported by physicians. Extractive summarization techniques, however, are comparatively safe in the clinical domain. At the same time, evaluation of medical summarization is more challenging than in other domains. In this paper, we have presented a neural architecture for extracting summaries based on multi-head attention. We have used statistical analysis methods to understand the degree of relevance between the summary and the original clinical note. Our architecture achieves better results on a set of MIMIC-III clinical notes, outperforming frequency-based, graph-oriented, and centroid-based approaches. The proposed model is domain-specific and outperforms the other methods discussed in the literature. The evaluation criterion of measuring divergence among distributions is suitable when ideal summaries are not available.
Furthermore, our proposed model can be integrated into a decision-support system to better interpret clinical information by highlighting diagnostically relevant phrases.

Medical summarization is a unique and delicate task. It is difficult to evaluate automatically whether an obtained summary is a well-condensed representation of the original document. Moreover, performing a qualitative evaluation is labor-intensive and subjective, and may also depend on the physician's personal experience with similar diseases. A universal medical summarizer would have to overcome the limitations arising from diverse writing styles. Our attention-based model is fine-tuned on the MIMIC-III dataset; therefore, it may not perform well on clinical notes written in a different structure and mapped onto a different set of diseases. ICD-9 offers broad coverage and accurate cataloging of diseases; however, we consider the more contemporary ICD-10 in our current research activities. A combination of abstractive and extractive summarization using neural generative language models may be a promising direction for future work.

References

Introduction to the special issue on summarization.
Indexing concepts and methods.
The automatic creation of literature abstracts.
Machine-made index for technical literature: an experiment.
The automated acquisition of topic signatures for text summarization.
Natural language information retrieval: TREC-4 report.
Automatic Text Summarization: Past, Present and Future.
Neural machine translation by jointly learning to align and translate.
Statistical modelling with quantile functions.
What does BERT look at? An analysis of BERT's attention. CoRR.
Text summarization techniques: A brief survey.
Extractive summarization as text matching.
Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion.
The anatomy of a large-scale hypertextual web search engine.
Summarizing text documents: Sentence selection and evaluation metrics.
A trainable document summarizer.
Statistics-based summarization - step one: Sentence compression (Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence).
TextRank: Bringing order into text.
A survey of text summarization extractive techniques.
Summarize to learn: Summarization and visualization of text for ubiquitous learning.
Email classification and summarization: A machine learning approach.
Leveraging BERT for extractive text summarization on lectures.
Fine-tune BERT for extractive summarization.
Assessing data relevance for automated generation of a clinical summary.
Clinical summarization capabilities of commercially-available and internally-developed electronic health records.
Summarization of clinical information: A conceptual model.
Development and evaluation of a comprehensive clinical decision support taxonomy: Comparison of front-end tools in commercial and internally developed electronic health record systems.
Extractive summarization of EHR discharge notes.
Electronic health record summarization over heterogenous and irregularly sampled clinical data.
Semi-supervised classification with graph convolutional networks.
Towards coherent multi-document summarization.
Graph-based neural multi-document summarization.
Graph-based biomedical text summarization: An itemset mining and sentence clustering approach.
A coherent graph-based semantic clustering and summarization approach for biomedical literature and a new summarization evaluation method.
New methods in automatic extracting.
MIMIC-III, a freely accessible critical care database.
Fine-tuned language models for text classification.
Semi-supervised sequence learning.
BERT: Pre-training of deep bidirectional transformers for language understanding.
Explainable prediction of medical codes from clinical text.
Revealing the dark secrets of BERT.
How does BERT answer questions? A layer-wise analysis of transformer representations. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management.
A multiscale visualization of attention in the transformer model. CoRR, abs.
Evaluation measures for text summarization.
ROUGE: A package for automatic evaluation of summaries.
The pyramid method: Incorporating human content selection variation in summarization evaluation.
Summarization approaches based on document probability distributions.
A compositional context sensitive multi-document summarizer: Exploring the factors that influence summarization.
Multi-document summarization by maximizing informative content-words.
Information theory and statistics. Solomon Kullback. New York: John Wiley and Sons; London: Chapman and Hall.
On a generalization of the Jensen-Shannon divergence and the JS-symmetrization of distances relying on abstract means.
A survey on automatic text summarization.