A Survey on Dialogue Summarization: Recent Advances and New Frontiers

Xiachong Feng, Xiaocheng Feng, Bing Qin*
2021-07-07
* Corresponding author.

Abstract: Dialogue summarization aims to condense the original dialogue into a shorter version covering salient information, which is a crucial way to reduce dialogue data overload. Recently, promising achievements in both dialogue systems and natural language generation techniques have led this task to a new landscape, resulting in significant research attention. However, a comprehensive survey for this task is still lacking. To this end, we take the first step and present a thorough review of this research field. In detail, we systematically organize current works according to the characteristics of each domain, covering meeting, chat, email thread, customer service and medical dialogue. Additionally, we provide an overview of publicly available research datasets and organize two leaderboards under unified metrics. Furthermore, we discuss several future directions, including faithfulness, multi-modal, multi-domain and multi-lingual dialogue summarization, and give our thoughts on each. We hope that this first survey of dialogue summarization can provide the community with quick access to and a general picture of this task, and motivate future research.

Dialogue summarization aims to distill the most important information from a dialogue into a shorter passage, which can help people quickly capture the highlights of a semi-structured and multi-participant dialogue without reviewing the complex dialogue context. With the development of communication technology and the ravages of COVID-19, different types of dialogues have emerged as an important means of information exchange. Therefore, there is an urgent need for summarization techniques to save people from large amounts of dialogue data.

Conventional works mainly focus on single-participant document summarization, such as news and scientific papers [See et al., 2017]. Neural models, especially sophisticated pre-trained language models, have advanced these tasks significantly [Lewis et al., 2020]. Despite this success, such methods cannot be easily transferred to multi-participant dialogue summarization. Firstly, a dialogue contains multiple participants, inherent topic drifts, frequent coreferences, diverse interactive signals and domain terminologies [Feng et al., 2021b]. All of these characteristics make dialogue a hard-to-model data type. Secondly, across different domains, these characteristics pose further domain-specific challenges to summarization models, e.g., how to model long meeting transcripts. Thirdly, compared with widely used document summarization benchmarks, collecting labeled dialogue-summary pairs is highly costly or even intractable.

To mitigate these challenges, researchers draw on successful experience from the study of dialogue systems and natural language generation techniques and put their efforts into solving this challenging task, resulting in nearly 100 papers covering various domains published over the past five years. To review the current progress and help new researchers get into the field quickly, we present this first survey of dialogue summarization.
As a preliminary, we first give a quick overview of recent progress in general summarization and capture several key time points and techniques; this serves as background before we dive into dialogue summarization (see §2). As the core content, we summarize existing works according to the domain of the dialogue, mainly covering meeting, chat, email thread, customer service and medical dialogue. For each type of dialogue, we thoroughly go through related research works, organize them according to their unique challenges and provide suggestions for future work (see §3). For example, for chat summarization we focus on two main streams of work: interaction modeling and participant modeling. In terms of customer service, we organize related works from two perspectives: one is inherent topic modeling [Liu et al., 2019a], the other is task-oriented-specific information integration. Besides, we provide an overview of publicly available research datasets (see Table 1). Especially for meeting and chat summarization, we also carefully organize leaderboards under a unified evaluation metric by collecting reported results from the published literature and re-evaluating official outputs (see Table 2 and Table 3). Based on analyses of existing works, we present several research directions, including faithfulness in dialogue summarization as well as multi-modal, multi-domain and multi-lingual dialogue summarization (see §4). All of these frontiers not only pose new research challenges but also meet actual application needs and fit real-world scenarios.

To sum up, our contributions are as follows:
• We are the first to present a comprehensive survey for the dialogue summarization task.
• We thoroughly summarize existing works according to different types of dialogues and carefully organize leaderboards under a unified evaluation metric.
• We discuss some new frontiers and highlight their challenges to motivate future research.

In this section, we give an overview of the summarization task and then describe the commonly used evaluation metrics.

Automatic summarization is a fundamental task in natural language processing and has been studied continuously for decades [Paice, 1990]. It aims to condense the original input into a shorter version covering salient information, which can help people quickly grasp the core content without diving into the details. It is mainly divided into two paradigms: extractive and abstractive. Extractive methods select vital sentences as the summary, which is more accurate and faithful, while abstractive methods generate the summary using novel words, which improves conciseness and fluency. Earlier works adopt machine learning algorithms to perform extractive summarization [Mihalcea and Tarau, 2004]. With sophisticated neural architectures, data-driven approaches have made much progress in both paradigms. Especially for abstractive methods, sequence-to-sequence learning combined with an attention mechanism is adopted as the backbone architecture for this task [See et al., 2017]. Recently, with the great success of pre-trained models in a wide range of natural language processing tasks, these models have also become the de facto way to generate summaries and have achieved many state-of-the-art results [Lewis et al., 2020].
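To make this backbone concrete, the snippet below runs an off-the-shelf pre-trained sequence-to-sequence summarizer over a toy chat. This is a minimal sketch, not any system surveyed here: the checkpoint name and the example dialogue are illustrative, and a model fine-tuned on news treats the concatenated turns as one flat document, which is exactly the mismatch that dialogue summarization research tries to address.

```python
from transformers import pipeline

# Off-the-shelf abstractive summarizer. "facebook/bart-large-cnn" is an
# illustrative checkpoint fine-tuned on news articles, not on dialogues.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# A toy chat; speaker turns are simply concatenated into one flat document,
# so speaker roles, topic drift and coreference are invisible to the model.
dialogue = (
    "Amanda: I baked cookies. Do you want some? "
    "Jerry: Sure! "
    "Amanda: I'll bring you some tomorrow."
)

result = summarizer(dialogue, max_length=30, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```

The dialogue-specific systems reviewed in §3 build on exactly this kind of sequence-to-sequence backbone and add speaker, topic and discourse modeling on top of it.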
ROUGE [Lin, 2004] is conventionally adopted as the standard metric for evaluating summarization, mainly involving the F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L, which measure the word overlap, bi-gram overlap and longest common subsequence between the ground truth and the generated summary, respectively.

In this section, we describe the taxonomy of dialogue summarization according to the domain of the input dialogue, including meeting, chat, email thread, customer service and medical dialogue. Table 1 lists currently available datasets for dialogue summarization research.

Table 1: Currently available datasets for dialogue summarization research.

Name                                 Domain            Language
ICSI [Janin et al., 2003]            Meeting           English
AMI [Carletta et al., 2005]          Meeting           English
QMSum [Zhong et al., 2021]           Meeting           English
SAMSum [Gliwa et al., 2019]          Chat              English
GupShup [Mehnaz et al., 2021]        Chat              Code-Mix
CSDS                                 Customer Service  Chinese
TODSum                               Customer Service  English
TWEETSUMM [Feigenblat et al., 2021]  Customer Service  English
CRD3 [Rameshkumar and Bailey, 2020]  TV Show           English
[Song et al., 2020]                  Medical           Chinese
SumTitles [Malykh et al., 2020]      Movie             English
MEDIASUM [Zhu et al., 2021]          Interview         English
DIALOGSUM [Chen et al., 2021]        Spoken            English
EMAILSUM [Zhang et al., 2021a]       Email             English
ForumSum [Khalman et al., 2021]      Forum             English
ConvoSumm [Fabbri et al., 2021]      Mix               English

Meetings play an essential part in our daily life. Especially due to the spread of COVID-19 worldwide, people depend more on online meetings to share information and collaborate with others. Accordingly, meeting summaries, aka meeting minutes, can be of great value for both participants and non-participants to quickly grasp the main meeting ideas. Thanks to the early publicly available datasets AMI [Carletta et al., 2005] and ICSI [Janin et al., 2003], meeting summarization has attracted extensive research attention.

Early works focus on extractive meeting summarization, adopting various features to detect important utterances, such as key phrases, topics and speaker characteristics. However, due to the multi-participant nature, information is scattered and incoherent in a meeting, which makes extractive methods unsuitable for meeting summarization. Therefore, recent years have witnessed a growing trend of abstractive meeting summarization methods [Shang et al., 2018]. With the development of neural networks, many works have explored the application of deep learning to the meeting summarization task and have achieved remarkable success. Although deep-learning-based methods have strong modeling abilities, taking only literal information into consideration is not sufficient: there are diverse interactive signals among meeting utterances, and long meeting transcripts pose further challenges to traditional sequence-to-sequence models. To this end, some works devote efforts to incorporating auxiliary information for better modeling of meetings, such as dialogue discourse [Feng et al., 2021b], dialogue acts [Goo and Chen, 2018] and domain terminologies [Koay et al., 2020]. Besides, several strategies have been carefully devised to handle long meeting transcripts, including the hierarchical modeling strategy, the sliding window strategy [Koay et al., 2021], the retrieve-then-summarize strategy [Zhang et al., 2021c] and the pre-training strategy [Zhong et al., 2022].

Table 2: Leaderboard of meeting summarization on the AMI [Carletta et al., 2005] and ICSI [Janin et al., 2003] datasets. We adopt reported results from the published literature [Feng et al., 2021b] and corresponding publications. The results of Longformer [Fabbri et al., 2021] are obtained by evaluating the output files provided by the authors. Results with * indicate that ROUGE-L is calculated with sentence splitting.
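Because the leaderboards in Tables 2 and 3 are only comparable when every score comes from the same ROUGE implementation, it is worth seeing what the metric actually computes. The sketch below is a minimal ROUGE-N F1 under the simplifying assumptions of lowercasing and whitespace tokenization; official packages such as pyrouge add stemming and further preprocessing, which is precisely why scores from different packages diverge.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N F1: harmonic mean of clipped n-gram recall and precision.

    Assumes lowercased, whitespace-tokenized input; real evaluation
    toolkits (e.g., pyrouge) add stemming and sentence splitting.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # n-gram matches, clipped by count
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

# ROUGE-2 between a reference summary and a shorter candidate: 2 of the 4
# reference bigrams are matched, giving recall 0.5, precision 1.0, F1 0.67.
print(rouge_n_f1("amanda baked cookies for jerry",
                 "amanda baked cookies", n=2))
```

ROUGE-L follows the same recall/precision/F1 scheme but replaces n-gram overlap with the length of the longest common subsequence.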
Instead of summarizing the whole meeting, generating meeting summaries for a particular aspect, such as decisions, actions, ideas and hypotheses, can also address specific needs. Recently, Zhong et al. [2021] propose the query-based meeting summarization task, which aims to summarize a specific part of the meeting according to a given query.

In addition to their multi-party characteristics, meetings have also been explored under the multi-modal setting. Meetings can include various types of non-verbal information displayed by the participants, such as audio, visual and motion features. These features may be useful for detecting important utterances in a meeting. Therefore, a number of works study both the extractive and the abstractive multi-modal meeting summarization problem by fusing verbal and non-verbal features to enrich the representation of each utterance.

Leaderboard: To unify this research direction, we systematically present a comprehensive leaderboard for the two widely used meeting summarization datasets, AMI and ICSI, using the pyrouge package (https://pypi.org/project/pyrouge/), as shown in Table 2.

Highlight: Meetings always involve several participants with specific roles, so it is necessary to model such distinctive role characteristics. Besides, the long transcripts require models capable of handling long sequences. Furthermore, the audio-visual recordings of meetings provide the opportunity to use multi-modal information. However, this is a double-edged sword: the error rates of automatic speech recognition systems and vision tools also pose challenges to current models, which must therefore be more robust.

Online chat applications have become an indispensable way for people to communicate with each other, which has led to people being overwhelmed by massive amounts of chat information. Such complex dialogue contexts pose a challenge to a new chat participant, who may be unable to quickly review the main idea of the dialogue. Therefore, summarizing chats has become a new trending direction. Gliwa et al. [2019] introduce the first high-quality, manually annotated chat summarization corpus, SAMSum, and conduct various baseline experiments, which rapidly sparked this research direction. Afterward, follow-up work takes the first step and proposes a multi-view dialogue summarizer by introducing both topic segments and conversational stages. More importantly, that work conducts a comprehensive study of the challenges in this task, revealing the importance of dialogue modeling for dialogue summarization and pointing out directions for future researchers.

Roughly speaking, the majority of current works put emphasis on two aspects, dialogue interaction modeling and dialogue participant modeling, which are in line with the prominent characteristics of conversational data. To model the interaction, graph modeling strategies combined with additional features are widely adopted. Zhao et al. [2020] utilize fine-grained topic words as bridges between utterances to construct a topic-word-guided dialogue graph. Another line of work considers the inter-utterance dialogue discourse structure and intra-utterance action triples to explicitly model the interaction. Feng et al.
[2021a] view commonsense knowledge as cognitive interactive signals behind different utterances and show the effectiveness of integrating knowledge with heterogeneity modeling for different types of data. Other works explicitly incorporate coreference information into dialogue summarization models; notably, they conduct data post-processing to reduce incorrect coreference assignments caused by document-level coreference resolution models. To model the participants, Lei et al. [2021] implicitly model complex relationships among participants and their corresponding personal pronouns via a speaker-aware self-attention mechanism. From another perspective, Narayan et al. [2021], among others, explicitly adopt the guided summarization framework and introduce participant information into a coarse-to-fine generation procedure, in which the final dialogue summary is controlled by a precedent such as a sketch or named entities.

As shown in the above works, current dialogue summarization systems usually encode the text with additional information. However, these annotations are usually obtained via open-domain toolkits, which are not suitable for dialogues, or require manual annotation, which is labor-consuming. Therefore, Feng et al. [2021c] present an unsupervised DialoGPT annotator, which can perform three dialogue-specific annotation tasks: keyword extraction, redundancy detection and topic segmentation.

Despite the encouraging results reported, current models still suffer from the data-insufficiency problem. Accordingly, some researchers study this task in the low-resource regime. Gunasekara et al. [2021] innovatively explore the summary-to-dialogue generation problem and verify that the augmented dialogue-summary pairs can benefit dialogue summarization. Follow-up work proposes three conversational data augmentation methods to enrich the data: random utterance swapping or deletion, dialogue-acts-guided utterance insertion and conditional-generation-based utterance substitution.

Leaderboard: Previous works have already achieved remarkable success on the SAMSum dataset [Gliwa et al., 2019]. However, due to the different versions of the ROUGE evaluation package, benchmark results unifying all the scores have been lacking. To this end, we present benchmark results using the py-rouge package. The results are shown in Table 3.

Table 3: Leaderboard of chat summarization on the SAMSum dataset [Gliwa et al., 2019], where "R" is short for "ROUGE". We mainly adopt results from the corresponding publications. Besides, the results of S-BART, MV-BART, Coref-ATTN and Entity-Plan are obtained by evaluating output files provided by the authors. † indicates that the model obtains these results with the help of golden summaries.

Model                            R-1    R-2    R-L
BART(DALL) [Feng et al., 2021c]  53.70  28.79  50.81
Coref-ATTN                       53.93  28.58  50.39
Entity-Plan †                    56.53  32.40  54.92

Highlight: Thanks to pre-trained language models, current methods are skilled at transforming the original chat into a simple summary realization. However, they still have difficulty selecting the important parts and tend to generate hallucinations. In the future, powerful chat modeling strategies and reasoning abilities should be explored for this task, and more low-resource settings should be considered.

An email thread is an asynchronous multi-party communication consisting of a coherent exchange of email messages among several participants, widely used in enterprise, academic and work settings. Compared with other types of dialogue, email has some unique characteristics. Firstly, it is associated with meta-data, including sender, receiver, main body and signature. Secondly, an email message always represents the intent of the sender, contains action items and may use quotes to highlight important parts. Thirdly, unlike face-to-face spoken dialogue, replies in email do not happen immediately.
Such an asynchronous nature may result in messages containing long sentences. To deal with email overload, email service providers seek efficient summarization techniques to improve the user experience.

Major efforts lie in email thread summarization. Pioneering works present publicly available datasets to facilitate this task. Carenini et al. [2007] collect 39 email threads from the Enron email dataset and annotate them with extractive summaries. They propose an email fragment quotation graph based on the occurrence of clue words and conduct extractive summarization. Notably, quotation plays an important role in email, as it can directly highlight the salient part of a previous email. To enrich the annotation, Ulrich et al. [2008] collect 40 email threads from the W3C email dataset and annotate them with both abstractive and extractive summaries along with meta sentences, subjectivity and speech acts. Loza et al. [2014] collect 107 email threads from the Enron email dataset and annotate them with extractive and abstractive summaries combined with key phrases. Recently, Zhang et al. [2021a] present EMAILSUM, which contains 2549 email threads collected from the Avocado Research Email Collection, each associated with human-written short and long abstractive summaries. This large-scale, high-quality dataset provides opportunities for data-hungry neural models.

In light of emails often being used for workflow organization and task tracking, some works explore action-focused email summarization, aka TO-DO item generation, which can help users with task management over emails. Mukherjee et al. [2020] propose a Smart TO-DO system, which first detects commitment sentences and then generates to-do items using sequence-to-sequence models.

Highlight: Email is a specific genre of dialogue that aims to organize workflows. Therefore, an email frequently proposes requests, makes commitments and contains action items, which makes understanding email intent of vital importance. Future works should pay more attention to understanding both the fine-grained action items in an email and the coarse-grained intent of the entire email. Besides, better use of quotations can yield significant benefits.

Customer service is a direct one-on-one interaction between a customer and an agent, which frequently happens before and after consumer behavior and is thus important for growing a business. To make customer service more effective, automatic summarization is one option, as it can provide the agent with quick solutions according to previously condensed summaries. Therefore, customer service summarization has gained a lot of research interest in recent years.

On the one hand, participants in customer service have strong intents and clear motivations to address issues, which makes customer service dialogues inherently logical and centered around specific topics. To this end, some works explore topic modeling for this task. Liu et al.
[2019a] employ a coarse-to-fine generation framework, which first generates a sequence of key points (topics) to indicate the logic of the dialogue and then realizes the detailed summary. For example, a key point sequence can be question→solution→user approval→end, which clearly shows the evolution of the dialogue. Instead of using explicitly pre-defined topics, Zou et al. [2021b] draw support from neural topic modeling and propose a multi-role topic modeling mechanism to explore implicit topics. To alleviate the data-insufficiency problem, Zou et al. [2021a] propose an unsupervised framework called RankAE, in which topic utterances are first selected according to centrality and diversity simultaneously, and a denoising auto-encoder is then employed to produce the final summaries.

On the other hand, customer service is a kind of task-oriented dialogue, which contains informative entities, covers various domains and involves two distinct types of participants. To integrate dialogue-specific information, Zhao et al. [2021] craft a new dataset annotated with dialogue state knowledge, which is helpful for tracking the fine-grained dialogue information flow and generating faithful summaries. Since participants in customer service play distinct roles, in addition to the overall summary of the whole dialogue, Zhang et al. [2021b] propose an unsupervised framework based on a variational auto-encoder to generate summaries for the customer and the agent respectively. Other researchers directly propose the CSDS dataset, annotated with role-oriented summaries, to capture different speakers' viewpoints.

Highlight: Customer service aims to address the questions raised by customers. Therefore, it naturally has strong motivations, which gives the dialogue a specific way of evolving that follows the interaction between two participants with distinctive characteristics: the customer and the agent. Thus, modeling participant roles, evolution chains and inherent topics is important for this task. Besides, some fine-grained information, such as slots, states and intents, should also be taken into consideration to ensure faithfulness.

Medical dialogues happen between patients and doctors. During this process, doctors are required to record a digital version of the patient's health record, namely the electronic health record (EHR), a burden that leads to both patient dissatisfaction and clinician burnout. To mitigate this challenge, medical dialogue summarization comes to the rescue.

From a coarse-grained perspective, a medical dialogue can be divided into several coherent segments according to different criteria. Liu et al. [2019b] specify the dialogue topics according to the symptoms, such as headache and cough, and design a topic-level attention mechanism to make the decoder focus on one symptom when generating each summary sentence. Kazi and Kahanda [2019] instead choose EHR categories, such as family history and medical history, to label each segment. Specifically, Krishna et al. [2021] name the medical dialogue summary a SOAP note, which stands for Subjective information reported by the patient; Objective observations; Assessments made by the doctor; and a Plan for future care, including diagnostic tests and treatments.

From a fine-grained perspective, several characteristics of medical dialogues should be handled carefully. Firstly, question-answer pairs are the major discourse in medical dialogue, and negations scattered across different utterances are notable parts. To this end, Joshi et al.
[2020] encourage the model to focus on negation words via a negation word attention and explicitly employ a gate mechanism to generate the [NO] word. Secondly, medical terminologies play an essential part in medical dialogues. Joshi et al. [2020] leverage a compendium of medical concepts, the Unified Medical Language System (UMLS), to identify the presence of terminologies and further use an indicator vector to influence the attention distribution. Thirdly, the medical dialogue summary mainly describes core items and concepts in the dialogue; therefore, summarization methods should be biased towards extraction while keeping the advantages of abstractive methods. Enarvi et al. [2020] and Joshi et al. [2020] both enhance the copy mechanism to facilitate copying from the input.

Highlight: Medical dialogue summarization mainly aims at helping doctors quickly finish electronic health records, and the medical dialogue summary should be faithful rather than creative. Therefore, extractive methods combined with simple abstractive methods are preferred. Topic information can serve as a guideline for generating semi-structured summaries. Besides, terminologies and negations in the medical dialogue should be handled carefully.

Section 3 introduced prominent achievements in each domain. In this section, we discuss some new frontiers that meet actual application needs and fit real-world scenarios.

Even though current state-of-the-art summarization systems have already made great progress, they still suffer from the factual inconsistency problem, which distorts or fabricates factual information from the source and is also known as hallucination [Tang et al., 2021a]. Tang et al. [2021b] systematically study the taxonomy of factuality errors for dialogue summarization, which includes the following 8 error types: Missing Information, Redundant Information, Circumstantial Error, Wrong Reference Error, Negation Error, Object Error, Tense Error and Modality Error. Specifically, the last five types of errors notoriously tend to appear in dialogue summaries, which largely hinders the application of dialogue summarization systems. To remedy these issues, future works need specific designs targeting the above errors. Importantly, fine-grained dialogue-specific features, such as personal pronoun information, coreference information and tense information, need to be incorporated into the summarization model. On the one hand, these features can implicitly alleviate the difficulty of dialogue understanding. On the other hand, some features can directly serve as explicitly extracted information to help final summary generation.

Dialogues tend to occur in multi-modal situations, such as audio-visual recordings of meetings. Besides verbal information, non-verbal information can either supplement existing information or provide new information, which effectively enriches the representation of purely textual dialogues. According to whether different modalities can be aligned, multi-modal information can be divided into two categories: synchronous and asynchronous.

Synchronous multi-modal dialogues mainly refer to meetings, which may contain textual transcripts, prosodic audio and visual video. On the one hand, taking the aligned audio and video into consideration can enhance the representation of the transcripts.
On the other hand, both audio and video can provide new insights, such as a person entering the room to join the meeting or an emotional discussion. However, facial features and voiceprint features are highly sensitive personal data, which makes them hard to acquire. Future works can consider multi-modal meeting summarization under the federated learning framework.

Asynchronous multi-modal dialogues refer to different modalities occurring at different times. With the development of communication technology, multi-modal messages such as voice messages, pictures and emoji are frequently used in chat dialogues via applications like Messenger, WhatsApp and WeChat. These messages provide rich information, serving as one part of the dialogue flow. Future works should consider the textual content of voice messages obtained via ASR systems, new entities provided by pictures and emotions associated with emoji to produce meaningful summaries.

Multi-domain learning can mine shared information between different domains and further help the task of a specific domain, which makes it an effective learning method for low-resource scenarios. Thanks to diverse summarization datasets, some works already explore multi-domain learning or domain adaptation for dialogue summarization. We divide this direction into two categories: macro multi-domain learning and micro multi-domain learning.

Macro multi-domain learning aims to use general-domain summarization datasets, like news and scientific papers, to help the dialogue summarization task. The basis for this learning method is that, no matter what data type the inputs belong to, all of these tasks aim to pick out the core content of the original text. However, dialogues have some unique characteristics, such as more coreferences and participant-related features. Therefore, directly using these general datasets may reduce their effectiveness. Future works can first inject some dialogue-specific features, like replacing names with personal pronouns or transforming the original general-domain documents into turn-level documents at the surface level, to better utilize these datasets.

Micro multi-domain learning aims to use dialogue summarization datasets to help one specific dialogue summarization task, for example, using meeting datasets to help with email tasks. As shown in Table 1, diverse dialogue summarization datasets covering various domains have been proposed in recent years. Future works can adopt meta-learning methods or rely on pre-trained language models to unify different datasets and mine common features.

With the acceleration of globalization, dialogues involving multinational participants are becoming increasingly common thanks to sophisticated instantaneous translation systems. Therefore, there is an urgent need to provide people with dialogue summaries in their preferred language. However, current works overwhelmingly focus on English, leaving other languages underexplored. We argue that the current dilemma is mainly caused by the intractable access to multi-lingual data resources.

Firstly, future works should devote efforts to creating a suitable testbed for multi-lingual dialogue summarization. As an initial step, Mehnaz et al. [2021] transform English utterances in the SAMSum dataset into Hindi-English utterances and study chat summarization under the code-switched setting.
From a higher point of view, large-scale, high-quality datasets covering diverse languages should be carefully crafted. Practically speaking, on the one hand, researchers can translate one specific dataset into different languages, followed by automatic and human quality checking, to obtain aligned datasets. On the other hand, researchers can also borrow ideas from unsupervised multi-lingual learning to utilize currently available datasets in different languages. Secondly, future works should set up systematic settings for multi-lingual research, including one-to-one, one-to-many, many-to-one and many-to-many, in which the one-to-one setting can be further divided into mono-lingual and cross-lingual settings [Wang et al., 2022]. Thirdly, plenty of multi-lingual pre-trained language models can be explored for this task. In particular, models that have already been fine-tuned on translation datasets may bring significant benefits.

This article presents the first comprehensive survey on the progress of dialogue summarization. We thoroughly summarize the existing works, which cover various domains, and highlight their challenges respectively. Besides, we summarize currently available datasets and organize two leaderboards. Furthermore, we shed light on some new trends in this research field. We hope this survey can facilitate research on dialogue summarization.

References
Summarizing email conversations with clue words.
The AMI meeting corpus: A pre-announcement.
Multi-view sequence-to-sequence models with conversational structure for abstractive dialogue summarization.
Structure-aware abstractive conversation summarization via discourse and action graphs.
DialogSum: A real-life scenario dialogue summarization dataset.
ConvoSumm: Conversation summarization benchmark and improved abstractive summarization with argument mining.
Dialogue discourse-aware graph model and data augmentation for meeting summarization.
Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts.
Andreas Stolcke, et al. The ICSI meeting corpus.
Xavier Amatriain, and Anitha Kannan. Dr. Summarize: Global summarization of medical dialogue by exploiting local structures.
Automatically generating psychiatric case notes from digital transcripts of doctor-patient conversations.
How domain terminology affects meeting summarization performance.
Generating SOAP notes from doctor-patient conversations using modular summarization techniques.
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
Controllable neural dialogue summarization with personal named entity planning.
SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents.
Unsupervised abstractive meeting summarization with multi-sentence compression and budgeted submodular maximization.
ConFiT: Toward faithful dialogue summarization with linguistically-informed contrastive fine-tuning. ArXiv.
Pontus Stenetorp, and Caiming Xiong. Controllable abstractive dialogue summarization with sketch supervision.
Ahmed Hassan Awadallah, and Dragomir Radev. An exploratory study on long dialogue summarization: What works and what's next.

Acknowledgments
We thank all the anonymous reviewers for their insightful comments. We would like to thank Alexander R. Fabbri, Jiaao Chen and Zhengyuan Liu for sharing their systems' outputs. We would also like to thank Shiyue Zhang for her feedback on email summarization and Libo Qin for his helpful discussion.
This work was supported by the National Key R&D Program of China via grant 2020AAA0106502, the National Natural Science Foundation of China (NSFC) via grant 61976073 and the Shenzhen Foundational Research Funding (JCYJ20200109113441941).