ISPY: Automatic Issue-Solution Pair Extraction from Community Live Chats
Lin Shi, Ziyou Jiang, Ye Yang, Xiao Chen, Yumin Zhang, Fangwen Mu, Hanzhi Jiang, Qing Wang
2021-09-15

Abstract—Collaborative live chats are gaining popularity as a development communication tool. In community live chatting, developers are likely to post issues they encountered (e.g., setup issues and compile issues), and other developers respond with possible solutions. Therefore, community live chats contain rich sets of information for reported issues and their corresponding solutions, which can be quite useful for knowledge sharing and future reuse if extracted and restored in time. However, it remains challenging to accurately mine such knowledge due to the noisy nature of interleaved dialogs in live chat data. In this paper, we first formulate the problem of issue-solution pair extraction from developer live chat data, and propose an automated approach, named ISPY, based on natural language processing and deep learning techniques with customized enhancements, to address the problem. Specifically, ISPY automates three tasks: 1) Disentangle live chat logs, employing a feedforward neural network to disentangle a conversation history into separate dialogs automatically; 2) Detect dialogs discussing issues, using a novel convolutional neural network (CNN), which consists of a BERT-based utterance embedding layer, a context-aware dialog embedding layer, and an output layer; 3) Extract appropriate utterances and combine them as corresponding solutions, based on the same CNN structure but with different feeding inputs. To evaluate ISPY, we compare it with six baselines, utilizing a dataset with 750 dialogs, including 171 issue-solution pairs, from eight open source communities. The results show that, for issue detection, our approach achieves an F1 of 76% and outperforms all baselines by 30%. For solution extraction, our approach achieves an F1 of 63% and outperforms the baselines by 20%.

... messages are reporting issues, and 51% of the messages are proposing alternative issue solutions. As a result, live chat repositories usually contain rich information that sheds light on knowledge regarding frequent issue-solution pairs. Fig. 1 illustrates an example slice of live chat data. In this scenario, Tom encountered trouble when setting up dependencies for Spark, so he posted an issue in the live chat. Jack and Mike both provided solutions as well as suggested examples. With their help, Tom finally resolved the issue. From this conversation slice, we can extract the issue description, highlighted in red, and the alternative solutions, colored in blue. However, it is quite challenging to mine issue-solution pairs from live chats due to the following barriers. (1) Entangled dialogs. Live chat data grows rapidly, and multiple concurrent discussions regarding different issues frequently exist in an interleaved manner. To perform any kind of dialog-level analysis, it is essential to have automated support for identifying and dividing sequential utterances into a set of distinct dialogs, according to the issue topics. (2) Expensive human effort. Chat logs are typically high-volume and contain informal dialogs covering a wide range of technical and complex topics.
It is necessary to leverage manual annotation to guide the construction and training of learning-based algorithms. However, the manual annotation process requires experienced analysts to spend a large amount of time understanding the dialogs thoroughly. Thus, it is very expensive to classify issue-related dialogs. (3) Noisy data. Noisy utterances, such as duplicate and unreadable messages, exist in the chat log and do not provide any valuable information. This noise makes it difficult to analyze and interpret the communicative dialogs.

In this paper, we propose a novel approach, named ISPY (extracting Issue-Solution Pairs from communitY live chats), to automatically extract issue-solution pairs from development community live chats. ISPY addresses the problem with three elaborated sub-tasks: 1) Disentangle live chat logs, employing a feedforward neural network to automatically disentangle a conversation history into separate dialogs; 2) Detect dialogs that are discussing issues, using a novel convolutional neural network (CNN), which consists of a BERT-based utterance embedding layer, a context-aware dialog embedding layer, and an output layer; and 3) Extract appropriate utterances and combine them as corresponding solutions, based on the same CNN structure but with different feeding inputs. To evaluate ISPY, we first collect a dataset with 750 dialogs, including 171 issue-solution pairs, from eight Gitter communities. The results show that, for issue detection, our approach achieves an F1 of 76% and outperforms the baselines by 30%. For solution extraction, our approach achieves an F1 of 63% and outperforms the baselines by 20%. Furthermore, we apply ISPY on three new communities to extensively evaluate its practical usage. ISPY helps provide solutions for 26 recent issues posted on Stack Overflow. Adding the 21K pairs extracted from the former eight communities, we publish over 30K issue-solution pairs extracted from 11 communities in total. We believe that ISPY can facilitate community-based software development by promoting knowledge sharing and shortening the issue-resolving process.

The major contributions of this paper are:
• We formulate the problem of issue-solution pair extraction from developer live chat data. To the best of our knowledge, this is the first study exploring this problem.
• We propose an automated approach, named ISPY, based on a convolutional neural network, which introduces several customized improvements to effectively handle the characteristics of this task.
• We evaluate ISPY by comparing it with six baselines, with superior performance.
• We open-source a replication package and a large dataset with over 30K issue-solution pairs extracted by our tool from 11 active communities on our website: https://github.com/jzySaber1996/ISPY.

In the remainder of the paper, Section II illustrates the problem definition. Section III presents the approach. Section IV sets up the experiments. Section V describes the results and analysis. Section VI illustrates the practical usage. Section VII is the discussion and threats to validity. Section VIII introduces the related work. Section IX concludes our work.

II. PROBLEM DEFINITION

Three main concepts are central to this study's scope: chat log, utterance, and dialog. Developer conversations in one chatting room are recorded in a chat log. As illustrated in Fig. 1, a typical live chat log contains a sequential set of utterances in chronological order.
Each utterance consists of a timestamp, a developer id, and a textual message initiating a question or responding to an earlier message. A chat log includes all the utterances sent among participants who have been chatting in the room. Typically, it contains a large number of utterances, and the utterances might be responding to different threads of conversations. We define a dialog as the conversation between two or more participants toward exploring a particular subject (e.g., resolution of a problem). Equation (1) provides the definitions for the three main concepts:

Chat_log = [u_1, u_2, ..., u_n],   D = { u_i ∈ Chat_log | S(u_i) = s },   u = (timestamp, dev_id, message).   (1)

Specifically, a chat log Chat_log corresponds to a sequence of n utterances in chronological order. A dialog D is a subset of Chat_log, containing only those utterances responding to the same subject s, which can be determined by clustering techniques [11], [12] or probability distribution estimation methods [13]-[15]. S(u_i) denotes the subject of utterance u_i, and each utterance u consists of the timestamp, developer id, and textual message.

Our work targets automatically extracting issue-solution pairs from community live chats. First, we divide one dialog D into two parts, Head and Body:

D = Head ∪ Body,

where Head (also denoted u_h) is the concatenation of all the utterances that the dialog initiator posts before the first reply from other developers, and Body is the set of the remaining utterances. Based on this division, we introduce a simplifying assumption: the issue description appears in the head utterances authored by the dialog initiator, while solution utterances are likely to appear afterward. Following these concepts and this assumption, we formulate the problem of automatic issue-solution pair extraction with three elaborated sub-tasks: 1) Dialog disentanglement: Given the historical chat log Chat_log, disentangle it into separate dialogs {D_1, D_2, ..., D_n}. 2) Issue detection: Given a separate dialog D_i, find a binary function f so that f(Head_i) determines whether the dialog head depicts an issue. 3) Solution extraction: Given a dialog D_i involving an issue discussion, find a function g so that g(Body_i) = {u_s1, u_s2, ..., u_sm}, where u_si is an utterance within the dialog suggesting a potential solution. Therefore, the output of our approach is a set of issue-solution pairs. Ideally, users do not need other information (e.g., the utterances between them) to understand these pairs.

III. APPROACH

The construction of ISPY consists of four main steps, as illustrated in Fig. 2. The first step includes data preprocessing and dialog disentanglement using a feedforward model. The second step is to construct an utterance embedding layer, which embeds tokenized utterances into vectors with local window context. The third step is to construct a dialog embedding layer, which defines and extracts three sets of features characterizing the context of potential issues or solutions. The fourth step is the output layer, which predicts two outputs, i.e., the probability of an issue description and the probability of a solution, for the corresponding inputs. By feeding the dialog head and body into ISPY separately, we obtain two models: an issue model and a solution model. Finally, ISPY applies the issue model and the solution model to construct pairs.

A. Dialog Disentanglement

1) Data Preprocessing: For data preprocessing, we first follow the standard pipeline of stopword removal, typo correction, lowercase conversion, and lemmatization with spaCy [17].
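Before moving on, the Head/Body division formulated above can be made concrete with a minimal Python sketch (the data classes and function are our own illustration, not the ISPY implementation):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Utterance:
    timestamp: str   # when the message was posted
    dev_id: str      # developer who posted it
    message: str     # textual content

def split_head_body(utterances: List[Utterance]) -> Tuple[str, List[Utterance]]:
    """Split a disentangled dialog into Head and Body: Head is the
    concatenation of all utterances the initiator posts before the
    first reply from any other developer; Body is the rest."""
    initiator = utterances[0].dev_id
    head, body = [], list(utterances)
    while body and body[0].dev_id == initiator:
        head.append(body.pop(0))
    u_h = " ".join(u.message for u in head)
    return u_h, body
```

Everything the initiator posts before the first foreign reply is concatenated into u_h; the remaining utterances form the Body that the solution model later scans.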
Returning to preprocessing: due to the unique characteristics of live chat data, we employ additional techniques to handle special issues. Specifically, we first replace low-frequency tokens such as URLs, email addresses, code, HTML tags, and version numbers with the specific tokens [URL], [EMAIL], [CODE], [HTML], and [ID], respectively. Second, we replace acronyms with their full names by referring to the Oxford abbreviation library [18]. Following previous work [19], [20], we normalize emojis into standard ASCII strings. Finally, we combine consecutive utterances that are broken from one sentence, according to perplexity scores [21] calculated by Baidu AI Cloud [22]. Following the experience of a recent study [23], we use a perplexity score of 40 as the threshold: broken sentences whose combined perplexity is lower than 40 are merged.

2) Dialog Disentanglement Model: Utterances from a single conversation thread are usually interleaved with other ongoing conversations. In this step, we focus on dividing chat utterances into a set of distinct conversations, leveraging Kummerfeld et al.'s technique [24]. Their model is trained on 77,563 manually annotated utterances of disentangled dialogs from online chatting. It is a feedforward neural network with 2 layers, 512-dimensional hidden vectors, and softsign non-linearities. The input of the model is a 77-dimensional feature vector.

TABLE I (excerpt): Examples of heuristic attributes.
Attribute       | Description                                                              | Example
Keyword (5W1H)  | Occurrence of "What", "Why", "When", "Who", "Which", and "How"           | {what=1, why=0, when=0, who=0, which=0, how=0}
Punctuation     | Occurrence of "?" and "!"                                                | {"?"=1, "!"=0}
Greeting        | Occurrence of greeting words: "Hello", "Good Morning", "Hi Guys", etc.   | False
Disapproval     | Occurrence of disapproval words: "no", "can't work", "break down", etc.  | False
Mention         | Occurrence of "simi-" and "same"                                         | n/a

B. Utterance Embedding Layer

This layer aims to encode not only the textual information of utterances but also their contextual information.

Utterance Encoding. First, for all the utterances in one dialog D = [u_h, u_b1, u_b2, ..., u_bn], we encode each using a pre-trained BERT model [25], as BERT has proved successful in many natural language processing tasks [26], [27]. The BERT model is a bidirectional transformer trained with a combination of Masked Language Model and Next Sentence Prediction objectives. It is trained on English Wikipedia with nearly 2,500M words. The BERT embedding layer outputs u ∈ R^d, an 800-dimensional vector for each utterance.

Local Window Context. Second, we model the contextual information of an utterance through the concept of a local window, and use the size of the local window as a hyperparameter. Intuitively, the consecutive replies to an issue utterance may be very different from those to non-issue ones. Therefore, we construct a local window context to characterize the dynamic contextual information for extracting desired utterances in a dialog. Specifically, we use a fixed-length local window, and define the local window of utterance u_i as win_i by joining u_i with its k preceding and k following neighbor utterances; the fixed length is 2k+1. When the window runs out of bounds, we use zero padding [28] to maintain the fixed length. In this study, we choose k = 1 for the local window.

C. Dialog Embedding Layer

This layer aims to encode utterance features in a multifaceted way to more comprehensively represent the live chat context. To achieve this, we define and extract features from three categories: textual, heuristic, and contextual. The three extractors are detailed next.
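First, though, a short sketch of the local window construction described above (function name and dimensions are illustrative assumptions, not the authors' code):

```python
import numpy as np

def local_windows(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Build a (2k+1)-utterance window around each utterance embedding,
    zero-padding at the dialog boundaries (k=1 as in the paper)."""
    n, d = embeddings.shape
    padded = np.vstack([np.zeros((k, d)), embeddings, np.zeros((k, d))])
    # window i covers utterances i-k .. i+k in the original indexing
    return np.stack([padded[i:i + 2 * k + 1] for i in range(n)])

# Example: a dialog of 4 utterances with 800-dim BERT embeddings
windows = local_windows(np.random.rand(4, 800), k=1)
print(windows.shape)  # (4, 3, 800)
```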
1) Textual Feature Extractor: To learn basic textual features for each utterance, we first represent utterances using TextCNN [29], a classical method for sentence modeling that uses a shallow convolutional neural network (CNN) [30] to model sentence representations. It is advantageous when labeled data is insufficient, since it employs a concise network structure and a small number of parameters. TextCNN uses several convolution kernels to capture local information as the receptive field; the global representation is then produced from the local information. Given a kernel w ∈ R^h with kernel size h and a word embedding x = u_i, one convolution feature γ_t is generated from x_{t:t+h−1}:

γ_t = ReLU(w · x_{t:t+h−1} + b),

where b ∈ R is the bias parameter and ReLU is the activation function. We concatenate all the γ_t as a feature map:

γ = [γ_1, γ_2, ..., γ_{n−h+1}],

where γ ∈ R^{n−h+1}. We use a max-pooling strategy to calculate γ̂ = max(γ). We set the number of kernels to m, and feed u_i into three convolution-pooling layers. The kernel numbers of the three layers are 1024, 512, and 256, respectively. The output of the final layer, Γ ∈ R^m, is a 256-dimensional textual feature vector.

2) Heuristic Attribute Extractor: The heuristic attribute extractor aims to augment the dialog embedding results by incorporating high-level semantic attributes from five aspects: Keyword, Structure, Topic, Sentiment, and Role, as elaborated in Table I. (1) Keyword: the occurrence of indicative words or characters, e.g., 5W1H words and punctuation. (2) Structure: the structural characteristics of utterances in a dialog, such as the number of tokens and the positions of the utterances. (3) Topic: the intuition of this attribute is to distinguish off-topic utterances. We first calculate TF-IDF [31] for each unique word in the entire chat. Then we extract the top-10 most frequent words and combine them as a 10-dimensional topic vector TD_c. Similarly, we also extract the top-10 most frequent words from the dialog head (TD_h) and from the given utterance (TD_u). Finally, we calculate the Euclidean distances [32] among these vectors (e.g., ||TD_h − TD_u||_2) as the topic deviation. (4) Sentiment: the sentiment of the given utterance in terms of positive, intermediate, and negative polarity. (5) Role: the role of the participant who posts the utterance. By concatenating the above heuristic attributes, this extractor outputs a 29-dimensional vector.

3) Contextual Feature Extractor: The contextual feature extractor aims to embed the contextual information of each utterance. We use local attention [33] to represent the context. The local attention mechanism focuses on the impact of the neighbor utterances located in the same window, and represents the semantic context at low time and memory cost. An attention function can be described as mapping a query and a set of key-value pairs to an output [34], i.e., a triple (h_Q, h_K, h_V). The function uses the query vector h_Q, calculated from the given utterance u_i ∈ R^d, to compute attention scores against the key vector h_K. The key vector h_K is calculated from each utterance u_s ∈ R^d within the local window (i − k ≤ s ≤ i + k). The attention output is calculated by weighting the value vectors h_V with the attention scores and summing them.
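A minimal NumPy sketch of this local attention, anticipating the formal definitions below (random matrices stand in for the trained W_Q, W_K, W_V, and the window is clipped rather than zero-padded at dialog boundaries for brevity):

```python
import numpy as np

def local_attention(U, i, k=1, delta=128, sigma=1.0, seed=0):
    """Dot-product scores weighted by a Gaussian distance over the
    window [i-k, i+k], softmax-normalized, then used to weight the
    value vectors. U holds one utterance embedding per row."""
    rng = np.random.default_rng(seed)
    d = U.shape[1]
    W_Q, W_K, W_V = (0.01 * rng.standard_normal((delta, d)) for _ in range(3))
    h_Q = W_Q @ U[i]
    window = range(max(0, i - k), min(len(U), i + k + 1))
    scores = np.array([(h_Q @ (W_K @ U[s]))
                       * np.exp(-((s - i) ** 2) / (2 * sigma ** 2))
                       for s in window])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                       # softmax normalization
    values = np.array([W_V @ U[s] for s in window])
    return alpha @ values                      # c_i in R^delta

c_i = local_attention(np.random.rand(5, 800), i=2)  # 128-dim context vector
```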
Formally, we define the trainable query matrix W_Q ∈ R^{δ×d}, key matrix W_K ∈ R^{δ×d}, and value matrix W_V ∈ R^{δ×d}, and calculate the triple (h_Q, h_K, h_V):

h_Q = W_Q u_i,   h_K = W_K u_s,   h_V = W_V u_s,

where the output contextual weight vector c_i ∈ R^δ is calculated as:

c_i = Σ_{s=i−k..i+k} α_s h_V(s),   α_s = softmax_s( (h_Q · h_K(s)) · exp(−(s − i)^2 / (2σ^2)) ).   (9)

Equation (9) shows the three steps of the local attention calculation: (1) compute the attention scores score(h_Q, h_K) as the dot product multiplied by the Gaussian distance between s and i; (2) normalize the score vector with softmax; and (3) apply the normalized scores to weight and sum the values. We set d = 800 and δ = 128, and obtain a 128-dimensional context vector.

D. Output Layer

Finally, we concatenate the vectors output by the three extractors into a 413-dimensional feature vector u′. We feed the feature vector u′ into two fully-connected (FC) layers, and use two softmax functions to calculate the probability of an issue-description utterance and of the corresponding solution utterances, respectively:

P(I|u_h) = softmax(FC_I(u′_h)),   P(S|u_bi) = softmax(FC_S(u′_bi)),

where P(I|u_h) is the predicted probability of an issue-description utterance, and P(S|u_bi) (u_bi ∈ {u_b1, ..., u_bn}) is the predicted probability of a solution utterance. Cross-entropy loss is applied to both tasks to measure the difference between ground truth and prediction. The two loss functions are defined as Loss_I and Loss_S:

Loss_I = −[ y_h log P(I|u_h) + (1 − y_h) log(1 − P(I|u_h)) ],
Loss_S = −Σ_i [ y_i log P(S|u_bi) + (1 − y_i) log(1 − P(S|u_bi)) ],

where y_h and y_i indicate the ground-truth labels of the utterances. The issue model and the solution model are trained separately until convergence. When fully trained, for a given chat log, ISPY automates the three sub-tasks formulated in Section II: first, it performs dialog disentanglement; second, for each disentangled dialog, it uses the issue model to predict whether the dialog head is an issue description; and finally, if the issue model predicts positive, it uses the trained solution model to predict which utterances should be selected into the solution. As the final output, it combines the predicted utterances as the corresponding solution to the issue.

IV. EXPERIMENT SETUP

To evaluate the proposed ISPY approach, our evaluation specifically addresses three research questions:
RQ1: What is the performance of ISPY in detecting issue dialogs from live chat data?
RQ2: What is the performance of ISPY in extracting solutions for a given issue?
RQ3: How does each individual component in ISPY contribute to the overall performance?

A. Data Preparation

1) Studied Communities: Many OSS communities utilize Gitter [35] or Slack [36] as their live communication means. Considering its popular, open, and free-access nature, we select the studied communities from Gitter. To identify them, we select the Top-1 most participated community from each of eight active domains, covering front-end framework, mobile, data science, DevOps, blockchain platform, collaboration, web app, and programming language. Then, we collect the daily chat utterances from these communities. Gitter provides a REST API [37] to get data about chatting rooms and posted utterances. In this study, we use the REST API to acquire the chat utterances of the eight selected communities; the retrieved dataset contains all utterances as of 2020-12.

2) Bootstrap Sampling: After dialog disentanglement, the number of separate chat dialogs is large. Limited by the human resources available for labeling, we randomly sample 100 dialogs from each community.
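For reference, collecting utterances through Gitter's REST API can be sketched as follows (the room id and token are placeholders; pagination and error handling are minimal, and endpoint details should be checked against Gitter's API documentation):

```python
from typing import Optional
import requests

GITTER_API = "https://api.gitter.im/v1"

def fetch_messages(room_id: str, token: str,
                   before_id: Optional[str] = None) -> list:
    """Fetch up to 100 chat messages from a Gitter room, optionally
    paginating backwards with the beforeId cursor."""
    params = {"limit": 100}
    if before_id:
        params["beforeId"] = before_id
    resp = requests.get(
        f"{GITTER_API}/rooms/{room_id}/chatMessages",
        headers={"Authorization": f"Bearer {token}"},
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # each item carries an id, sender, timestamp, text
```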
From these samples, we then exclude unreadable dialogs: 1) dialogs written in non-English languages; 2) dialogs that contain too much code or stack traces; 3) low-quality dialogs, such as dialogs with many typos and grammatical errors; and 4) dialogs that involve channel robots. However, the dataset is imbalanced: non-issue dialogs far outnumber issue dialogs, as shown in Table II. Therefore, we apply a bootstrap sampling strategy [38] for data balancing, randomly sampling issue dialogs with replacement until the numbers of issue dialogs and non-issue dialogs are balanced.

3) Ground-truth Labeling: For each sampled dialog, we first manually label whether its head discusses a certain issue. Then, for each issue dialog, we label the utterances that should be included in the solution. The labeled results are used as the ground-truth dataset for performance evaluation. To guarantee the correctness of the labeling results, we built an inspection team consisting of four Ph.D. candidates. All of them are fluent English speakers, and have either done intensive research work on software development or actively contributed to open-source projects. We divided the team into two groups; the labeled results from each group were reviewed by the other. When a labeled result received different opinions, we hosted a discussion with all team members and decided by voting. The average Cohen's Kappa on issue-dialog labeling is 0.85, and the average Cohen's Kappa on solution-utterance labeling is 0.83. In total, we collected 173,278 dialogs from eight open-source communities, and spent 720 person-hours annotating 750 dialogs, including 171 issue-solution pairs. Table II presents the details of our dataset: the number of participants (Par.), dialogs (Dial.), and utterances (Utter.) for the entire population, as well as the numbers of issue and non-issue dialogs with the corresponding utterances for the sample population. Moreover, to contribute to the eight communities, we apply ISPY to the 173,278 dialogs, and extract and publish 21K issue-solution pairs on our website.

The first two RQs require comparing ISPY with state-of-the-art baselines. Due to the slightly different focuses of RQ1 and RQ2, we employ three common baselines applicable to both, as well as three additional baselines for each RQ, leading to a total of six baselines per RQ. Common baselines applicable to RQ1 and RQ2: three commonly used machine-learning baselines are utilized to comprehensively examine the classification performance, i.e., Naive Bayes (NB) [39], Random Forest (RF) [40], and Gradient Boosting Decision Tree (GBDT) [41]. Additional baselines for detecting issues (RQ1): Casper [42] is a method for extracting and synthesizing user-reported mini-stories regarding app problems from reviews. We use utterances as the extracted events, and treat its second step, i.e., classifying problems, as a baseline; we use the implementation provided by the original paper [43]. CNC_PD [44] is the state-of-the-art learning technique for classifying sentences in comments taken from online issue reports. The authors proposed a CNN-based [45] approach to classify sentences into seven categories of intentions: Feature Request, Solution Proposal, Problem Discovery, etc. We treat the CNN classifier that predicts utterances in the Problem Discovery category as a baseline for detecting issues. DECA_PD [46] is the state-of-the-art rule-based technique for analyzing development email content.
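As an aside, the bootstrap balancing applied during data preparation reduces to a few lines (our own sketch, assuming dialogs are held in Python lists):

```python
import random

def balance_by_bootstrap(issue_dialogs: list, non_issue_dialogs: list,
                         seed: int = 42) -> tuple:
    """Resample the minority class (issue dialogs) with replacement
    until both classes have the same size."""
    random.seed(seed)
    resampled = list(issue_dialogs)
    while len(resampled) < len(non_issue_dialogs):
        resampled.append(random.choice(issue_dialogs))
    return resampled, non_issue_dialogs
```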
DECA_PD is used to classify the sentences of emails into problem discovery, solution proposal, information giving, etc., by using linguistic rules. We use the six linguistic rules [47] for identifying "problem discovery" dialog heads as our baseline. Additional baselines for extracting solutions (RQ2): UIT [48] is a context-representative classifier that uses GloVe [49] to embed words and TextCNN to embed the utterance. UIT classifies utterances into 12 categories; specifically, we choose the "potential answer" classifier as a solution-extraction baseline. CNC_SP is the Solution Proposal classifier in [44]. DECA_SP is the set of 51 linguistic rules for identifying "solution proposal" sentences in [46].

We use three commonly-used metrics to evaluate performance: (1) Precision, the ratio of the number of correct predictions to the total number of predictions; (2) Recall, the ratio of the number of correct predictions to the total number of samples in the golden test set; and (3) F1, the harmonic mean of precision and recall. When comparing performance, we care most about F1, since it balances precision and recall. Note that, since the number of utterances may vary largely across dialogs, we calculate the performance of solution extraction at the community level. For all experiments, we apply cross-project evaluation on our dataset: we iteratively select one project as the test set and use the remaining seven projects for training. The experiment environment is a Windows 10 desktop computer with an NVIDIA GeForce RTX 2060 GPU, an Intel Core i7 CPU, and 32GB RAM. To answer RQ1, we train on the issue dataset with a batch size of 8. Each convolution and dense layer uses a dropout rate of 0.6 to avoid overfitting. We use the Adam optimizer with a learning rate of 0.001 and β1 = 0.9, and train ISPY for 100 epochs.

TABLE III (upper half): Issue detection performance (Precision/Recall/F1, %) on the eight studied communities (per-community columns in dataset order) and their average.
Method  | C1       | C2       | C3       | C4       | C5        | C6       | C7       | C8       | Avg.
ISPY    | 76/77/76 | 75/68/71 | 84/74/79 | 77/68/72 | 82/73/77  | 80/69/74 | 79/70/74 | 86/78/82 | 80/72/76
NB      | 36/40/38 | 41/30/35 | 47/36/41 | 70/56/62 | 08/25/13  | 22/42/29 | 30/50/37 | 15/40/22 | 34/40/36
RF      | 56/25/34 | 69/30/42 | 75/23/35 | 84/44/58 | 100/17/29 | 50/25/33 | 33/13/18 | 23/30/26 | 61/26/36
GBDT    | 27/75/40 | 40/70/51 | 50/79/61 | 73/44/55 | 21/76/33  | 19/67/29 | 30/88/44 | 18/90/30 | 35/65/46
Casper  | 39/35/37 | 08/03/05 | 59/26/36 | 46/40/43 | 19/42/26  | 14/17/15 | 05/06/06 | 15/40/22 | 26/26/26
CNC_PD  | 20/55/29 | 23/50/32 | 23/36/28 | 12/32/17 | 24/42/30  | 12/42/19 | 10/50/17 | 05/40/10 | 16/43/24
DECA_PD | 33/50/40 | 28/37/31 | 33/36/34 | 64/28/39 | 42/42/42  | 44/67/53 | 32/50/39 | 04/10/06 | 35/40/n/a

For RQ3, we compare ISPY with three variants: 1) ISPY-CNN, which removes the textual feature extractor from ISPY; 2) ISPY-Heu, which removes the heuristic attribute extractor; and 3) ISPY-LocalAttn, which removes the contextual feature extractor. The three variants use the same parameters during training.

V. RESULTS AND ANALYSIS

The upper half of Table III demonstrates the comparison between the performance of ISPY and that of the six baselines across data from eight OSS communities for the issue detection task. The columns correspond to Precision, Recall, and F1 score for each community. We then conduct a normality test and t-tests between each pair of methods. Overall, the data follow a normal distribution (p = 0.32; significance threshold p < 0.05), and ISPY significantly (p = 10^−20) outperforms the six baselines in terms of average Precision, Recall, and F1 score.
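The cross-project protocol behind these comparisons can be sketched as a leave-one-project-out loop (train_fn and predict_fn are placeholders for ISPY's training and inference, not the authors' code):

```python
from statistics import harmonic_mean

def cross_project_eval(projects: dict, train_fn, predict_fn) -> dict:
    """Each community serves once as the test set while the other
    seven are used for training; returns per-community (P, R, F1)."""
    results = {}
    for name, test_data in projects.items():
        train_data = [d for p, data in projects.items()
                      if p != name for d in data]
        model = train_fn(train_data)
        tp, fp, fn = predict_fn(model, test_data)  # confusion counts
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = harmonic_mean([precision, recall]) if precision and recall else 0.0
        results[name] = (precision, recall, f1)
    return results
```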
Specifically, when comparing with the best Precision performer among the six baselines, i.e., RF, ISPY improves the average precision by 19%. Similarly, ISPY improves on the best Recall performer, i.e., GBDT, by 7% in average recall, and on the best F1 performer, i.e., GBDT, by 30% in average F1 score. At the individual project level, ISPY achieves the best performance on most of the eight communities. These results indicate that ISPY can detect whether a dialog is discussing an issue more accurately than all comparison baselines. We believe the performance advantage of ISPY is mainly attributable to the rich representativeness of its internal construction, from two perspectives: (1) ISPY can accurately capture the semantic relationship between the issue description and its first reply by using the local window and local attention mechanisms. This enables it to learn more comprehensive contextual knowledge, e.g., what kind of first issue-reply indicates a dialog head containing an issue description, and therefore contributes to more accurate classification. (2) ISPY augments the textual vectors with high-level semantic information by employing 15 heuristic attributes. For example, three of the 15 attributes characterize sentiment. This design is based on the observation that issue descriptions are likely to contain negative tones such as "fail", "error", and "annoy". By calculating three types of polarity sentiment scores, as shown in Table I, the sentiment attributes can be fed into the deep learning network to help with issue-description detection.

Answering RQ1: ISPY outperforms the six baselines in detecting issue dialogs across most of the studied projects; the average Precision, Recall, and F1 are 80%, 72%, and 76%, respectively, improving on the best F1 baseline, GBDT, by 30% in average F1 score.

Similarly, the bottom part of Table III summarizes the comparison between the performance of ISPY and that of the six baselines across the eight OSS communities for the solution extraction task. We can see that ISPY achieves the highest performance in most of the columns and significantly (p = 10^−5) outperforms the six baselines. On average, although ISPY is slightly below GBDT in Recall (by 3%), it reaches the highest F1 score (63%), improving on the best baseline, RF, by 20%. It also reaches the highest precision (68%), significantly higher than the other baselines (which range from 17% to 37%). These results imply that ISPY can effectively extract utterances as corresponding solutions from development dialogs: (1) our approach is sensitive in identifying solutions, including those spanning consecutive utterances.

[Figure: Issue P, Issue R, and Issue F1 for ISPY and its variants.]

... pairs from online forums based on an SVM classifier and content-quality ranking. Cong et al. [68] proposed a sequential pattern-based classification method to detect questions in a forum thread, and a graph-based propagation method to detect answers for those questions in the same thread. Since previous studies extracted only questions in interrogative form, Kwong et al. [69] extended the scope of question and answer detection and pairing to also encompass questions in imperative and declarative forms. Henß et al. [70] presented an approach to automatically extract FAQs from software development mailing lists.
These approaches utilize the characteristics of their corpora and are best fit for their specific tasks, but they are limited to those corpora and tasks, so their methods cannot be directly transferred to the task of extracting issue-solution pairs from community chats.

IX. CONCLUSION

In this paper, we propose an approach, named ISPY, to automatically extract issue-solution pairs from development community live chats. ISPY leverages a novel convolutional neural network, incorporating a basic CNN network with 15 heuristic attributes and a local attention mechanism to handle the characteristics of this task. We build a dataset with 750 dialogs, including 171 issue-solution pairs, and evaluate ISPY on it. The evaluation results show that our approach outperforms both the issue-detection baselines and the solution-extraction baselines by substantial margins. By applying ISPY, we also automatically generate a dataset with over 30K issue-solution pairs extracted from 11 community live chats, and we utilize the dataset to provide solutions for 26 recent issues posted on Stack Overflow.

REFERENCES
- Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineering Tools
- Finding Help with Programming Errors: An Exploratory Study of Novice Software Engineers' Focus in Stack Overflow Posts
- Why Developers Are Slacking Off: Understanding How Software Teams Use Slack
- On the Use of Internet Relay Chat (IRC) Meetings by Developers of the GNOME GTK+ Project
- Studying the Use of Developer IRC Meetings in Open Source Projects
- 'How Was Your Weekend?' Software Development Teams Working From Home During COVID-19
- An Empirical Study of Developer Discussions in the Gitter Platform
- A First Look at Developers' Live Chat on Gitter
- Rationale in Development Chat Messages: An Exploratory Study
- How do Developers Discuss Rationale?
- DialBERT: A Hierarchical Pre-Trained Model for Conversation Disentanglement
- End-to-End Transition-Based Online Dialogue Disentanglement
- A Large-Scale Corpus for Conversation Disentanglement
- Chat Disentanglement: Identifying Semantic Reply Relationships with Random Forests and Recurrent Neural Networks
- Online Conversation Disentanglement with Pointer Networks
- Explosion
- Abbreviations - Oxford English Dictionary
- Emojis influence emotional communication, social attributions, and information processing
- Emoji helps! A multi-modal siamese architecture for tweet user verification
- Perplexity of n-Gram and Dependency Language Models
- Exploring the Limits of Language Modeling
- A Large-Scale Corpus for Conversation Disentanglement
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Cost-Sensitive BERT for Generalisable Sentence Classification on Imbalanced Data
- Transfer Learning in Biomedical Named Entity Recognition: An Evaluation of BERT in the PharmaCoNER task
- User Intent Prediction in Information-seeking Conversations
- Convolutional Neural Networks for Sentence Classification
- ImageNet Classification with Deep Convolutional Neural Networks
- Term-weighting Approaches in Automatic Text Retrieval
- From Word Embeddings to Document Distances
- Neural Machine Translation by Jointly Learning to Align and Translate
- Attention is All You Need
- REST API
- Machine Learning, a Probabilistic Perspective
- A Comparison of Event Models for Naive Bayes Text Classification
- Classification and Regression by randomForest
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- Caspar: Extracting and Synthesizing User Stories of Problems from App Reviews
- Stories in App Reviews
- Automating Intention Mining
- Neural Networks: A Comprehensive Foundation
- Development Emails Content Analyzer: Intention Mining in Developer Discussions (T)
- UZH-s.e.a.l. - Development Emails Content Analyzer (DECA)
- Analyzing and Characterizing User Intent in Information-seeking Conversations
- GloVe: Global Vectors for Word Representation
- Technical Q&A Site Answer Recommendation via Question Boosting
- Multi-objective Code Reviewer Recommendations: Balancing Expertise, Availability and Collaborations
- Automatically Recommending Peer Reviewers in Modern Code Review
- Considering Dependencies between Bug Reports to Improve Bugs Triage
- Automatic Bug Triage in Software Systems Using Graph Neighborhood Relations for Feature Augmentation
- Handbook on Ontologies
- A Semantic Web Primer
- ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding
- Software-related Slack Chats with Disentangled Conversations
- Detection of Hidden Feature Requests from Massive Chat Messages via Deep Siamese Network
- Finding Question-answer Pairs from Online Forums
- Detecting User Story Information in Developer-client Conversations to Generate Extractive Summaries
- Mining StackOverflow to Filter out Off-topic IRC Discussion
- Mining User Opinions in Mobile App Reviews: A Keyword-Based Approach (T)
- Phrase-based Extraction of User Opinions in Mobile App Reviews
- Online App Review Analysis for Identifying Emerging Issues
- Emerging App Issue Identification from User Feedback: Experience on WeChat
- Detection of Question-Answer Pairs in Email Conversations
- Extracting Chatbot Knowledge from Online Discussion Forums
- Detection of Imperative and Declarative Question-answer Pairs in Email Conversations
- Semi-automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge

ACKNOWLEDGMENT
We deeply appreciate the anonymous reviewers for their constructive and insightful suggestions toward improving this manuscript. This work is supported by the National Key