ISPY: Automatic Issue-Solution Pair Extraction from Community Live Chats
Lin Shi, Ziyou Jiang, Ye Yang, Xiao Chen, Yumin Zhang, Fangwen Mu, Hanzhi Jiang, Qing Wang
2021-09-15

Abstract—Collaborative live chats are gaining popularity as a development communication tool. In community live chatting, developers are likely to post issues they encountered (e.g., setup issues and compile issues), and other developers respond with possible solutions. Therefore, community live chats contain rich sets of information for reported issues and their corresponding solutions, which can be quite useful for knowledge sharing and future reuse if extracted and restored in time. However, it remains challenging to accurately mine such knowledge due to the noisy nature of interleaved dialogs in live chat data. In this paper, we first formulate the problem of issue-solution pair extraction from developer live chat data, and propose an automated approach, named ISPY, based on natural language processing and deep learning techniques with customized enhancements, to address the problem. Specifically, ISPY automates three tasks: 1) Disentangle live chat logs, employing a feedforward neural network to disentangle a conversation history into separate dialogs automatically; 2) Detect dialogs discussing issues, using a novel convolutional neural network (CNN), which consists of a BERT-based utterance embedding layer, a context-aware dialog embedding layer, and an output layer; 3) Extract appropriate utterances and combine them as corresponding solutions, based on the same CNN structure but with different feeding inputs. To evaluate ISPY, we compare it with six baselines, utilizing a dataset with 750 dialogs, including 171 issue-solution pairs, from eight open source communities. The results show that, for issue detection, our approach achieves an F1 of 76% and outperforms all baselines by 30%. For solution extraction, our approach achieves an F1 of 63% and outperforms the baselines by 20%.

... messages are reporting issues, and 51% of the messages are proposing alternative issue solutions. As a result, live chat repositories usually contain rich information that sheds light on knowledge regarding frequent issue-solution pairs. Fig. 1 illustrates an example slice of live chat data. In this scenario, Tom encountered trouble when setting up dependencies for Spark, so he posted an issue in the live chat. Jack and Mike both provided solutions as well as suggested examples. With their help, Tom finally resolved the issue. From this conversation slice, we can extract the issue description, highlighted in red, and the alternative solutions, colored in blue. However, it is quite challenging to mine issue-solution pairs from live chats due to the following barriers. (1) Entangled dialogs. Live chat data grows rapidly, and multiple concurrent discussions regarding different issues frequently exist in an interleaved manner. To perform any kind of dialog-level analysis, it is essential to have automated support for identifying and dividing sequential utterances into a set of distinct dialogs, according to the issue topics. (2) Expensive human effort. Chat logs are typically high-volume and contain informal dialogs covering a wide range of technical and complex topics.
It is necessary to leverage manual annotation to guide the construction and training of learning-based algorithms. However, the manual annotation process requires experienced analysts to spend a large amount of time understanding the dialogs thoroughly. Thus, it is very expensive to classify issue-related dialogs. (3) Noisy data. Noisy utterances, such as duplicate and unreadable messages, exist in the chat log and do not provide any valuable information. This noise makes it difficult to analyze and interpret the communicative dialogs.

In this paper, we propose a novel approach, named ISPY (extracting Issue-Solution Pairs from communitY live chats), to automatically extract issue-solution pairs from development community live chats. ISPY addresses the problem with three elaborated sub-tasks: 1) Disentangle live chat logs, employing a feedforward neural network to automatically disentangle a conversation history into separate dialogs; 2) Detect dialogs that are discussing issues, using a novel convolutional neural network (CNN), which consists of a BERT-based utterance embedding layer, a context-aware dialog embedding layer, and an output layer; and 3) Extract appropriate utterances and combine them as corresponding solutions, based on the same CNN structure but with different feeding inputs. To evaluate ISPY, we first collect a dataset with 750 dialogs, including 171 issue-solution pairs, from eight Gitter communities. The results show that, for issue detection, our approach achieves an F1 of 76% and outperforms the baselines by 30%. For solution extraction, our approach achieves an F1 of 63% and outperforms the baselines by 20%. Furthermore, we apply ISPY on three new communities to extensively evaluate its practical usage. ISPY helps provide solutions for 26 recent issues posted on Stack Overflow. Adding the 21K pairs extracted from the former eight communities, we publish over 30K issue-solution pairs extracted from 11 communities in total. We believe that ISPY can facilitate community-based software development by promoting knowledge sharing and shortening the issue-resolving process.

The major contributions of this paper are:
• We formulate the problem of issue-solution pair extraction from developer live chat data. To the best of our knowledge, this is the first study exploring this problem.
• We propose an automated approach, named ISPY, based on a convolutional neural network, which introduces several customized improvements to effectively handle the characteristics of this task.
• We evaluate ISPY by comparing it with six baselines, with superior performance.
• We open-source a replication package and a large dataset with over 30K issue-solution pairs extracted by our tool from 11 active communities on our website: https://github.com/jzySaber1996/ISPY.

In the remainder of the paper, Section II illustrates the problem definition. Section III presents the approach. Section IV sets up the experiments. Section V describes the results and analysis. Section VI illustrates the practical usage. Section VII is the discussion and threats to validity. Section VIII introduces the related work. Section IX concludes our work.

II. PROBLEM DEFINITION

Three main concepts are central to this study's scope: chat log, utterance, and dialog. Developer conversations in one chatting room are recorded in a chat log. As illustrated in Fig. 1, a typical live chat log contains a sequential set of utterances in chronological order.
Each utterance consists of a timestamp, a developer id, and a textual message initiating a question or responding to an earlier message. A chat log includes all the utterances sent among participants who have been chatting in the room. Typically, it contains a large number of utterances, and the utterances might be responding to different threads of conversations. We define a dialog as the conversation between two or more participants toward exploring a particular subject (e.g., resolution of a problem). Equation (1) provides the definitions for the three main concepts:

Chat_log = [u_1, u_2, ..., u_n],   D = { u_i ∈ Chat_log | S(u_i) = s },   u = (timestamp, dev_id, message).   (1)

Specifically, a chat log Chat_log corresponds to a sequence of n utterances in chronological order. A dialog D is a subset of Chat_log, containing only those utterances responding to the same subject s, which can be determined by clustering techniques [11], [12] or probability distribution estimation methods [13]-[15]. S(u_i) denotes the subject of utterance u_i, and each utterance u consists of the timestamp, developer id, and textual message.

Our work targets automatically extracting issue-solution pairs from community live chats. First, we divide one dialog D into two parts, Head and Body:

D = Head ∪ Body,

where Head (also denoted u_h) is the concatenation of all the utterances that the dialog initiator posts before the first reply from other developers, and Body is the set of the remaining utterances. Based on this division, we introduce a simplifying assumption: the issue description appears in the head utterances authored by the dialog initiator, while solution utterances are likely to appear afterward. Following these concepts and this assumption, we formulate the problem of automatic issue-solution pair extraction with three elaborated sub-tasks: 1) Dialog disentanglement: Given the historical chat log Chat_log, disentangle it into separate dialogs {D_1, D_2, ..., D_n}. 2) Issue detection: Given a separate dialog D_i, find a binary function f so that f(Head_i) determines whether the dialog head depicts an issue. 3) Solution extraction: Given a dialog D_i involving an issue discussion, find a function g so that g(Body_i) = {u_s1, u_s2, ..., u_sm}, where u_si is an utterance within the dialog suggesting a potential solution. Therefore, the output of our approach is a set of issue-solution pairs. Ideally, users do not need other information (e.g., the utterances between them) to understand these pairs.

III. APPROACH

The construction of ISPY consists of four main steps, as illustrated in Fig. 2. The first step includes data preprocessing and dialog disentanglement using a feedforward model. The second step is to construct an utterance embedding layer, which embeds tokenized utterances into vectors with local window context. The third step is to construct a dialog embedding layer, which defines and extracts three sets of features characterizing the context of potential issues or solutions. The fourth step is the output layer, which predicts two outputs, i.e., the probability of an issue description and the probability of a solution, for the corresponding inputs. By feeding the dialog head and body into ISPY separately, we obtain two models: an issue model and a solution model. Finally, ISPY applies the issue model and the solution model to construct pairs.

A. Dialog Disentanglement

1) Data Preprocessing: For data preprocessing, we first follow the standard pipeline of stopword removal, typo correction, lowercase conversion, and lemmatization with spaCy [17].
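Before moving on, the Head/Body division formulated above can be made concrete with a minimal Python sketch (the data classes and function are our own illustration, not the ISPY implementation):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Utterance:
    timestamp: str   # when the message was posted
    dev_id: str      # developer who posted it
    message: str     # textual content

def split_head_body(utterances: List[Utterance]) -> Tuple[str, List[Utterance]]:
    """Split a disentangled dialog into Head and Body: Head is the
    concatenation of all utterances the initiator posts before the
    first reply from any other developer; Body is the rest."""
    initiator = utterances[0].dev_id
    head, body = [], list(utterances)
    while body and body[0].dev_id == initiator:
        head.append(body.pop(0))
    u_h = " ".join(u.message for u in head)
    return u_h, body
```

Everything the initiator posts before the first foreign reply is concatenated into u_h; the remaining utterances form the Body that the solution model later scans.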
Returning to preprocessing: due to the unique characteristics of live chat data, we employ additional techniques to handle special issues. Specifically, we first replace low-frequency tokens such as URLs, email addresses, code, HTML tags, and version numbers with the specific tokens [URL], [EMAIL], [CODE], [HTML], and [ID], respectively. Second, we replace acronyms with their full names by referring to the Oxford abbreviation library [18]. Following previous work [19], [20], we normalize emojis into standard ASCII strings. Finally, we combine consecutive utterances that are broken from one sentence, according to perplexity scores [21] calculated by Baidu AI Cloud [22]. Following the experience of a recent study [23], we use a perplexity score of 40 as the threshold: broken sentences whose combined perplexity is lower than 40 are merged.

2) Dialog Disentanglement Model: Utterances from a single conversation thread are usually interleaved with other ongoing conversations. In this step, we focus on dividing chat utterances into a set of distinct conversations, leveraging Kummerfeld et al.'s technique [24]. Their model is trained on 77,563 manually annotated utterances of disentangled dialogs from online chatting. It is a feedforward neural network with 2 layers, 512-dimensional hidden vectors, and softsign non-linearities. The input of the model is a 77-dimensional feature vector.

TABLE I (excerpt): Examples of heuristic attributes.
Attribute       | Description                                                              | Example
Keyword (5W1H)  | Occurrence of "What", "Why", "When", "Who", "Which", and "How"           | {what=1, why=0, when=0, who=0, which=0, how=0}
Punctuation     | Occurrence of "?" and "!"                                                | {"?"=1, "!"=0}
Greeting        | Occurrence of greeting words: "Hello", "Good Morning", "Hi Guys", etc.   | False
Disapproval     | Occurrence of disapproval words: "no", "can't work", "break down", etc.  | False
Mention         | Occurrence of "simi-" and "same"                                         | n/a

B. Utterance Embedding Layer

This layer aims to encode not only the textual information of utterances but also their contextual information.

Utterance Encoding. First, for all the utterances in one dialog D = [u_h, u_b1, u_b2, ..., u_bn], we encode each using a pre-trained BERT model [25], as BERT has proved successful in many natural language processing tasks [26], [27]. The BERT model is a bidirectional transformer trained with a combination of Masked Language Model and Next Sentence Prediction objectives. It is trained on English Wikipedia with nearly 2,500M words. The BERT embedding layer outputs u ∈ R^d, an 800-dimensional vector for each utterance.

Local Window Context. Second, we model the contextual information of an utterance through the concept of a local window, and use the size of the local window as a hyperparameter. Intuitively, the consecutive replies to an issue utterance may be very different from those to non-issue ones. Therefore, we construct a local window context to characterize the dynamic contextual information for extracting desired utterances in a dialog. Specifically, we use a fixed-length local window, and define the local window of utterance u_i as win_i by joining u_i with its k preceding and k following neighbor utterances; the fixed length is 2k+1. When the window runs out of bounds, we use zero padding [28] to maintain the fixed length. In this study, we choose k = 1 for the local window.

C. Dialog Embedding Layer

This layer aims to encode utterance features in a multifaceted way to more comprehensively represent the live chat context. To achieve this, we define and extract features from three categories: textual, heuristic, and contextual. The three extractors are detailed next.
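First, though, a short sketch of the local window construction described above (function name and dimensions are illustrative assumptions, not the authors' code):

```python
import numpy as np

def local_windows(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Build a (2k+1)-utterance window around each utterance embedding,
    zero-padding at the dialog boundaries (k=1 as in the paper)."""
    n, d = embeddings.shape
    padded = np.vstack([np.zeros((k, d)), embeddings, np.zeros((k, d))])
    # window i covers utterances i-k .. i+k in the original indexing
    return np.stack([padded[i:i + 2 * k + 1] for i in range(n)])

# Example: a dialog of 4 utterances with 800-dim BERT embeddings
windows = local_windows(np.random.rand(4, 800), k=1)
print(windows.shape)  # (4, 3, 800)
```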
1) Textual Feature Extractor: To learn basic textual features for each utterance, we first represent utterances using TextCNN [29], a classical method for sentence modeling that uses a shallow convolutional neural network (CNN) [30] to model sentence representations. It is advantageous when labeled data is insufficient, since it employs a concise network structure and a small number of parameters. TextCNN uses several convolution kernels to capture local information as the receptive field; the global representation is then produced from the local information. Given a kernel w ∈ R^h with kernel size h and a word embedding x = u_i, one convolution feature γ_t is generated from x_{t:t+h−1}:

γ_t = ReLU(w · x_{t:t+h−1} + b),

where b ∈ R is the bias parameter and ReLU is the activation function. We concatenate all the γ_t as a feature map:

γ = [γ_1, γ_2, ..., γ_{n−h+1}],

where γ ∈ R^{n−h+1}. We use a max-pooling strategy to calculate γ̂ = max(γ). We set the number of kernels to m, and feed u_i into three convolution-pooling layers. The kernel numbers of the three layers are 1024, 512, and 256, respectively. The output of the final layer, Γ ∈ R^m, is a 256-dimensional textual feature vector.

2) Heuristic Attribute Extractor: The heuristic attribute extractor aims to augment the dialog embedding results by incorporating high-level semantic attributes from five aspects: Keyword, Structure, Topic, Sentiment, and Role, as elaborated in Table I. (1) Keyword: the occurrence of indicative words or characters, e.g., 5W1H words and punctuation. (2) Structure: the structural characteristics of utterances in a dialog, such as the number of tokens and the positions of the utterances. (3) Topic: the intuition of this attribute is to distinguish off-topic utterances. We first calculate TF-IDF [31] for each unique word in the entire chat. Then we extract the top-10 most frequent words and combine them as a 10-dimensional topic vector TD_c. Similarly, we also extract the top-10 most frequent words from the dialog head (TD_h) and from the given utterance (TD_u). Finally, we calculate the Euclidean distances [32] among these vectors (e.g., ||TD_h − TD_u||_2) as the topic deviation. (4) Sentiment: the sentiment of the given utterance in terms of positive, intermediate, and negative polarity. (5) Role: the role of the participant who posts the utterance. By concatenating the above heuristic attributes, this extractor outputs a 29-dimensional vector.

3) Contextual Feature Extractor: The contextual feature extractor aims to embed the contextual information of each utterance. We use local attention [33] to represent the context. The local attention mechanism focuses on the impact of the neighbor utterances located in the same window, and represents the semantic context at low time and memory cost. An attention function can be described as mapping a query and a set of key-value pairs to an output [34], i.e., a triple (h_Q, h_K, h_V). The function uses the query vector h_Q, calculated from the given utterance u_i ∈ R^d, to compute attention scores against the key vector h_K. The key vector h_K is calculated from each utterance u_s ∈ R^d within the local window (i − k ≤ s ≤ i + k). The attention output is calculated by weighting the value vectors h_V with the attention scores and summing them.
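A minimal NumPy sketch of this local attention, anticipating the formal definitions below (random matrices stand in for the trained W_Q, W_K, W_V, and the window is clipped rather than zero-padded at dialog boundaries for brevity):

```python
import numpy as np

def local_attention(U, i, k=1, delta=128, sigma=1.0, seed=0):
    """Dot-product scores weighted by a Gaussian distance over the
    window [i-k, i+k], softmax-normalized, then used to weight the
    value vectors. U holds one utterance embedding per row."""
    rng = np.random.default_rng(seed)
    d = U.shape[1]
    W_Q, W_K, W_V = (0.01 * rng.standard_normal((delta, d)) for _ in range(3))
    h_Q = W_Q @ U[i]
    window = range(max(0, i - k), min(len(U), i + k + 1))
    scores = np.array([(h_Q @ (W_K @ U[s]))
                       * np.exp(-((s - i) ** 2) / (2 * sigma ** 2))
                       for s in window])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                       # softmax normalization
    values = np.array([W_V @ U[s] for s in window])
    return alpha @ values                      # c_i in R^delta

c_i = local_attention(np.random.rand(5, 800), i=2)  # 128-dim context vector
```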
Formally, we define the trainable query matrix W_Q ∈ R^{δ×d}, key matrix W_K ∈ R^{δ×d}, and value matrix W_V ∈ R^{δ×d}, and calculate the triple (h_Q, h_K, h_V):

h_Q = W_Q u_i,   h_K = W_K u_s,   h_V = W_V u_s,

where the output contextual weight vector c_i ∈ R^δ is calculated as:

c_i = Σ_{s=i−k..i+k} α_s h_V(s),   α_s = softmax_s( (h_Q · h_K(s)) · exp(−(s − i)^2 / (2σ^2)) ).   (9)

Equation (9) shows the three steps of the local attention calculation: (1) compute the attention scores score(h_Q, h_K) as the dot product multiplied by the Gaussian distance between s and i; (2) normalize the score vector with softmax; and (3) apply the normalized scores to weight and sum the values. We set d = 800 and δ = 128, and obtain a 128-dimensional context vector.

D. Output Layer

Finally, we concatenate the vectors output by the three extractors into a 413-dimensional feature vector u′. We feed the feature vector u′ into two fully-connected (FC) layers, and use two softmax functions to calculate the probability of an issue-description utterance and of the corresponding solution utterances, respectively:

P(I|u_h) = softmax(FC_I(u′_h)),   P(S|u_bi) = softmax(FC_S(u′_bi)),

where P(I|u_h) is the predicted probability of an issue-description utterance, and P(S|u_bi) (u_bi ∈ {u_b1, ..., u_bn}) is the predicted probability of a solution utterance. Cross-entropy loss is applied to both tasks to measure the difference between ground truth and prediction. The two loss functions are defined as Loss_I and Loss_S:

Loss_I = −[ y_h log P(I|u_h) + (1 − y_h) log(1 − P(I|u_h)) ],
Loss_S = −Σ_i [ y_i log P(S|u_bi) + (1 − y_i) log(1 − P(S|u_bi)) ],

where y_h and y_i indicate the ground-truth labels of the utterances. The issue model and the solution model are trained separately until convergence. When fully trained, for a given chat log, ISPY automates the three sub-tasks formulated in Section II: first, it performs dialog disentanglement; second, for each disentangled dialog, it uses the issue model to predict whether the dialog head is an issue description; and finally, if the issue model predicts positive, it uses the trained solution model to predict which utterances should be selected into the solution. As the final output, it combines the predicted utterances as the corresponding solution to the issue.

IV. EXPERIMENT SETUP

To evaluate the proposed ISPY approach, our evaluation specifically addresses three research questions:
RQ1: What is the performance of ISPY in detecting issue dialogs from live chat data?
RQ2: What is the performance of ISPY in extracting solutions for a given issue?
RQ3: How does each individual component in ISPY contribute to the overall performance?

A. Data Preparation

1) Studied Communities: Many OSS communities utilize Gitter [35] or Slack [36] as their live communication means. Considering its popular, open, and free-access nature, we select the studied communities from Gitter. To identify them, we select the Top-1 most participated community from each of eight active domains, covering front-end framework, mobile, data science, DevOps, blockchain platform, collaboration, web app, and programming language. Then, we collect the daily chat utterances from these communities. Gitter provides a REST API [37] to get data about chatting rooms and posted utterances. In this study, we use the REST API to acquire the chat utterances of the eight selected communities; the retrieved dataset contains all utterances as of 2020-12.

2) Bootstrap Sampling: After dialog disentanglement, the number of separate chat dialogs is large. Limited by the human resources available for labeling, we randomly sample 100 dialogs from each community.
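For reference, collecting utterances through Gitter's REST API can be sketched as follows (the room id and token are placeholders; pagination and error handling are minimal, and endpoint details should be checked against Gitter's API documentation):

```python
from typing import Optional
import requests

GITTER_API = "https://api.gitter.im/v1"

def fetch_messages(room_id: str, token: str,
                   before_id: Optional[str] = None) -> list:
    """Fetch up to 100 chat messages from a Gitter room, optionally
    paginating backwards with the beforeId cursor."""
    params = {"limit": 100}
    if before_id:
        params["beforeId"] = before_id
    resp = requests.get(
        f"{GITTER_API}/rooms/{room_id}/chatMessages",
        headers={"Authorization": f"Bearer {token}"},
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # each item carries an id, sender, timestamp, text
```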
From these samples, we then exclude unreadable dialogs: 1) dialogs written in non-English languages; 2) dialogs that contain too much code or stack traces; 3) low-quality dialogs, such as dialogs with many typos and grammatical errors; and 4) dialogs that involve channel robots. However, the dataset is imbalanced: non-issue dialogs far outnumber issue dialogs, as shown in Table II. Therefore, we apply a bootstrap sampling strategy [38] for data balancing, randomly sampling issue dialogs with replacement until the numbers of issue dialogs and non-issue dialogs are balanced.

3) Ground-truth Labeling: For each sampled dialog, we first manually label whether its head discusses a certain issue. Then, for each issue dialog, we label the utterances that should be included in the solution. The labeled results are used as the ground-truth dataset for performance evaluation. To guarantee the correctness of the labeling results, we built an inspection team consisting of four Ph.D. candidates. All of them are fluent English speakers, and have either done intensive research work on software development or actively contributed to open-source projects. We divided the team into two groups; the labeled results from each group were reviewed by the other. When a labeled result received different opinions, we hosted a discussion with all team members and decided by voting. The average Cohen's Kappa on issue-dialog labeling is 0.85, and the average Cohen's Kappa on solution-utterance labeling is 0.83. In total, we collected 173,278 dialogs from eight open-source communities, and spent 720 person-hours annotating 750 dialogs, including 171 issue-solution pairs. Table II presents the details of our dataset: the number of participants (Par.), dialogs (Dial.), and utterances (Utter.) for the entire population, as well as the numbers of issue and non-issue dialogs with the corresponding utterances for the sample population. Moreover, to contribute to the eight communities, we apply ISPY to the 173,278 dialogs, and extract and publish 21K issue-solution pairs on our website.

The first two RQs require comparing ISPY with state-of-the-art baselines. Due to the slightly different focuses of RQ1 and RQ2, we employ three common baselines applicable to both, as well as three additional baselines for each RQ, leading to a total of six baselines per RQ. Common baselines applicable to RQ1 and RQ2: three commonly used machine-learning baselines are utilized to comprehensively examine the classification performance, i.e., Naive Bayes (NB) [39], Random Forest (RF) [40], and Gradient Boosting Decision Tree (GBDT) [41]. Additional baselines for detecting issues (RQ1): Casper [42] is a method for extracting and synthesizing user-reported mini-stories regarding app problems from reviews. We use utterances as the extracted events, and treat its second step, i.e., classifying problems, as a baseline; we use the implementation provided by the original paper [43]. CNC_PD [44] is the state-of-the-art learning technique for classifying sentences in comments taken from online issue reports. The authors proposed a CNN-based [45] approach to classify sentences into seven categories of intentions: Feature Request, Solution Proposal, Problem Discovery, etc. We treat the CNN classifier that predicts utterances in the Problem Discovery category as a baseline for detecting issues. DECA_PD [46] is the state-of-the-art rule-based technique for analyzing development email content.
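As an aside, the bootstrap balancing applied during data preparation reduces to a few lines (our own sketch, assuming dialogs are held in Python lists):

```python
import random

def balance_by_bootstrap(issue_dialogs: list, non_issue_dialogs: list,
                         seed: int = 42) -> tuple:
    """Resample the minority class (issue dialogs) with replacement
    until both classes have the same size."""
    random.seed(seed)
    resampled = list(issue_dialogs)
    while len(resampled) < len(non_issue_dialogs):
        resampled.append(random.choice(issue_dialogs))
    return resampled, non_issue_dialogs
```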
DECA_PD is used to classify the sentences of emails into problem discovery, solution proposal, information giving, etc., by using linguistic rules. We use the six linguistic rules [47] for identifying "problem discovery" dialog heads as our baseline. Additional baselines for extracting solutions (RQ2): UIT [48] is a context-representative classifier that uses GloVe [49] to embed words and TextCNN to embed the utterance. UIT classifies utterances into 12 categories; specifically, we choose the "potential answer" classifier as a solution-extraction baseline. CNC_SP is the Solution Proposal classifier in [44]. DECA_SP is the set of 51 linguistic rules for identifying "solution proposal" sentences in [46].

We use three commonly-used metrics to evaluate performance: (1) Precision, the ratio of the number of correct predictions to the total number of predictions; (2) Recall, the ratio of the number of correct predictions to the total number of samples in the golden test set; and (3) F1, the harmonic mean of precision and recall. When comparing performance, we care most about F1, since it balances precision and recall. Note that, since the number of utterances may vary largely across dialogs, we calculate the performance of solution extraction at the community level. For all experiments, we apply cross-project evaluation on our dataset: we iteratively select one project as the test set and use the remaining seven projects for training. The experiment environment is a Windows 10 desktop computer with an NVIDIA GeForce RTX 2060 GPU, an Intel Core i7 CPU, and 32GB RAM. To answer RQ1, we train on the issue dataset with a batch size of 8. Each convolution and dense layer uses a dropout rate of 0.6 to avoid overfitting. We use the Adam optimizer with a learning rate of 0.001 and β1 = 0.9, and train ISPY for 100 epochs.

TABLE III (upper half): Issue detection performance (Precision/Recall/F1, %) on the eight studied communities (per-community columns in dataset order) and their average.
Method  | C1       | C2       | C3       | C4       | C5        | C6       | C7       | C8       | Avg.
ISPY    | 76/77/76 | 75/68/71 | 84/74/79 | 77/68/72 | 82/73/77  | 80/69/74 | 79/70/74 | 86/78/82 | 80/72/76
NB      | 36/40/38 | 41/30/35 | 47/36/41 | 70/56/62 | 08/25/13  | 22/42/29 | 30/50/37 | 15/40/22 | 34/40/36
RF      | 56/25/34 | 69/30/42 | 75/23/35 | 84/44/58 | 100/17/29 | 50/25/33 | 33/13/18 | 23/30/26 | 61/26/36
GBDT    | 27/75/40 | 40/70/51 | 50/79/61 | 73/44/55 | 21/76/33  | 19/67/29 | 30/88/44 | 18/90/30 | 35/65/46
Casper  | 39/35/37 | 08/03/05 | 59/26/36 | 46/40/43 | 19/42/26  | 14/17/15 | 05/06/06 | 15/40/22 | 26/26/26
CNC_PD  | 20/55/29 | 23/50/32 | 23/36/28 | 12/32/17 | 24/42/30  | 12/42/19 | 10/50/17 | 05/40/10 | 16/43/24
DECA_PD | 33/50/40 | 28/37/31 | 33/36/34 | 64/28/39 | 42/42/42  | 44/67/53 | 32/50/39 | 04/10/06 | 35/40/n/a

For RQ3, we compare ISPY with three variants: 1) ISPY-CNN, which removes the textual feature extractor from ISPY; 2) ISPY-Heu, which removes the heuristic attribute extractor; and 3) ISPY-LocalAttn, which removes the contextual feature extractor. The three variants use the same parameters during training.

V. RESULTS AND ANALYSIS

The upper half of Table III demonstrates the comparison between the performance of ISPY and that of the six baselines across data from eight OSS communities for the issue detection task. The columns correspond to Precision, Recall, and F1 score for each community. We then conduct a normality test and t-tests between each pair of methods. Overall, the data follow a normal distribution (p = 0.32; significance threshold p < 0.05), and ISPY significantly (p = 10^−20) outperforms the six baselines in terms of average Precision, Recall, and F1 score.
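The cross-project protocol behind these comparisons can be sketched as a leave-one-project-out loop (train_fn and predict_fn are placeholders for ISPY's training and inference, not the authors' code):

```python
from statistics import harmonic_mean

def cross_project_eval(projects: dict, train_fn, predict_fn) -> dict:
    """Each community serves once as the test set while the other
    seven are used for training; returns per-community (P, R, F1)."""
    results = {}
    for name, test_data in projects.items():
        train_data = [d for p, data in projects.items()
                      if p != name for d in data]
        model = train_fn(train_data)
        tp, fp, fn = predict_fn(model, test_data)  # confusion counts
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = harmonic_mean([precision, recall]) if precision and recall else 0.0
        results[name] = (precision, recall, f1)
    return results
```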
Specifically, when comparing with the best Precision performer among the six baselines, i.e., RF, ISPY improves the average precision by 19%. Similarly, ISPY improves on the best Recall performer, i.e., GBDT, by 7% in average recall, and on the best F1 performer, i.e., GBDT, by 30% in average F1 score. At the individual project level, ISPY achieves the best performance on most of the eight communities. These results indicate that ISPY can detect whether a dialog is discussing an issue more accurately than all comparison baselines. We believe the performance advantage of ISPY is mainly attributable to the rich representativeness of its internal construction, from two perspectives: (1) ISPY can accurately capture the semantic relationship between the issue description and its first reply by using the local window and local attention mechanisms. This enables it to learn more comprehensive contextual knowledge, e.g., what kind of first issue-reply indicates a dialog head containing an issue description, and therefore contributes to more accurate classification. (2) ISPY augments the textual vectors with high-level semantic information by employing 15 heuristic attributes. For example, three of the 15 attributes characterize sentiment. This design is based on the observation that issue descriptions are likely to contain negative tones such as "fail", "error", and "annoy". By calculating three types of polarity sentiment scores, as shown in Table I, the sentiment attributes can be fed into the deep learning network to help with issue-description detection.

Answering RQ1: ISPY outperforms the six baselines in detecting issue dialogs across most of the studied projects; the average Precision, Recall, and F1 are 80%, 72%, and 76%, respectively, improving on the best F1 baseline, GBDT, by 30% in average F1 score.

Similarly, the bottom part of Table III summarizes the comparison between the performance of ISPY and that of the six baselines across the eight OSS communities for the solution extraction task. We can see that ISPY achieves the highest performance in most of the columns and significantly (p = 10^−5) outperforms the six baselines. On average, although ISPY is slightly below GBDT in Recall (by 3%), it reaches the highest F1 score (63%), improving on the best baseline, RF, by 20%. It also reaches the highest precision (68%), significantly higher than the other baselines (which range from 17% to 37%). These results imply that ISPY can effectively extract utterances as corresponding solutions from development dialogs: (1) our approach is sensitive in identifying solutions, including those spanning consecutive utterances.

[Figure: Issue P, Issue R, and Issue F1 for ISPY and its variants.]

... pairs from online forums based on an SVM classifier and content-quality ranking. Cong et al. [68] proposed a sequential pattern-based classification method to detect questions in a forum thread, and a graph-based propagation method to detect answers for those questions in the same thread. Since previous studies extracted only questions in interrogative form, Kwong et al. [69] extended the scope of question and answer detection and pairing to also encompass questions in imperative and declarative forms. Henß et al. [70] presented an approach to automatically extract FAQs from software development mailing lists.
These approaches utilize the characteristics of their corpora and are best fit for their specific tasks, but they are limited to those corpora and tasks, so their methods cannot be directly transferred to the task of extracting issue-solution pairs from community chats.

IX. CONCLUSION

In this paper, we propose an approach, named ISPY, to automatically extract issue-solution pairs from development community live chats. ISPY leverages a novel convolutional neural network, incorporating a basic CNN network with 15 heuristic attributes and a local attention mechanism to handle the characteristics of this task. We build a dataset with 750 dialogs, including 171 issue-solution pairs, and evaluate ISPY on it. The evaluation results show that our approach outperforms both the issue-detection baselines and the solution-extraction baselines by substantial margins. By applying ISPY, we also automatically generate a dataset with over 30K issue-solution pairs extracted from 11 community live chats, and we utilize the dataset to provide solutions for 26 recent issues posted on Stack Overflow.

REFERENCES
- Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineering Tools
- Finding Help with Programming Errors: An Exploratory Study of Novice Software Engineers' Focus in Stack Overflow Posts
- Why Developers Are Slacking Off: Understanding How Software Teams Use Slack
- On the Use of Internet Relay Chat (IRC) Meetings by Developers of the GNOME GTK+ Project
- Studying the Use of Developer IRC Meetings in Open Source Projects
- 'How Was Your Weekend?' Software Development Teams Working From Home During COVID-19
- An Empirical Study of Developer Discussions in the Gitter Platform
- A First Look at Developers' Live Chat on Gitter
- Rationale in Development Chat Messages: An Exploratory Study
- How do Developers Discuss Rationale?
- DialBERT: A Hierarchical Pre-Trained Model for Conversation Disentanglement
- End-to-End Transition-Based Online Dialogue Disentanglement
- A Large-Scale Corpus for Conversation Disentanglement
- Chat Disentanglement: Identifying Semantic Reply Relationships with Random Forests and Recurrent Neural Networks
- Online Conversation Disentanglement with Pointer Networks
- Explosion
- Abbreviations - Oxford English Dictionary
- Emojis influence emotional communication, social attributions, and information processing
- Emoji helps! A multi-modal siamese architecture for tweet user verification
- Perplexity of n-Gram and Dependency Language Models
- Exploring the Limits of Language Modeling
- A Large-Scale Corpus for Conversation Disentanglement
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Cost-Sensitive BERT for Generalisable Sentence Classification on Imbalanced Data
- Transfer Learning in Biomedical Named Entity Recognition: An Evaluation of BERT in the PharmaCoNER task
- User Intent Prediction in Information-seeking Conversations
- Convolutional Neural Networks for Sentence Classification
- ImageNet Classification with Deep Convolutional Neural Networks
- Term-weighting Approaches in Automatic Text Retrieval
- From Word Embeddings to Document Distances
- Neural Machine Translation by Jointly Learning to Align and Translate
- Attention is All You Need
- REST API
- Machine Learning, a Probabilistic Perspective
- A Comparison of Event Models for Naive Bayes Text Classification
- Classification and Regression by randomForest
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- Caspar: Extracting and Synthesizing User Stories of Problems from App Reviews
- Stories in App Reviews
- Automating Intention Mining
- Neural Networks: A Comprehensive Foundation
- Development Emails Content Analyzer: Intention Mining in Developer Discussions (T)
- UZH-s.e.a.l. - Development Emails Content Analyzer (DECA)
- Analyzing and Characterizing User Intent in Information-seeking Conversations
- GloVe: Global Vectors for Word Representation
- Technical Q&A Site Answer Recommendation via Question Boosting
- Multi-objective Code Reviewer Recommendations: Balancing Expertise, Availability and Collaborations
- Automatically Recommending Peer Reviewers in Modern Code Review
- Considering Dependencies between Bug Reports to Improve Bugs Triage
- Automatic Bug Triage in Software Systems Using Graph Neighborhood Relations for Feature Augmentation
- Handbook on Ontologies
- A Semantic Web Primer
- ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding
- Software-related Slack Chats with Disentangled Conversations
- Detection of Hidden Feature Requests from Massive Chat Messages via Deep Siamese Network
- Finding Question-answer Pairs from Online Forums
- Detecting User Story Information in Developer-client Conversations to Generate Extractive Summaries
- Mining StackOverflow to Filter out Off-topic IRC Discussion
- Mining User Opinions in Mobile App Reviews: A Keyword-Based Approach (T)
- Phrase-based Extraction of User Opinions in Mobile App Reviews
- Online App Review Analysis for Identifying Emerging Issues
- Emerging App Issue Identification from User Feedback: Experience on WeChat
- Detection of Question-Answer Pairs in Email Conversations
- Extracting Chatbot Knowledge from Online Discussion Forums
- Detection of Imperative and Declarative Question-answer Pairs in Email Conversations
- Semi-automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge

ACKNOWLEDGMENT
We deeply appreciate the anonymous reviewers for their constructive and insightful suggestions toward improving this manuscript. This work is supported by the National Key