A First Look at Developers' Live Chat on Gitter
Lin Shi, Xiao Chen, Ye Yang, Hanzhi Jiang, Ziyou Jiang, Nan Niu, Qing Wang
2021-07-13

Modern communication platforms such as Gitter and Slack play an increasingly critical role in supporting software teamwork, especially in open source development. Conversations on such platforms often contain intensive, valuable information that may be used for better understanding OSS developer communication and collaboration. However, little work has been done in this regard. To bridge the gap, this paper reports a first comprehensive empirical study on developers' live chat, investigating when they interact, what community structures look like, which topics are discussed, and how they interact. We manually analyze 749 dialogs in the first phase, followed by an automated analysis of over 173K dialogs in the second phase. We find that developers tend to converse more often on weekdays, especially on Wednesdays and Thursdays (UTC); that three common community structures are observed; that developers tend to discuss topics such as API usage and errors; and that six dialog interaction patterns are identified in the live chat communities. Based on the findings, we provide recommendations for individual developers and OSS communities, highlight desired features for platform vendors, and shed light on future research directions. We believe that the findings and insights will enable a better understanding of developers' live chat, pave the way for other researchers, and support better utilization and mining of the knowledge embedded in the massive chat history.

More than ever, online communication platforms, such as Gitter, Slack, Microsoft Teams, Google Hangout, and Freenode, play a fundamental role in team communications and collaboration.
As a form of synchronous textual communication among a community of developers, live chat allows developers to receive real-time responses from others, replacing asynchronous communication such as email in some cases [40, 58, 59]. This is especially true for open source projects contributed to by globally distributed developers, as well as for many companies whose developers work from home due to the COVID-19 pandemic. Conversations from online communication platforms contain rich information for studying developer behaviors. Figure 1 exemplifies a slice of a live chat log from the Deeplearning4j Gitter community. Each utterance consists of a timestamp, a developer ID, and a textual message. In addition, two dialogs are embedded in the chat log: the first reports an issue about 'earlystop', and the second asks for documentation support. Valuable information, such as when OSS developers interact, what community structures look like, which topics are discussed, and how OSS developers interact, can thus be derived from the massive live chat data. Such information is important for learning from productive and effective communication styles, improving existing live chat platforms, and guiding research on promoting efficient and effective OSS collaboration. Although a few empirical studies have begun to advocate the usefulness of these conversations for understanding developer behaviors [40, 59, 67], little work focuses on how and what developers communicate in live chat. The most closely related research, by Shihab et al. [58], analyzes the content, participants, and styles of communication on Internet Relay Chat. However, their subjects are IRC meeting logs, which differ in many respects from developers' live chat conversations. Another thread of related work, by Parra et al. [49] and Chatterjee et al.
[15] presents two datasets of open source developer communications, on Gitter and Slack respectively, with the purpose of highlighting that live developer communications are an untapped information resource. This motivates our study to derive a deeper understanding of the nature of developer communications in open-source software. In this paper, we conduct a first comprehensive empirical study on developers' live chat on Gitter, investigating four characteristics: when developers interact (communication profile), what community structures look like (community structure), which topics are discussed (discussion topic), and how developers interact (interaction pattern). To that end, we first collect large-scale developer daily chat data from eight popular communities. Then we manually disentangle 749 dialogs, and select the best disentanglement model from four state-of-the-art models according to their evaluation results on the 749 dialogs. After automatic disentanglement, we perform an empirical study on live chat aiming to reveal the four characteristics: communication profile, community structure, dialog topic, and interaction pattern. In total, we study 173,278 dialogs and 1,402,894 utterances, contributed by 95,416 users from eight open source communities. The main results include: (1) developers are more likely to chat on workdays than weekends, especially on Wednesdays and Thursdays (UTC); (2) three community structures are observed in OSS live chat: the Polaris network, the Constellation network, and the Galaxy network; (3) the top three topics that developers frequently discuss in live chat are API usage, errors, and background information; and (4) six dialog interaction patterns are identified in live chat: exploring solutions, clarifying answer, clarifying question, direct/discussed answer, self-answered monologue, and unanswered monologue. The major contributions of this paper are as follows.
• We conduct a first large-scale analysis of developers' live chat messages, providing empirically based quantitative and qualitative results toward a better understanding of developer communication profiles, community structures, discussion topics, and interaction patterns. • We provide practical insights on productive dialogs for individual developers and OSS communities, highlight desired features for platform vendors, and shed light on future directions for researchers. • We provide a large-scale dataset 1 of live chat to facilitate the replication of our study and future applications. 1 https://github.com/LiveChat2021/LiveChat#5-download In the remainder of the paper, Section II illustrates the background. Section III presents the study design. Section IV describes the results and analysis. Section V discusses the results and threats to validity. Section VI introduces the related work. Section VII concludes our work. This section describes related key concepts and technologies. Many OSS communities utilize Gitter [31] or Slack [32] as their live communication medium. In particular, Gitter is currently the most popular online communication platform [37], since it provides open access to public chat rooms and free access to historical data [49]. Considering this popular, open, and free-access nature, we conduct this study based on Gitter 2 . Communities in Gitter usually have multiple chat rooms: one general room and several specific-topic rooms. Typically, the general room contains most of the participants, and in this study we focus only on the general rooms. In total, there are 2,171 publicly accessible communities in Gitter, with 733,535 participants in total as of Nov. 20, 2020. Three main concepts about Gitter chat logs are of concern in this study: chat log, utterance, and dialog. In Gitter, developer conversations in one chat room are recorded in a chat log.
As illustrated in Figure 1, a typical live chat log contains a sequential set of utterances in chronological order. Each utterance consists of a timestamp, a developer ID, and a textual message initiating a question or responding to an earlier message. A chat log typically contains a large number of utterances, and at any given time, multiple consecutive utterances may be responding to different threads of dialog discussion. This interleaving of utterances leads to entangled dialogs, as illustrated by the two colors in Figure 1: the two colors highlight utterances belonging to two different dialogs, and the links between utterances indicate responding relationships. Unlike many other sources of software-development-related communication, the information on online communication platforms is shared in an unstructured, informal, and interleaved manner. Thus, analyzing live chat is quite challenging due to the following barriers. (1) Entangled dialogs. Utterances in chat logs form a stream of information in which dialogs are often entangled, i.e., a single conversation is interleaved with other dialogs, as shown in Figure 1. It is difficult to perform any kind of high-level dialog analysis without first dividing utterances into a set of distinct dialogs. (2) Expensive human effort. Chat logs are typically high-volume and contain informal dialogs covering a wide range of technical and complex topics. Analyzing these dialogs requires experienced analysts to spend a large amount of time in order to understand the dialogs thoroughly, so conducting a comprehensive study on developers' live chat is very expensive. (3) Noisy data. There exist noisy utterances, such as duplicate and unreadable messages, in chat logs that do not provide any valuable information. The noisy data poses difficulty in analyzing and interpreting the communicative dialogs.

Figure 2: Overview of research methodology
Next, we introduce several existing techniques that can automatically disentangle dialogs. Four state-of-the-art techniques have been proposed in the natural language processing area to address the entangled-dialog challenge: (1) the BiLSTM model [33] predicts whether there exists an edge between two utterances, where an edge means one utterance is a response to another. It employs a bidirectional recurrent neural network with a maximum context size of 160 and a single hidden layer of 200 neurons; the input is a sequence of 512-dimensional word vectors. (2) The BERT model [19] predicts the probability of an utterance's clustering label given the preceding context utterances and their labels. It utilizes the Masked Language Model and Next Sentence Prediction [19] to encode the input utterances, with an embedding size of 512 and 256 hidden units. (3) The E2E model [41] uses a dialog session-state encoder to predict dialog clusters, with an embedding size of 512, 256 hidden neurons, and a noise ratio of 0.05. (4) The FF model [39] is a feed-forward neural network with two layers, 256-dimensional hidden vectors, and softsign non-linearities; its input is a 77-dimensional numerical feature vector extracted from the utterance texts, which includes TF-IDF, user name, time interval, whether two utterances contain the same words, etc. In addition, four clustering metrics are widely used for dialog disentanglement (DD) evaluation: Normalized Mutual Information (NMI) [61], Adjusted Rand Index (ARI) [54], Shen-F value [56], and F1 score [18]. This study aims to investigate four research questions. RQ1 (Communication Profile): Do Gitter communities demonstrate consistent community communication profiles? This research question aims at examining common communication profiles across the eight communities, particularly the frequent time frames during which developers are active and the typical time interval of a dialog.
RQ2 (Community Structure): What are the structural characteristics of social networks built from developer live chat data? To understand the community characteristics of live chat networks, we perform social network analysis on live-chat utterances, in which each developer is treated as a node, with edges defined between pairs of developers co-occurring in one or more dialogs. RQ3 (Dialog Topic): What are the primary topic types frequently discussed by developers in live chat? This research question is designed to identify discussion topics in developers' live chat. There have been studies analyzing discussion topics in open forums [4], emails [34], and posts on Stack Overflow [9, 35], but it remains unknown what developers talk about in live chat. This study aims at filling that gap and providing a complementary perspective using live chat as a new data source. RQ4 (Interaction Pattern): How do developers typically interact with each other in live chat? This research question intends to uncover underlying interaction patterns that signify how developers typically interact with one another throughout a dialog's life cycle (e.g., initiating discussions, responding to questions, and social chatting). The research methodology follows two phases, as illustrated in Figure 2. First, in the data preparation phase, large-scale developer daily chat utterance data are collected from eight active communities, and the raw chat utterances are processed and transformed into associated dialogs using two approaches: manual screening on a randomly sampled small dataset, and automated analysis employing the identified best DD model on the whole dataset. Second, in the empirical analysis phase, further analyses are designed and conducted on these two datasets in order to investigate the characteristics of developer live chat data and develop a better understanding with respect to the research questions. Studied communities.
To identify the studied communities, we select the most-participated (Top-1) community from each of eight active domains, covering front-end framework, mobile, data science, DevOps, blockchain platform, collaboration, web app, and programming language. Then, we collect the daily chat utterances from these communities. Gitter provides a REST API [28] for retrieving data about chat rooms and posted utterances. In this study, we use the REST API to acquire the chat utterances of the eight selected communities; the retrieved dataset contains all utterances as of 2020-11-20. Detailed statistics are shown in Table 1, where P refers to the number of participants, D to the number of dialogs, and U to the number of utterances. The total number of participants across the eight communities is 95,416, accounting for 13% of the total population on Gitter. Thus, we consider the eight communities representative of the Gitter platform. Preprocessing the textual utterances. We first normalize non-ASCII characters, such as emojis, to standard ASCII strings. Some low-frequency tokens contribute little to the analysis of live chat, such as URLs, email addresses, code, HTML tags, and version numbers; we replace each of them with a corresponding special placeholder token. We utilize spaCy [2] to tokenize sentences into terms, and perform lemmatization and lowercasing on terms with spaCy to alleviate the influence of word morphology. Manual labeling of dialog disentanglement. We employ a 3-step manual process to generate a sample dialog dataset for further analysis. First, we randomly sample 100 utterances from each community's live chat log, with the intention of tracing the dialogs associated with each of the 100 utterances. This step yields a total of 800 utterances from the entire 1,402,894 utterances of the eight communities.
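The placeholder-substitution step described above can be sketched with plain regular expressions. The study itself uses spaCy; the patterns and token names below are illustrative assumptions, not the study's exact ones:

```python
import re

# Illustrative placeholder substitution for low-frequency tokens.
# The exact token names used in the study are assumptions here.
PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"`[^`]+`"), "[CODE]"),
    (re.compile(r"<[^>]+>"), "[HTML]"),
    (re.compile(r"\bv?\d+\.\d+(\.\d+)*\b"), "[VERSION]"),
]

def normalize(utterance: str) -> str:
    """Replace URLs, emails, inline code, HTML tags, and version numbers."""
    for pattern, token in PATTERNS:
        utterance = pattern.sub(token, utterance)
    return utterance

print(normalize("See https://docs.gitter.im and email me@example.com for v1.2.3"))
```

Order matters: URLs are replaced before emails and versions so that the later, looser patterns cannot match inside an already-replaced span.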
Next, using each utterance as a seed, we iteratively identify its preceding and succeeding utterances so that we can group related utterances into the same dialog as completely as possible. Specifically, for each utterance, we determine its context by examining the consecutive chats in the chat log, and manually link it to related utterances belonging to the same dialog. The next step is cleaning. Specifically, we exclude unreadable dialogs: (1) dialogs written in non-English languages; (2) dialogs containing too much code or too many stack traces; (3) low-quality dialogs, such as dialogs with many typos and grammatical errors; and (4) dialogs involving channel robots. After these steps, we include an additional six thousand utterances associated with the initial 800. This leads to a total of 7,226 utterances, manually disentangled into 749 dialogs, as summarized in Table 1. Note that removing bot-involved dialogs has little impact on our results. First, the bot-involved dialogs are relatively small in volume: only one of the eight projects utilizes bots, and only nine out of the 800 sampled dialogs are excluded due to bot involvement. Second, we observed that the bot-generated utterances are rather trivial, such as greetings, links to general guidelines, and status updates. To ensure the correctness of the disentanglement results, we put together a labeling team consisting of one senior researcher and six Ph.D. students. All of them are fluent in English, and all have either done intensive research on software development or actively contributed to open-source projects. The senior researcher trained the six Ph.D. students on how to disentangle dialogs and provided consultation during the process. The disentanglement results from each Ph.D. student were reviewed by the others, and we only accepted and included dialogs in our dataset when they received full agreement.
When a dialog received different disentanglement results, we hosted a discussion with all team members and decided through voting. The average Cohen's Kappa for dialog disentanglement is 0.81. Automated dialog disentanglement. To analyze the dialogs at a large scale, we experiment with the four state-of-the-art DD approaches introduced in Section 2.2 (i.e., the BiLSTM, BERT, E2E, and FF models). Specifically, we use the manually disentangled sample data from the previous step as ground truth, and compare the models to select the best DD model for further analysis in this study. The comparison results from our experiments show that the FF approach significantly outperforms the others at disentangling developer live chat, achieving the highest scores on all the metrics; its average scores of NMI, Shen-F, F1, and ARI are 0.74, 0.81, 0.47, and 0.57, respectively 3 . Finally, we use the best-performing FF model to disentangle all 1,402,894 utterances in the chat logs, obtaining 173,278 dialogs in total. As good communication habits suggest more productive development practices, we intend to reveal the temporal communication profiles of developers, including when developers are active and how quickly respondents reply to dialog initiators. First, we collect the utterance times of the entire population and analyze the peak hours and peak days. Then we calculate the response time lag as T_lag = t_reply − t_init, where t_init is the time at which the initiator launched the dialog and t_reply is the time at which the first respondent replied. We automatically calculate the utterance times and response time lags for all 173,278 dialogs. We also aim to visualize the social networks of developers in live chat and summarize the common structures. Social network analysis (SNA) describes relationships among social entities, as well as the structures and implications of their connections [65].
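The response-time-lag computation described above can be sketched as follows (the dialog data are hypothetical; each dialog is an ordered list of timestamped utterances):

```python
from datetime import datetime

def response_lag_seconds(dialog):
    """Time between the initiator's first utterance and the first reply.

    `dialog` is a chronologically ordered list of (timestamp, author) pairs;
    returns None for unanswered monologues.
    """
    (t0, initiator) = dialog[0]
    for t, author in dialog[1:]:
        if author != initiator:
            return (t - t0).total_seconds()
    return None

ts = datetime.fromisoformat
dialog = [
    (ts("2020-11-20 10:00:00"), "alice"),
    (ts("2020-11-20 10:00:05"), "alice"),  # initiator adds detail
    (ts("2020-11-20 10:00:23"), "bob"),    # first respondent
]
print(response_lag_seconds(dialog))  # 23.0
```

Note that follow-up utterances by the initiator are skipped, so the lag measures the first reply from a different developer.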
To study relationships among developers within one OSS community, we generate the social networks according to the following definition: G = (V, E), where each node in V is a developer in the chat room, and each edge in E links a dialog initiator with one of that dialog's respondents. Specifically, for each disentangled dialog, we first identify its initiator and all its respondents: the initiator is the developer who launches the dialog, and the respondents are the other developers who participate in it. Then we add a link between the initiator and each respondent. (Note that RQ2 focuses on exploring responding behaviors, i.e., the interaction between initiators and respondents; thus the links/edges are defined only between these two roles. We explore the interaction relationships among initiators and all responders in RQ4, which focuses on discussion behaviors.) Finally, we employ an unweighted graph when constructing the social networks for visualizing the relationships of all the developers in live chat. The social network can exhibit the connectivity and density of the open-source community. We build the social networks from all 173,278 dialogs using the graph tool Gephi [6]. To understand the topology of the eight social networks, we report SNA measures that have been widely used by previous studies [45, 55]. Note that we excluded developers who never received replies (a.k.a. haircut nodes) when calculating the SNA measures, following previous work [11, 62]. Nonetheless, the predefined category is not meant to be comprehensive; thus, we employ a hybrid card sort process [22] to manually determine the topics of dialogs. In a hybrid card sort, the sorting begins with Beyer et al.'s predefined categories, and participants can create their own as well. Each newly created topic is instantly added to the topic set and can then be used by other participants.
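The initiator-respondent network construction described above can be sketched in pure Python; the dialog data are hypothetical, and computing measures such as betweenness would in practice use a tool like Gephi:

```python
def build_network(dialogs):
    """Build an unweighted initiator-respondent graph.

    Each dialog is (initiator, [respondents]); nodes are developers,
    and edges link the initiator to every respondent.
    """
    nodes, edges = set(), set()
    for initiator, respondents in dialogs:
        nodes.add(initiator)
        for r in respondents:
            nodes.add(r)
            if r != initiator:
                edges.add(frozenset((initiator, r)))
    return nodes, edges

dialogs = [
    ("alice", ["bob", "carol"]),  # alice asks; bob and carol reply
    ("dave",  ["bob"]),
    ("erin",  []),                # unanswered monologue -> a "haircut" node
]
nodes, edges = build_network(dialogs)
avg_degree = 2 * len(edges) / len(nodes)
print(len(nodes), len(edges), avg_degree)  # 5 3 1.2
```

In this toy graph, "erin" ends up with degree zero, which is exactly the kind of never-replied-to node the study excludes before computing SNA measures.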
The participants are the same team that manually disentangled the dialogs, and the labeling process is similar to the manual dialog disentanglement introduced in Section 3.2. Specifically, the sorting process is conducted in one round, with a concluding discussion session to resolve disagreements in labels based on majority voting. The average Cohen's Kappa for dialog topics is 0.86. Live-chat conversations generally serve the purposes of solution exploration and discussion stimulation. To uncover underlying patterns that shape and/or direct more productive conversations, we first adopt a developer intent codebook [50] and manually label the interaction links that appear in each dialog. The developer intent codebook is built from previous work on user intent in information-seeking conversations [50], as summarized in Table 2. Then, we employ an open card sort [52] process to assign an interaction pattern to each dialog based on the sequence of developers' intents. In an open sort, the sorting begins with no predefined patterns, and participants develop their own. Two participants individually assign patterns to the same dialogs. The sorting process is conducted in multiple rounds. In the first round, all participants label the dialogs of one community, with an intensive discussion session to achieve conceptual coherence about the patterns. A shared pool of patterns is carefully maintained, and each participant can select existing patterns from the pool and/or add new pattern names to it. We then divide into two teams to label the remaining dialogs. Each dialog receives two pattern labels, and we resolve disagreements based on majority voting. The average Cohen's Kappa for interaction patterns is 0.82. After identifying the underlying interaction patterns, we further explore their statistical characteristics in terms of distribution and duration.
We calculate the duration of a dialog as Duration = t_end − t_init, where t_end is the time of the dialog's last utterance and t_init is the time at which the initiator launched the dialog. This metric reflects the life cycle of one dialog. Note that, to keep the workload of manually labeling each dialog manageable, we answer RQ3 and RQ4 by manually analyzing the 749 sampled dialogs. Although we could only manually analyze a small percentage of the disentangled dialogs, we believe this dataset supports our methodology as being useful for discovering valuable findings. To answer RQ1, we analyze two metrics, i.e., utterance time and response time, and report the results of comparing these metrics across the eight Gitter communities. Utterance time. Figure 3(a) compares the distribution of utterance intensity over 24 hours across the eight communities. First, we identify the peak hours of each community (red dashed circles), then highlight the time windows containing the peak hours (yellow shade). We can see that there are three windows of peak hours: UTC 9 to 10, 13 to 14, and 18 to 21. In addition, UTC 1 to 6 corresponds to the low chatting-activity hours, when developers are less active. Figure 3(b) shows the distribution of the utterances across different weekdays: developers chat more on workdays than on weekends (UTC). Response time. Figure 3(c) exhibits the distribution of response time calculated from the 173,278 dialogs of the eight communities. The average response time is 220 seconds, the maximum time lag is 1,264 seconds, and the minimum time lag is two seconds. The peak point is (23, 393), meaning that 393 dialogs got replies in 23 seconds. The count of dialogs rises sharply from 0 to 23 seconds of time lag, then descends in a long tail. Eighty percent of the dialogs receive first responses within 343 seconds.
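The duration metric defined above can be computed directly from utterance timestamps; a minimal sketch over a hypothetical dialog:

```python
from datetime import datetime

ts = datetime.fromisoformat

# Hypothetical dialog: chronologically ordered utterance timestamps.
utterances = [
    ts("2020-11-20 10:00:00"),  # initiator launches the dialog
    ts("2020-11-20 10:00:23"),  # first response
    ts("2020-11-20 10:58:00"),  # last utterance closes the dialog
]
# Duration = t_end - t_init, expressed in hours to match the reported results
duration_hours = (utterances[-1] - utterances[0]).total_seconds() / 3600
print(round(duration_hours, 2))
```

The same subtraction over the first two timestamps yields the response time lag, so both metrics fall out of one pass over a disentangled dialog.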
As reported by a recent study on Stack Overflow [43], the threshold for fast answers was 439 seconds. In comparison, live chat gets replies roughly 50% faster ((439 − 220)/439 ≈ 50%) than even the fast answers on Stack Overflow; we therefore consider responses in live chat to be relatively fast. Answering RQ1: The peak hours for live chat are UTC 9 to 10, 13 to 14, and 18 to 21, while UTC 1 to 6 are the low-activity hours. Developers are more likely to chat on workdays than weekends, especially on Wednesdays and Thursdays (UTC). Moreover, live chat gets replies 50% faster than the fast answers on Stack Overflow. To answer RQ2, we first examine the structural properties of the developer social networks across the eight communities, and then draw common observations from these networks. Properties of social networks. Table 3 shows the social network properties of the eight communities. Init.%, Resp.%, and Both% denote the percentages of developers serving the role of dialog initiator, respondent, and both, respectively. Intuitively, we consider that respondents share their knowledge with others, while initiators receive knowledge from others. We can see that four communities (Appium, Docker, Gitter, and Ethereum) have a higher percentage (75.04%-81.70%) of dialog initiators and a lower percentage (18.30%-24.96%) of respondents/both. The high percentage of dialog initiators may relate to the applicable nature of these open-source projects; e.g., Ethereum is one of the most widely used open-source blockchain systems, so a large number of users seek technical support from live chat. The other four communities (Angular, DL4J, Nodejs, and Typescript) have a higher percentage (29.94%-48.62%) of respondents/both.
A possible explanation is that these four projects are more widely used for development purposes; e.g., Angular is a platform for building mobile and desktop web applications. Such communities therefore appear to be knowledge-sharing and collaborative. Categorizing developer social networks. Figure 4 shows the social network visualizations of the eight communities generated by Gephi. Each node represents one developer, and each edge denotes a dialog relationship between two developers. We color initiator vertices blue, respondent vertices white, and vertices of both roles orange; a node's size indicates its degree. Based on our observation of the community structures, we categorize the eight communities into three groups: (1) the Polaris network, a highly centralized network where the community is organized around a single focal point; (2) the Constellation network, a moderately centralized network where the community is organized around multiple focal points; and (3) the Galaxy network, a decentralized network where all individuals in the community have similar relationships. In Figure 4, the four communities on the top (Angular, DL4J, NodeJS, and Typescript) belong to the Constellation (moderately centralized) network; three communities (Appium, Docker, and Gitter) belong to the Polaris (highly centralized) network; and the remaining community, Ethereum, belongs to the Galaxy (decentralized) network. Previous studies have shown that a highly centralized network may reflect an uneven distribution of knowledge across the community, with knowledge mostly concentrated at the focal points [38, 44].
Therefore, the three Polaris communities (Appium, Docker, and Gitter) may run a higher risk of single-point failure if a focal developer becomes inactive, whereas the Galaxy network (Ethereum) has the lowest risk, followed by the Constellation networks (Angular, DL4J, NodeJS, and Typescript). In Table 3, we can also see that the Constellation and Polaris networks have higher scores in terms of average degree (1.96-9.15), betweenness (0.000273-0.001342), and closeness (0.31-0.43). These phenomena indicate that the focal points in Constellation and Polaris networks make the communities more connected. A study on email-connected social networks [11] shows that the mean betweenness of developers is 0.0114, higher on average than in the live chat communities. Nodes with high betweenness may have considerable influence within a network by allowing information to pass from one part of the network to another; the lower betweenness suggests that developers in live chat may have less influence than developers in email in spreading information. However, the average in-degree and out-degree of the email-based networks are significantly lower, at 0.00794 and 0.00666 respectively, while developers in live chat show more concentration and higher density. Together with the closeness centrality values, this indicates a more closely connected community than that built from email. Developers in Constellation communities have higher clustering coefficient scores (0.14-0.60), indicating that they are more densely connected, i.e., the developers of Angular, DL4J, Nodejs, and Typescript know each other better than those in the other communities. Answering RQ2: By visualizing the social networks of the eight studied communities, we identify three social network structures in developers' live chat. Half of the communities (4/8) are Constellation networks, a minority (3/8) are Polaris networks, and only one community belongs to the Galaxy network.
In comparison, we find that developers in live chat may have less influence than developers in email in spreading information, but form a more closely connected community than that built from email. The innermost circle of Figure 5 shows that, across all eight communities, 89.05% of dialogs concern domain-related (DR) topics, i.e., topics related to the business domain of the community, while 10.95% of dialogs concern non-domain-related (N-DR) topics such as general development or social chatting. DR topics can be further decomposed into three sub-categories based on their different purposes: solution-oriented dialogs have the highest proportion (35.25%), followed by problem-oriented dialogs (32.98%) and knowledge-oriented dialogs (20.83%). Among the 35.25% solution-oriented dialogs, 29.37% are about API usage and 5.87% are about review. Among the 32.98% problem-oriented dialogs, most (20.29%) discuss discrepancy, consisting of unwanted behavior, do-not-work, reliability issues, performance issues, and test/build failures; developers discuss 'unwanted behavior' and 'do not work' more than reliability issues, performance issues, and test/build failures. Among the 20.83% knowledge-oriented dialogs, most (13.75%) discuss conceptual topics, including background information. Answering RQ3: Developers launch solution-oriented and problem-oriented dialogs more often than knowledge-oriented ones. Nearly 1/3 of dialogs are about API usage. Developers discuss errors, unwanted behavior, and do-not-work more than reliability issues, performance issues, and test/build failures. Interaction patterns. Figure 6 illustrates the six interaction patterns in live chat, constructed using open card sorting as introduced in Section 3.3.4. The figure shows dialog initiators as blue nodes and respondents as yellow nodes.
The lines denote the reply-to relationships, and the labels represent the developer intents listed in Table 2. In this work, we identify the following six interaction patterns: (1) P1: Exploring Solutions. Given the original questions posted by the dialog initiator, other developers provide possible answers. Figure 7 shows the percentage of interaction patterns in different communities, with the average percentages shown in the legends. P1~P6 refer to the six interaction patterns defined above. We can see that the direct/discussed answer (P4) pattern takes the largest proportion in most communities. In addition, we note that a small fraction of dialogs (1%) are self-answered monologues, while 24% of dialogs are unanswered monologues; nearly 1/4 of dialogs did not get responses in live chat. We discuss monologues further in Section 5.1. Duration of patterns. Figure 8 shows violin plots with the distribution of duration for each pattern. P1~P5 refer to the interaction patterns defined above. We only exhibit five patterns here because P6 refers to unanswered monologues, which barely have a duration. We can see that although P1 takes a small proportion of dialogs, it lasts the longest: its average duration is 1.00 hour. P2 and P3 last slightly longer than P4. P5 lasts the shortest, with an average duration of 0.02 hours. Answering RQ4: Six interaction patterns are identified in live chat: exploring solutions, clarifying answer, clarifying question, direct/discussed answer, self-answered monologue, and unanswered monologue. The direct/discussed answer pattern takes the largest proportion in most communities. On average, 1/4 of dialogs did not get responses. Dialogs belonging to the Exploring Solutions pattern last the longest. In this work, we take a first look at developers' live chat on Gitter in terms of communication profile, community structure, discussion topic, and interaction pattern.
Our work paves the way for other researchers to apply the same methods in other software communities. Additionally, as communication is a large part of successful software development, and Gitter is one of the main platforms for communication among GitHub users, it is important to explore how software engineers use Gitter and their pain points in using it. Aiming to promote efficient and effective OSS communication, we discuss the main implications of our findings for OSS developers, communities, platform vendors, and researchers. Based on our findings, we present the following implications for individual OSS developers to attract attention and receive responses effectively and efficiently. (1) Provide example code or data when seeking solution help (RQ3, RQ4). In Figure 5, we reported that nearly 1/3 of dialogs are problem-oriented. In live chat, it is important to provide example code in problem-oriented dialogs, so that other developers can quickly understand the problem without missing key information. This finding is also in line with the evidence provided by previous studies [12, 14] on Stack Overflow. (2) Be aware of low-active hours (RQ1). Our results show that developers are more active during certain time slices in live chat. Figure 3(a) demonstrates that the most active time slices are UTC 9-10, 13-14, and 18-21, corresponding to Central European/American daytime or Asian nighttime. Noticeably, more developer live chatting happens on Wednesdays and Thursdays than on other weekdays (UTC), which possibly corresponds to communication, coordination, and preparation for integration/release deadlines on Fridays. This observation also echoes the view of OSS as a "commercially viable alternative" reported in recent studies [23, 29, 68]. One common finding is that the traditional notion of OSS projects driven purely by volunteer developers is now outdated.
OSS has become a commercially viable alternative, and some OSS projects have become critical building blocks for organizations worldwide. For example, Docker is widely used by software companies around the world, including Adobe, AT&T, PayPal, etc. Therefore, instead of chatting on weekends, developers likely discuss their problems in live chat on workdays. Low-active time slices (UTC 1-6) mostly correspond to Central European/American nighttime or Asian daytime. In cases where developers find issues and need support during low-active hours, we suggest several options. First, it is recommended to simultaneously post questions to alternative channels, e.g., issues and emails. Second, they had better follow up in live chat if they do not receive timely responses to questions posted during low-active hours. Finally, they could employ an automated reminder bot, for example, to review the list of questions posted during low-active hours. (3) Avoid asking amid ongoing discussions (RQ4). When identifying the unanswered monologue patterns in dialogs, we note that 30% of them are launched in the middle of ongoing, active, and intense conversations on a different topic. In such cases, new questions are easily flooded by the utterances of the ongoing discussions. Therefore, to increase the chance of getting responses, developers could post their questions after the ongoing discussions conclude. In case an urgent matter emerges, we suggest that platform vendors provide special accommodations to flag such urgency and redirect the team's attention to it, such as multi-threaded conversations (e.g., in Slack) or a highlight tag for urgent questions, with the community supervising the usage of the urgent tag to avoid abuse. We provide the following recommendations for OSS managers to improve the management and coordination of their communities. (1) Mitigate the risk of single-point failure (RQ2).
As reported in RQ2, the three Polaris communities (Appium, Docker, and Gitter) may have a higher risk of single-point failure if the focal developer leaves or becomes inactive. Noticeably, the three Polaris networks contain secondary focal points smaller than the most focal ones, which may suggest practical strategies for mitigating the risk of single-point failure. For example, the Polaris communities may design and employ appropriate incentives or policies for secondary focal developers to improve the resilience of the live chat communities. (2) Improve OSS documentation for newcomers (RQ1, RQ4). It is reported that some newcomers complain that it is hard to get started on a new project and to get timely help from other community members, which may make them gradually lose motivation or even give up on contributing [63]. To help newcomers familiarize themselves with a project and contribute more efficiently, OSS communities may consider utilizing the results of our study to improve OSS documentation. For example, the results of RQ1 show active and low-active time slices, and the results of RQ4 show that many unanswered monologues ask questions amid ongoing discussions. That information could be incorporated into README documents for newcomers who are looking to contribute to a project and want timely help from others. This section discusses several desired features for facilitating more productive conversations. Specifically, these are organized from the communication platform vendors' perspective, in support of more intelligent and productive chatting options, leveraging the mining and sharing of knowledge embedded in intensive historical conversations. (1) Highlight and organize conversation topics (RQ3). As suggested by a previous study [47], multi-dimensional separation of concerns is a powerful concept supporting collaborative development by breaking a large discussion down into many smaller units.
This suggests that online communication platform vendors could support a set of predefined panels that focus on certain topics. In the results of RQ3, we provide a taxonomy of discussion topics in live chat. Platform vendors could refer to this taxonomy to create topic panels, for example, an API-usage panel, an Error panel, and a Background-info panel. These multiple panels could bring the following benefits to community members: (i) quickly understanding and retrieving the intents of dialog initiators; (ii) reducing interference and focusing more on topics of interest; and (iii) identifying important information reported by the developers. (2) Annotate important questions (RQ3). When investigating dialog topics, we note that certain types of dialogs suggest information for future software evolution. For example, 3.6% of dialogs discuss new features, and 8.28% discuss unwanted behaviors. These dialogs are valuable for product teams to plan future releases. Meanwhile, other types of dialogs indicate unrevealed defects in existing systems. For example, 11.62% of dialogs discuss errors, 4.81% discuss something that does not work, 3.87% discuss reliability issues, and 1.74% discuss performance issues. Properly annotating those dialogs with "Feature Request", "Enhancement", and "Bugs" would help preserve valuable information and contribute to the productivity and quality improvement of the software. As an example, techniques for mining live chat have been explored to identify feature requests from chat logs [57]. Research in the SE area could be dedicated to promoting efficient and effective OSS communication in the following directions. (1) Automatically recommend similar questions (RQ4). Existing online communication platforms only record massive histories of chat messages, but do not consider a deeper utilization of those historical data.
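One way such historical data could be utilized is lightweight lexical retrieval. Below is a minimal sketch, with hypothetical chat messages, that ranks past questions against a new one using TF-IDF cosine similarity; it is a baseline illustration, not the approach studied in this paper:

```python
# Sketch: retrieving the most similar past question from chat history
# using hand-rolled TF-IDF vectors and cosine similarity.
# The messages below are hypothetical examples.
import math
from collections import Counter

history = [
    "how to save an early stopping model checkpoint",
    "getting a NotFound error when loading a route",
    "is there any documentation for the REST API",
]
query = "where can I find documentation for the API"

def tfidf_vectors(docs):
    """Return one {token: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(tokenized)
    return [
        {t: c / len(doc) * math.log((1 + n) / (1 + df[t]))  # smoothed idf
         for t, c in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Vectorize history plus the new query in one vocabulary.
vecs = tfidf_vectors(history + [query])
query_vec = vecs[-1]
best = max(range(len(history)), key=lambda i: cosine(query_vec, vecs[i]))
print("most similar past question:", history[best])
```

A production approach would need to disentangle dialogs first, normalize noisy chat text, and likely use learned embeddings rather than bag-of-words overlap.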
We note that developers sometimes post similar questions in live chat. In DL4J, one initiator posted a question and got a reply like this: "Someone else asked a very similar question a while ago." However, it is not easy for the initiator to accurately retrieve the similar question from the massive message history. In addition, some questions go unanswered largely because similar questions have previously been answered. Therefore, it would save developers' effort if researchers developed approaches that automatically recommend similar questions and the corresponding discussions. (2) Automatically assign appropriate respondents (RQ4). By analyzing the dialogs belonging to the exploring solutions pattern (P1), we note that respondents who are not quite familiar with the technologies related to the posted questions might give ineffective solutions. Although such discussions could help developers understand the problem better, the multiple trial-and-error interactions still prolong the process of issue resolution. Therefore, to make conversations more productive, it would be desirable to develop approaches that recommend or assign appropriate respondents according to their historical answers. (3) Automatically push valuable information to project repositories (RQ3). Valuable information such as feature requests or issue reports, either manually annotated by developers or automatically detected by tools, needs to be well documented and well traced within the scope of project repositories. Typically, code repositories such as GitHub or GitLab provide issue-tracking functionality. It would be more efficient if researchers could provide a convenient way to directly push or integrate such valuable information into the code repository. In addition, the following linguistic patterns might be helpful to automatically classify dialog topics.
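Such linguistic patterns could be encoded as simple regular expressions for a first-cut, rule-based topic classifier. The sketch below is illustrative only; the regexes approximate, rather than reproduce, the patterns reported in this section:

```python
# Sketch: rule-based tagging of utterances into three of the topic
# categories from RQ3. The regexes are illustrative approximations.
import re

PATTERNS = {
    "API usage": [
        r"\bhow (to|can i)\b",
        r"\bcan anyone help me with\b",
        r"\bis there any way to\b",
        r"\bi (want|need) to\b",
    ],
    "Error": [
        r"\bi (get|receive) (this|an?) \w*\s?error\b",
        r"\banyone had an error like this\b",
        r"\bdoes anyone know a solution for\b",
        r"\b\w+(Exception|Error)\b",  # e.g. NullPointerException
    ],
    "Background info": [
        r"^(what|why|when)\b",
        r"\bi would like to know\b",
        r"\bis there \w+ for\b",
    ],
}

def classify(utterance):
    """Return the first topic whose patterns match, or None."""
    for topic, regexes in PATTERNS.items():
        if any(re.search(rx, utterance, re.IGNORECASE) for rx in regexes):
            return topic
    return None

print(classify("How to bundle my Angular 2 app into a bundle.js file?"))
print(classify("I receive a NotFound error"))
print(classify("When Appium will support Xcode 8.2?"))
```

A real classifier would need to handle noisy chat text (typos, code fragments, interleaved dialogs) and would likely combine such rules with a learned model.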
It is observed that questions from the API usage category include phrases such as "how to do sth?", "how can I do sth?", "can anyone help me with sth?", "is there any way to sth?", or "I want/need to do sth". For example, "How to bundle my Angular 2 app into a 'bundle.js' file?" and "Can anyone help me with PWA using angular 2". Questions from the Error category are likely to contain "I get/receive this error", "Anyone had an error like this", "Does anyone know a solution for sth", or to directly post specific exception names. For example, "I receive a NotFound error" and "Anyone had an error like this before when trying to load a route?" Questions from the Background info category are likely to contain "what/why/when...", "I would like to know sth", or "is there sth for...". For example, "When Appium will support Xcode 8.2?" and "Are there any limitations for automating the iOS app made with Swift 3?". (4) Analyze effects of social chatting (RQ3). As reported in RQ3, 10.95% of dialogs concern non-domain-related (N-DR) topics such as general development or social chatting. A recent study [46] emphasizes the important role of social interactions, such as the simple phrase "How was your weekend?", in showing peer support for developers working from home during the COVID-19 pandemic. Future work may explore more patterns and effects of social chatting, e.g., via pre-/post-pandemic comparison. External Validity. The external threats relate to the generalizability of the proposed approach. Our empirical study used eight of the most-participated open source communities on Gitter. Although we generally believe all communities may benefit from knowledge learned from more productive and effective communication styles, future studies are needed to focus on less active communities and on comparisons across all types of communities. Internal Validity. The internal threats relate to experimental errors and biases.
The first threat relates to the accuracy of the dialog disentanglement model that we adopted. Although we select the best-performing model among state-of-the-art approaches to disentangle dialogs, its accuracy is still not fully satisfactory, which may affect the results of RQ1 and RQ2. To address this issue, one of our ongoing efforts is to build a new, more accurate dialog disentanglement model based on deep learning. The second threat relates to the random sampling process. Sampling may lead to incomplete results, e.g., an incomplete topic taxonomy or set of interaction patterns. In the future, we plan to enlarge the analyzed dataset and inspect whether new topics or interaction patterns emerge. The third threat comes from the process of manual disentanglement and card sorting. We understand that such a process is prone to mistakes. To reduce this threat, we establish a labeling team and perform peer review on each result. We only adopt data that receive full agreement, or on which agreement is eventually reached after discussing the differing options. Construct Validity. The construct threats relate to the suitability of evaluation metrics. In this study, the manual labeling of topics and interaction patterns is a construct threat. To minimize it, we follow a well-established approach used by previous work [5, 10, 53] to build reasonable taxonomies for textual software artifacts. Our work is related to previous studies on synchronous and asynchronous communication in the OSS community. Synchronous Communication in the OSS community. Recently, more and more work has recognized that live chat via modern communication platforms plays an increasingly important role in team communication. Lin et al. [40] conducted an exploratory study on understanding the role of Slack in supporting software engineering by surveying 104 developers.
Their research revealed that developers use Slack for personal, team-wide, and community-wide purposes, and that they use bots for team and task management in their daily work. They highlighted that live chat plays an increasingly significant role in software development, replacing email in some cases. Shihab et al. [58, 59] analyzed the usage of developer IRC meeting channels in two large open-source projects along several dimensions: meeting content, meeting participants, their contribution, and meeting styles. Their results showed that IRC meetings are gaining popularity among open source developers, and highlighted the wealth of information that can be obtained from developer chat messages. Yu et al. [67] analyzed the usage of two communication mechanisms in global software development projects: synchronous (IRC) and asynchronous (mailing list). Their results showed that developers actively use both mechanisms in a complementary way. To sum up, existing empirical analyses of live chat mainly focused on usage purposes [40], the usage of live meetings [58, 59], and comparisons with different communication mechanisms and knowledge-sharing platforms [67]. There is a lack of in-depth analysis of community properties and detailed discussion contents. Our study bridges that gap with a large-scale analysis of communication profiles, community structures, dialog topics, and interaction patterns in live chat. Asynchronous Communication in the OSS community. Prior studies have empirically analyzed asynchronous communication in the OSS community, including mailing lists, issue discussions, and Stack Overflow. Bird et al. [11] mined the email social network of the Apache HTTP Server project. They reported that the email social network is a typical electronic community: a few members account for the bulk of the messages sent and the bulk of the replies. Di Sorbo et al.
[60] proposed a taxonomy of intentions to classify sentences in developer mailing lists into six categories: feature request, opinion asking, problem discovery, solution proposal, information seeking, and information giving. Although the taxonomy has been shown to be effective in analyzing development emails and user feedback from app reviews [48], Huang et al. [36] found that it cannot be generalized to discussions in issue tracking systems, and they addressed the deficiencies of Di Sorbo et al.'s taxonomy by proposing a convolutional neural network-based approach. Arya et al. [4] identified 16 information types, such as new issues and requests, solution usage, etc., through quantitative content analysis of 15 issue discussion threads on GitHub. They also provided a supervised classification solution, using Random Forest with 14 conversational features to classify sentences. Allamanis and Sutton [3] presented a topic modeling analysis that combines question concepts, types, and code from Stack Overflow to associate programming concepts and identifiers with particular types of questions, such as "how to perform encoding". Similarly, Rosen and Shihab [51] employed Latent Dirichlet Allocation-based topic models to summarize mobile-related questions from Stack Overflow. Our work differs from existing research in that we focus on synchronous communication, which poses different challenges, as live chat logs are informal, unstructured, noisy, and interleaved. In this paper, we have presented the first large-scale study to gain an empirical understanding of OSS developers' live chat. Based on 173,278 dialogs taken from eight popular communities on Gitter, we explore the temporal communication profiles of developers, the social networks of the communities and their properties, the taxonomy of discussion topics, and the interaction patterns in live chat. Our study reveals a number of interesting findings.
Moreover, we provide recommendations for both OSS developers and communities, highlight advanced features for online communication platform vendors, and raise insightful future research questions for OSS researchers. In the future, we plan to investigate how well we can automatically classify dialogs into different topics, and to construct knowledge bases from already-answered questions and their corresponding solutions in live chat. We hope that the findings and insights we have uncovered will pave the way for other researchers, help drive a more in-depth understanding of OSS development collaboration, and promote a better utilization and mining of the knowledge embedded in massive chat histories. To facilitate replications and other types of future work, we provide the utterance data and disentangled dialogs used in this study online: https://github.com/LiveChat2021/LiveChat.
Acknowledgments. We deeply appreciate the anonymous reviewers for their constructive and insightful suggestions toward improving this manuscript.
References
- A Tool to Prototype and Experiment Angular Codes
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
- Analysis and Detection of Information Types of Open Source Software Issue Discussions
- Going Big: A Large-scale Study on What Big Data Developers Ask
- Gephi: An Open Source Software for Exploring and Manipulating Networks
- Communication Patterns in Task-oriented Groups
- Analyzing the Relationships between Android API Classes and Their References on Stack Overflow
- Automatically Classifying Posts into Question Categories on Stack Overflow
- A Manual Categorization of Android App Development Issues on Stack Overflow
- Mining Email Social Networks
- Building Reputation in StackOverflow: An Empirical Investigation
- Geodesic Distance in Planar Graphs
- How to Ask for Technical Help? Evidence-based Guidelines for Writing Questions on Stack Overflow
- Software-related Slack Chats with Disentangled Conversations
- Microsoft Corporation. 2020. Nodejs
- Microsoft Corporation. 2020. Typescript
- Logic and Uncertainty in Information Retrieval
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Graph Theory
- Making Sense of Card Sorting Data
- The Transformation of Open Source Software
- Ethereum Foundation. 2020
- JS Foundation. 2020. Appium
- A Set of Measures of Centrality Based on Betweenness
- Gitter. 2020. REST API
- Trends in Free, Libre
- Who Is Answering to Whom? Finding "Reply-To" Relations in Group Chats with Long Short-Term Memory Networks
- Communication in Open Source Software Development Mailing Lists
- What Do Programmers Discuss About Deep Learning Frameworks
- Automating Intention Mining
- Communication in Open-source Projects: End of the E-mail Era
- Building Sustainable Communities through Social Network Development
- A Large-scale Corpus for Conversation Disentanglement
- Why Developers Are Slacking Off: Understanding How Software Teams Use Slack
- End-to-End Transition-based Online Dialogue Disentanglement
- Haste Makes Waste: An Empirical Study of Fast Answers in Stack Overflow
- Network Structure in Virtual Organizations
- Predicting Failures with Developer Networks and Social Network Analysis
- Software Development Teams Working From Home During COVID-19
- Multi-Dimensional Separation of Concerns in Requirements Engineering
- How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution
- GitterCom: A Dataset of Open Source Developer Communications in Gitter
- User Intent Prediction in Information-seeking Conversations
- What Are Mobile Developers Asking About? A Large Scale Study Using Stack Overflow
- The Sorting Techniques: A Tutorial Paper on Card Sorts
- Decomposing the Rationale of Code Commits: The Software Developer's Perspective
- On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification
- Social Network Analysis in Software Development Projects: A Systematic Literature Review
- Thread Detection in Dynamic Text Message Streams
- Detection of Hidden Feature Requests from Massive Chat Messages via Deep Siamese Network
- On the Use of Internet Relay Chat (IRC) Meetings by Developers of the GNOME GTK+ Project
- Studying the Use of Developer IRC Meetings in Open Source Projects
- Development Emails Content Analyzer: Intention Mining in Developer Discussions (T)
- Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions
- Performance Analysis of Molecular Complex Detection in Social Network Datasets
- A First Look at Good First Issues on GitHub
- How Do Programmers Ask and Answer Questions on the Web
- Social Network Analysis: Methods and Applications
- Collective Dynamics of 'Small-world' Networks
- Communications in Global Software Development: An Empirical Study Using GTK+ OSS Repository
- How Do Companies Collaborate in Open Source Ecosystems? An Empirical Study of OpenStack