key: cord-0295261-c4jxlyhw
authors: Wang, Dakuo; Wang, Haoyu; Yu, Mo; Ashktorab, Zahra; Tan, Ming
title: Group Chat Ecology in Enterprise Instant Messaging: How Employees Collaborate Through Multi-User Chat Channels on Slack
date: 2019-06-04
journal: nan
DOI: 10.1145/3512941
sha: 97520353cf11a262a9fb0ed706b694f77744e09b
doc_id: 295261
cord_uid: c4jxlyhw

Despite the long history of research on instant messaging, we know very little about how people today participate in group chat channels and interact with others inside a real-world organization. In this short paper, we aim to update the existing knowledge of how group chat is used in today's organizations. This knowledge is particularly important for the new norm of remote work under the COVID-19 pandemic. We had the privilege of collecting two valuable datasets: a total of 4,300 group chat channels in Slack from an R&D department in a multinational IT company, and performance data for a total of 117 project teams. Through qualitative coding of 100 channels randomly sampled from the 4,300-channel dataset, we identified and report 9 categories, such as Project channels, IT-Support channels, and Event channels. We further defined a feature metric with 21 meta features (and their derived features), computed without looking at the message content, to depict the communication style of these group chat channels. With this metric, we trained a machine learning model that can automatically classify a given group channel into one of the 9 categories. In addition to the descriptive data analysis, we illustrate how these communication metrics can be used to analyze team performance. We cross-referenced the 117 project teams with their team-based Slack channels and identified 54 teams that appeared in both datasets; we then built a regression model to reveal the relationship between group communication styles and project team performance.
This work contributes an updated empirical understanding of human-human communication practices within the enterprise setting, and suggests design opportunities for the future of human-AI communication experience.

Instant messaging (IM) systems emerged in the 1980s and have slowly been adopted by individuals and organizations over the past few decades. After years of research and development effort in CSCW (e.g., [3, 8, 10, 16, 26-28, 38]), modern IMs (e.g., Slack and Skype) are much more advanced. More and more companies and organizations take it for granted that their workers can exchange information and files efficiently through these IMs. Slack is one such tool that has been adopted by offices and workplaces [4-6, 24, 32, 33]. Compared to the previous generation of workplace IMs (e.g., Lotus Sametime), it places more emphasis on the group chat feature and provides a much better user experience for multi-party chats. It is estimated that Slack reduces emails by 32% and meetings by 23%, and that it helps new employees reach full productivity 25% sooner [31].

In parallel with the recent advancement of IM technologies, text-based AI applications that reside in IMs (e.g., chatbots or topic summarization) have attracted much interest from researchers and businesses. These applications are often built on a language model that simulates how a human chats with another human [30, 35, 37]. Many HCI and natural language processing (NLP) researchers have invested substantial effort in building better algorithms and applications, but these efforts have often failed [2, 11]. This is partially because many such efforts have been based on an outdated understanding of how people chat with each other and of their communication styles (e.g., [34] and many other NLP works assumed that a conversation thread has more than 10 turns and fewer than 100 turns).
Thus, there is an urgent need to update the existing understanding of how people chat in groups, and to do so at a finer-grained level, such as how many conversation threads with different topics occur in a group chat channel and how deep each thread is. Another dimension of fine-grained understanding is the group chat ecosystem. AI researchers often focus on only a single context of group chat [34]. For example, the well-known human-human group chat datasets used by NLP researchers are Ubuntu-IRC [9] and the Twitch dataset [14], which focus on IT development and gaming and thus have limited generalizability. Another line of research relies on synthetic datasets, often generated from Reddit [34] or StackOverflow, but these synthetic datasets from online forums are not guaranteed to resemble the communication styles found in real IM datasets.

This paper aims to 1) update the understanding of organizational group chat practices, and 2) examine whether group chat styles can quantitatively reflect team performance. As a first step, we collected a total of 4,300 group chat channels created and used by 8,000 employees in an R&D division of a large IT company. We collected both the metadata and the raw messages of these channels, spanning from March 2016 (when Slack was first introduced) to March 2019 (when this paper was written). We then randomly selected 100 group channels and manually coded them into 9 different categories. One particular challenge is that the content of such an organizational data corpus is often proprietary, so we further designed a machine learning classification model that leverages only the metadata of such chat channels (e.g., # of messages and # of threads), but not the content, to identify a channel's category. As for the second research goal, much prior literature has suggested that a team's communication and collaboration patterns may predict team performance [4, 12, 18, 29, 36].
In our study, we had the privilege of accessing another data corpus with 117 teams' performance data. By cross-referencing these two datasets, we identified 54 teams that existed in both. Thus, our paper can also examine whether a group's communication style can quantitatively reflect the team's performance.

There are decades of CSCW literature on IM system design and user studies, and we do not have space to cover all of it. We only want to point out one particular genre of research on IMs, which centers on multi-party group chat. Back in 1994, McDaniel et al. compared how people chat with each other in a face-to-face (FTF) setting versus in text-based computer-mediated communication (CMC) systems, the early name for IM and videoconferencing systems [26]. They found that a group of people chat on multiple concurrent threads in a conversation (2 to 3 for FTF and 4 to 6 for CMC) and that the threads have different timespans (2.8 minutes for FTF and 23.3 minutes for CMC). These results are intriguing, but the data corpus and the analysis method they used were quite primitive: they analyzed 6 chat groups, labeled the timestamp of each message, put the messages on the same timeline, and manually counted the threads.

Another notable work on early IM usage is from Bonnie Nardi and Steve Whittaker [27] in 2000. They focused on the early adopters of an IM system in a workplace setting and reported how these users used, or would like to use, IMs. For example, they reported that users use IMs not only for work-related question-and-answer activities ("interaction") but also for informal purposes, such as checking whether someone is available for a chat. They also reported that those early IM systems were designed around a dyadic "call" model, in which users treat the system like a phone to find another individual to chat with, and that users preferred not to use "chat room" features.
Together with other research efforts, the design implications stemming from these findings guided the following two decades of IM system design, e.g., having a status feature and lacking a group chat feature in workplace IMs. Now that new tools exist and people are using them, it is time to revisit this research topic after 20 years.

In addition to the HCI research effort to understand human chat behavior, NLP researchers have investigated techniques to automatically analyze text-based human conversations. However, most of this NLP research focuses on one-on-one conversation scenarios (i.e., a conversation between a chatbot and a human user), in which traditional information retrieval methods, syntactic/semantic parsing techniques, and neural sequence-to-sequence generation models are integrated into the chatbot [15, 40, 43]. In the domain of group chat analysis, there are a few works on disentangling interleaved conversations into threads that each discuss a single topic [9, 19, 25, 34] and on extracting knowledge from conversational dialogues [17]. These works are limited in that they neglect the richness of conversation patterns across different conversational categories. For example, the work in [34] was conducted on particular interest-based Reddit forums, and the work in [17] focused only on education-related topics; thus they have limited generalizability to other domains or contexts.

One notable work that is particularly relevant to our project is from Lin et al. [24]. They designed an exploratory study to investigate how software engineers use Slack. In a survey study, respondents described why and how they use Slack. For example, the respondents mentioned three types of benefits of using Slack: personal benefits, team-wide benefits, and community benefits.
Lin et al.'s work sheds light on the varied use patterns and perceived benefits of group chat, but because their data came only from qualitative survey results, it did not provide actionable insights on how to translate the findings into the design of algorithms or applications for the group chat scenario. Our work focuses on the real-world workplace group chat ecology. We argue that a better categorization of different types of conversation groups, obtained by modeling these chat groups' meta-information (without looking at the potentially sensitive message content), is necessary for downstream NLP algorithms and HCI system-building tasks.

One downstream application of identifying group communication styles is predicting team performance. There has been extensive prior literature on this topic. For example, Zhang et al. applied topic model analysis to chat messages to investigate the evolution of team dynamics over a long-term project [42]. They describe common behaviors and team cohesion dynamics. It is well known in the CSCW community that people who work in teams may not put in as much effort as they would if they were working individually (i.e., "slackers") [20]. Furthermore, coordinating individuals' contributions through communication is challenging and critical [7], and many CSCW systems have been proposed to address some of these challenges [22]. One notable recent work is from Cao et al. [4]. They designed an online experiment and recruited crowdworkers to form chat groups and perform a group-based task. Then, with various machine-generated and human-labeled features, they built machine learning models to predict team viability, a metric reflecting how successful the team collaboration is. Their machine-generated features consist mostly of computational linguistic features (e.g., pronouns) derived from the message content, plus two meta features (i.e., # of messages per person and # of words per person).
In this work, we follow the trend of adopting machine learning methods to extract group behaviors and then predict group performance. In comparison to these prior works, our work investigates a real-world dataset as opposed to an experimental study (e.g., [4]), and, to preserve user privacy, our model does not require the actual message content.

In this section, we describe our datasets, the open coding method we used to identify the 9 categories of group chat channels, the machine learning methods we used to pre-process the datasets and extract the feature metrics, and the regression method we used to analyze the relationship between the feature metrics and team performance in a subset of 54 project teams.

In total, two datasets are used in this study: a Slack message dataset with 4,300 public channels in an R&D department of a company, and a dataset of 117 project teams with team composition and performance information. From the 4,300-channel dataset, we randomly selected 100 channels as a subset for manual coding analysis. We cross-referenced the 117-team performance dataset with the 4,300 channels and found that 54 project teams have a designated and publicly accessible group chat channel; we therefore prepared a sub-dataset of these 54 project teams with both project performance and Slack group chat data.

Here, we provide more details about the project team performance dataset. In this R&D department of the multinational IT company, a re-organization occurred in November 2017 and 117 project teams (not the same as organizational teams) were formed. For a period of a few months after November 2017, all of these teams were encouraged to submit papers to an academic conference as their primary goal, and 146 submissions were generated by the conference submission deadline.
From an internal project management portal, we collected information about these project teams, such as the project description and team members' information. We use whether a project team generated a paper submission as the final outcome reflecting its performance. If a team has one or more submissions to the conference, we set the outcome variable to 1, otherwise 0. We acknowledge that this way of describing team performance has many limitations, which we elaborate on in the limitations section at the end of the paper. In addition, each project team was required to have a Slack group chat channel, but many of those channels are private to the team members for confidentiality reasons. In the end, we were only able to collect 54 Slack channels for the 117 teams.

Inspired by prior literature [24] and based on our own observations, group chat channels can vary tremendously in the number of members and other characteristics. Thus, we decided to first conduct a qualitative analysis to identify the different types of Slack groups. We randomly selected 100 channels for manual labeling, and we were prepared to code more channels if new categories kept emerging. The resulting 9 categories suggested that our coding had reached saturation, so we stopped. Specifically, two authors of this paper independently conducted thematic content analysis of each of the 100 Slack channels by reading the content of the channel and various metadata, such as the channel description and the number of members. The two authors independently coded each of the Slack groups and took notes on their reasoning. Then, the two authors discussed their notes and coding schemas (without revealing the code for each channel) and finalized a list of 9 categories (see Table 1). The two authors then re-coded the 100 Slack channels with the agreed code list, reaching a Cohen's kappa of 0.83.
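As an illustration of the inter-rater agreement statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two coders. The category labels and the toy data are hypothetical and are not taken from the study; the function name `cohen_kappa` is our own.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (NOT the study's data).
coder_1 = ["Project", "Project", "Social", "IT Support", "Event", "Social"]
coder_2 = ["Project", "Project", "Social", "IT Support", "Social", "Social"]
kappa = cohen_kappa(coder_1, coder_2)
```

In this toy example the coders agree on 5 of 6 channels, and the chance-corrected agreement is lower than the raw agreement because both coders use "Project" and "Social" frequently.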
To depict the group chat ecology and to support our research purpose, we collected 21 features for each group chat channel based on its meta information. These 21 features (summarized in Table 2) are the whole suite of features that the Slack API can provide to a developer or researcher without accessing the actual message content. Many of these features are self-explanatory, so here we elaborate only on the less obvious ones. #members is the number of people who have joined the channel, and #active_users is the number of people with at least one message in the channel. The number of active users can be larger than the number of current members because people may leave the channel after they post messages. active_timespan captures the number of days from the day the first message was posted in the channel to the day the last message was posted. word_per_message is the average number of words per message. Because a single user sometimes dominates a whole channel, we also measure the maximum number of messages posted by one user. A very common interactive function in Slack is to notify a specific user with @username, or the entire group with @channel or @here. We thus include the number of messages of each of these three types as #at_messages, #channel_messages, and #here_messages. A more complicated interaction is that many people explicitly form a "thread" in a Slack channel for a small discussion on one topic; the feature #threads represents the number of threads within the channel. We also have max_turn_thread and avg_turn_thread to capture the maximum and average number of turns in a thread. Another commonly used function in Slack is to "react" to a specific message with a simple emoji. The features max_count_reaction and avg_count_reaction capture the maximum and average number of reactions per message. #emoji_messages captures the number of messages that contain emojis.
#pinned_messages represents the number of messages that users in the channel have "pinned" because they consider them important. Since people may share code snippets in a channel, we introduce #code_messages for the number of messages with code snippets. We also have #url_messages and #git_messages to represent the number of messages that contain a URL and the number of messages that contain a URL pointing to a GitHub page. People can also easily share files within a channel; we capture this behavior with #file_messages, the number of messages that contain a shared file. In Slack, the owner of a channel can introduce a "Slack bot" that interacts with people in different ways, such as a general question-answering bot or an alert notification bot. We count the number of messages generated by Slack bots as #bot_messages.

Inspired by the literature [4], we note that some of these features may be highly correlated with how many members a chat channel has, how long the channel has existed, and how many total messages it contains. Thus, for each of the count-based features (from feature 6 to feature 21), we also compute three normalized derived features by dividing by #messages (producing 12 new features), #active_users (producing 13 new features), and active_timespan (producing 13 new features). In total, we end up with 59 meta features to represent each Slack group in the machine learning model.

To provide an actionable research asset for fellow researchers and developers who want to classify group chat categories, we conduct an automatic Slack group classification task using all 59 features. After extracting the feature vector for each labeled Slack channel, we build an ensemble tree-based classifier from [13] to predict the category of each channel.
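The raw and normalized meta features described above could be computed along the following lines. This is only a sketch under stated assumptions: the message schema (`user`, `text`, `ts`, `is_bot`) is a simplified, hypothetical format rather than the actual Slack export format, only a handful of the 21 features are shown, and the sketch applies all three normalizations to every count, whereas the paper applies each normalization to a subset that it does not enumerate.

```python
import re


def channel_features(messages):
    """Compute a few raw and derived meta features for one channel.

    `messages` is a list of dicts with HYPOTHETICAL keys:
    "user" (str), "text" (str), "ts" (float, Unix seconds), "is_bot" (bool).
    """
    if not messages:
        raise ValueError("channel has no messages")
    per_user = {}
    for m in messages:
        per_user[m["user"]] = per_user.get(m["user"], 0) + 1
    timestamps = [m["ts"] for m in messages]
    raw = {
        "#messages": len(messages),
        "#active_users": len(per_user),
        # Days between first and last message (floored at 1 to avoid dividing by 0).
        "active_timespan": max(1.0, (max(timestamps) - min(timestamps)) / 86400),
        "#message_top_user": max(per_user.values()),
        "#here_messages": sum("@here" in m["text"] for m in messages),
        "#channel_messages": sum("@channel" in m["text"] for m in messages),
        "#url_messages": sum(bool(re.search(r"https?://", m["text"])) for m in messages),
        "#git_messages": sum("github.com" in m["text"] for m in messages),
        "#bot_messages": sum(bool(m["is_bot"]) for m in messages),
    }
    # Derived features: normalize each count by volume, participants, and time.
    derived = {}
    for key, value in raw.items():
        if not key.startswith("#") or key in ("#messages", "#active_users"):
            continue
        name = key.lstrip("#")
        derived[name + "_per_message"] = value / raw["#messages"]
        derived[name + "_per_active_user"] = value / raw["#active_users"]
        derived[name + "_per_day"] = value / raw["active_timespan"]
    return {**raw, **derived}


# Hypothetical toy channel (NOT the study's data).
messages = [
    {"user": "u1", "text": "@here build is green https://github.com/org/repo", "ts": 0.0, "is_bot": False},
    {"user": "u2", "text": "thanks!", "ts": 86400.0, "is_bot": False},
    {"user": "u1", "text": "deploy finished", "ts": 172800.0, "is_bot": True},
    {"user": "u1", "text": "@channel see https://example.com", "ts": 345600.0, "is_bot": False},
]
features = channel_features(messages)
```

None of these computations read the semantic content of a message beyond matching fixed markers such as `@here` or a URL prefix, which is in the spirit of the metadata-only design described above.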
The rationale for choosing this model is twofold: first, given our features, it is straightforward for a decision-tree-based algorithm to learn good rules that are non-linear combinations of features; second, the number of coded Slack groups is limited, and an ensemble-based approach can help prevent over-fitting. For evaluation, we focus on the overall classification accuracy as well as the precision and recall for each label. Because the number of coded channels is limited, we follow the leave-one-out cross-validation method from [21], which is widely used for model evaluation on small datasets, to measure the overall classification accuracy.

To fulfill our second research goal, we conducted a logistic regression on the subset of the data with 54 project teams. We use the features of each team's Slack channel (all 59, including both raw and derived features) as independent variables, and a binary dependent variable representing whether the team has a publication or not. Among the 54 project teams in our dataset, 35 had published papers and 19 had not. We used the recursive feature elimination algorithm [39] to identify the most predictive features in the model.

Through the qualitative coding of 100 randomly sampled channels, we identified 9 different categories. The code names and descriptions are listed in Table 1, so we do not repeat them here. Based on the manual coding of these 100 data points, the machine learning algorithm [13] can also pick up the differences between categories. In our experiment, the machine learning model for identifying categories achieved 66% overall accuracy on the 100 coded channels. For the Project category, we obtained a precision of 79.4% and a recall of 87.1%, which we believe is reasonably good for our downstream task (Section 4.2) on this specific category.
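The evaluation protocol described above (leave-one-out cross-validation, overall accuracy, and per-label precision and recall) can be sketched as follows. To keep the example dependency-free, a 1-nearest-neighbor rule stands in for the classifier; this is an assumption for illustration only and is not the extremely randomized trees model [13] used in the paper. The feature vectors and labels are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training vector (squared Euclidean)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def leave_one_out_predictions(xs, ys, predict=nearest_neighbor_predict):
    """Predict each item with a model that has seen every other item."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        preds.append(predict(train_x, train_y, xs[i]))
    return preds


def precision_recall(ys, preds, label):
    """Precision and recall for one label (0.0 when undefined)."""
    tp = sum(y == label and p == label for y, p in zip(ys, preds))
    predicted = sum(p == label for p in preds)
    actual = sum(y == label for y in ys)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical 2-feature vectors and labels (NOT the study's data).
xs = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [1.1, 4.0]]
ys = ["Project", "Project", "Project", "Social", "Social", "Social"]
preds = leave_one_out_predictions(xs, ys)
accuracy = sum(y == p for y, p in zip(ys, preds)) / len(ys)
project_precision, project_recall = precision_recall(ys, preds, "Project")
```

Any classifier with the same `predict(train_x, train_y, x)` shape can be passed in place of the stand-in, so the evaluation loop itself does not depend on the model choice.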
We also obtained reasonable performance for the other labels except Employee Support and Announcement, simply because we have only 2-3 data points for each of those categories.

By examining the features for each category, we notice several different communication styles. In Table 2, we provide the average value of each feature for three categories: Project, Social, and IT Support. The differences between categories are apparent in several representative features:

• Project Category: From the results in Table 2, we find that a Project Slack channel usually has fewer messages (355.9) and fewer members (11.1) than the other two categories, which have more than 3,000 messages and about 100 members on average. We believe this is natural for a Project channel, since it consists of a few people working on the project who form a centralized communication group. Even though the total number of messages is limited, as members may also communicate locally, the percentage of #file_messages is substantially higher than in the other categories; we believe members in such channels tend to share files for productive collaboration on a project. If we examine active_timespan for this category, we also notice that the active time span is shorter than in the other two categories, because a project is supposed to finish within a fixed period. We notice a similar characteristic for the Event category, where the active time span is even shorter (144 days), as people tend to quickly form a Slack group to discuss a specific event and then become inactive once the event finishes.

• Social Group Category: A Social group usually has a large number of messages, similar in volume to the IT Support group that we discuss below.
However, Social groups can be differentiated from the other types of groups, including IT Support groups, based on features such as #emoji_messages and #reacted_messages. For this category, we observe more frequent use of emojis (284.7) because of the more casual communication style in such channels. Similarly, we notice that emojis are also widely used in Bot channels, as people tend to build Slack bots that use emojis in conversation to make the bots friendlier. For Social groups, another feature with higher values than in the other categories is #reacted_messages, the number of messages with emoji reactions such as thumbs-up and thumbs-down. This is consistent with the number of emoji messages, as people in these channels tend to adopt a casual communication style. Members of these channels also tend to use @here more frequently than members of other channels. We hypothesize that this is because people are eager to share content with all other members.

• IT Support Category: An IT Support channel usually involves messages that try to solve a specific technical issue, so the number of messages containing code snippets (81.8) is substantially higher than in the other two categories, which have fewer than 10 such messages. Some other features also help differentiate this category. We notice that the average number of turns in threaded messages (4.1) for this category is higher than in all the other categories. We believe the reason is that people need multiple turns of conversation to solve a technical issue. Looking at the feature #message_top_user, we notice that the value for IT Support (1819.9) is substantially higher than for the others, which suggests that this top user is the one who actively provides solutions to most of the IT problems. As for #bot_messages, this type of channel has more messages sent by Slack bots.
This is in accordance with our finding that many IT Support channels use Slack bots to handle basic and frequently asked questions (FAQ) regarding service status or product updates. We also notice a substantially higher usage (953.3) of @username, which we believe people use to tag specific users who can solve a specific technical issue.

To accomplish the second research goal of assessing the feasibility of modeling team performance with only group communication metadata, we feed all 59 features (raw and derived) into a logistic regression model with recursive feature elimination. The model selects the following final features for predicting whether the team has a submission or not: active time span in days, number of bot messages, number of @here messages, normalized messages for the top user, normalized threads, normalized emoji, and normalized code. These features yielded an R² of 0.61. The results of our logistic regression are in Table 3. Below we briefly discuss these features.

Active time span is measured as the number of days from the very first message to the last message. The longer members work together, the more likely they are to have a better output (marginally significant). We found that the number of bot messages is correlated (but not significantly) with a successful outcome. As the most used bot across these channels is the GitHub bot, this may indicate that a team that is more active in GitHub-related activities has a better outcome. Other features are easy to understand: the results suggest that the more a single user dominates the messages in a channel, the more conversation threads are created in the channel, and the more programming code, emojis, and @here messages are used, the more likely the team is to have a paper submission. This is a first step towards a promising research direction in which organizations and management may leverage only the metadata of group communication practices to predict a team's performance.
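The feature selection step described above can be illustrated with a small, dependency-free sketch of recursive feature elimination around a logistic regression. Several choices here are assumptions made for illustration and are not confirmed by the paper: the ranking criterion (drop the feature with the smallest absolute coefficient on standardized inputs), the gradient-descent fitting with a small L2 penalty, and the toy data and feature names.

```python
import math


def standardize(xs):
    """Scale each column to zero mean and unit variance so coefficients are comparable."""
    cols = list(zip(*xs))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in xs]


def fit_logistic(xs, ys, steps=2000, lr=0.1, l2=0.01):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def recursive_feature_elimination(xs, ys, names, keep):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    xs = standardize(xs)
    active = list(range(len(names)))
    while len(active) > keep:
        sub = [[row[j] for j in active] for row in xs]
        w, _ = fit_logistic(sub, ys)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return [names[j] for j in active]


# Hypothetical toy data (NOT the study's data): two informative features and one
# feature whose average is identical for both outcomes.
names = ["active_timespan", "threads_per_message", "uninformative"]
xs = [
    [30, 0.05, 3], [45, 0.10, 1], [60, 0.08, 4], [50, 0.12, 2],
    [200, 0.30, 2], [220, 0.25, 4], [180, 0.35, 1], [240, 0.28, 3],
]
ys = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = team has a submission
selected = recursive_feature_elimination(xs, ys, names, keep=2)
```

With only 54 teams and 59 candidate features, a procedure of this kind can easily overfit, so the selected features are best read as exploratory rather than as a validated predictive model.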
In this section, we describe the implications of our results. We contribute to updating the understanding of communication styles in various context-specific categories. We trained a machine learning model that can automatically categorize Slack group channels in the workplace with reasonably high accuracy. By looking closely at a subset of data with 54 teams that have both Slack communication data and project performance data, we further discuss how this communication style feature metric could be useful for modeling team performance.

Through our manual coding, we identified 9 different categories and their distinct communication patterns. This finding updates the existing knowledge of how groups communicate in IM systems. For example, [26] in 1998 reported 156.7 words per thread, whereas in our study we found that this number differs across categories of Slack channels: e.g., Project channels have 32. Readers can refer to early CSCW publications (e.g., [26, 27]) while reading our findings.

Based on 4,300 data points and the 59 feature metrics we extracted from the meta information, we also built an ML model to automatically classify a group chat channel's category without looking at the actual message content. We suggest that HCI and NLP researchers who perform content analysis on group chat corpora first use our model to categorize their group chat channels before building any downstream applications. The current practice of treating all types of group channels equally and using one single setting for all categories may be misleading (e.g., [41] investigates how to generate summaries of group chat messages). Also, NLP researchers who build deep learning models with synthetic group chat datasets should consider the particular domain and context of their target audience, and refer to our Table 4 to construct a data corpus that follows a natural distribution.
For example, [34] builds a deep learning model to automatically extract threads from un-threaded group chat channels using a synthetic Reddit dataset that assumes a conversation thread consists of 10 to 100 messages, whereas our results show that on average a thread has 2 to 35 messages.

One particularly active research topic in IM systems today is AI-powered conversational agents [1, 23]. For example, [23] built and tested an HR bot to support the new employee onboarding process. These chatbots often rely on dialogue acts or manually crafted dialogue trees, but rarely consider the group conversation context in which the chatbot will be deployed. Using our findings, researchers and developers can also consider the contextual norms of human group chat channels, so that a chatbot can behave differently in different categories of channels. For example, a chatbot in an Event organization group may use far fewer threaded conversations than one in an IT Support group. In this way, an intelligent conversational system can better fit into a human-human communication group.

We also envision that the group communication feature metrics could be used to predict team performance in the future. Though not every feature is significant in the regression model, the goodness of fit of the full model (R² = 0.61) hints at the promise of this line of research. If we can build a dashboard or a system that actively tracks the group conversations in a project team's channels, we may be able to provide a real-time meter of the project team's performance and detect early signs of project failure.

Our study has a couple of limitations. First, the context is an R&D department of a multinational IT company, and we use a publication as a proxy for success in this study. It is important to note that success in projects can be defined in other ways (e.g., patents or product impact).
However, within the context of this time-bounded re-organization in the R&D department, it is sufficient to use publications as a proxy for success. Thus, readers should be warned that some of the results from this study, such as which communication styles are associated with higher group performance, may not generalize to other contexts. Second, the pre-trained machine learning model for category identification may not generalize well to group chat datasets other than Slack, and the model will need to be retrained on a new dataset to fit a different feature distribution.

In this short note, we provide a comprehensive set of three analyses for understanding communication styles in Slack group chat channels in today's workplace settings. We first manually coded group chat channels into 9 different categories. Then, based on a communication style metric with 21 metadata features, we built a machine learning model that can automatically categorize group chat channels. Finally, we illustrated that these features could be used to unveil the relation between communication styles and the success of a project team.
[1] Resilient chatbots: Repair strategy preferences for conversational breakdowns
[2] Why Most Chatbots Fail
[3] The adoption and use of 'Babble': A field study of chat in the workplace
[4] My Team Will Go On: Differentiating High and Low Viability Teams through Team Interaction
[5] Software-related Slack chats with disentangled conversations
[6] Exploratory study of Slack Q&A chats as a mining source for software engineering tools
[7] Productivity loss in brainstorming groups: Toward the solution of a riddle
[8] Awareness and coordination in shared workspaces
[9] Disentangling chat
[10] Socially translucent systems: social proxies, persistent conversation, and the design of "Babble"
[11] Utilization of Self-Diagnosis Health Chatbots in Real-World Settings: Case Study
[12] Coordination, overload and team performance: effects of team communication strategies
[13] Extremely randomized trees
[14] Streaming on Twitch: fostering participatory communities of play within live mixed media
[15] Towards an open-domain conversational system fully based on natural language processing
[16] Out of sight, out of sync: Understanding conflict in distributed teams
[17] Learning knowledge graphs for question answering through conversational dialog
[18] Effects of champion behavior, team potency, and external communication activities on predicting team performance
[19] Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking
[20] Social loafing: A meta-analytic review and theoretical integration
[21] Algorithmic stability and sanity-check bounds for leave-one-out cross-validation
[22] Applying social psychological theory to the problems of group work. HCI models, theories and frameworks: Toward a multidisciplinary science
[23] All work and no play
[24] Why developers are slacking off: Understanding how software teams use Slack
[25] Hierarchical conversation structure prediction in multi-party chat
[26] Identifying and analyzing multiple threads in computer-mediated and face-to-face conversations
[27] Interaction and outeraction: instant messaging in action
[28] Distance matters
[29] How people write together now: Beginning the investigation with advanced undergraduates in a project course
[30] Face Value? Exploring the effects of embodiment for a group facilitation agent
[31] The Business Value of Slack
[32] Understanding coordination in global software engineering: A mixed-methods study on the use of meetings and Slack
[33] Slack me if you can! Using enterprise social networking tools in virtual agile teams
[34] Context-Aware Conversation Thread Detection in Multi-Party Chat
[35] From Human-Human Collaboration to Human-AI Collaboration: Designing AI Systems That Can Work Together with People
[36] Organizational Distance Also Matters: A Case Study of Distributed Research Teams and their Paper Productivity
[37] CASS: Towards Building a Social-Support Chatbot for Online Health Community
[38] Did it have to end this way? Understanding the consistency of team fracture
[39] Feature selection and analysis on correlated gas sensor data with recursive feature elimination
[40] Docchat: An information retrieval approach for chatbot engines using unstructured documents
[41] Making sense of group chat through collaborative tagging and summarization
[42] The I in Team: Mining Personal Social Interaction Routine with Topic Models from Long-Term Team Data
[43] The Design and Implementation of XiaoIce

We thank all the reviewers for their valuable revision comments. We particularly thank Stacy F Hobson and Talia Gershon for their help and support.