key: cord-0277858-uhrxik1r authors: An, Pengcheng; Zhou, Ziqi; Liu, Qing; Yin, Yifei; Du, Linghao; Huang, Da-Yuan; Zhao, Jian title: VibEmoji: Exploring User-authoring Multi-modal Emoticons in Social Communication date: 2021-12-27 journal: nan DOI: 10.1145/3491102.3501940 sha: 5aa639083e98513e8a08cd32ca0358ef98434093 doc_id: 277858 cord_uid: uhrxik1r Emoticons are indispensable in online communications. With users' growing needs for more customized and expressive emoticons, recent messaging applications begin to support (limited) multi-modal emoticons: e.g., enhancing emoticons with animations or vibrotactile feedback. However, little empirical knowledge has been accumulated concerning how people create, share and experience multi-modal emoticons in everyday communication, and how to better support them through design. To tackle this, we developed VibEmoji, a user-authoring multi-modal emoticon interface for mobile messaging. Extending existing designs, VibEmoji grants users greater flexibility to combine various emoticons, vibrations, and animations on-the-fly, and offers non-aggressive recommendations based on these components' emotional relevance. Using VibEmoji as a probe, we conducted a four-week field study with 20 participants, to gain new understandings from in-the-wild usage and experience, and extract implications for design. We thereby contribute both a novel system and various insights for supporting users' creation and communication of multi-modal emoticons. The use of emoticons has become ubiquitous, cross-cultural, and increasingly essential in online social communications. As a universal pictogram or visual language, emoticons facilitate communications within and across different linguistic and cultural backgrounds [11] . Moreover, they enable people to express themselves concretely, conveying emotions, feelings, or non-verbal reactions that cannot be easily articulated by words. As a result, over 20% of tweets now include an emoji [6] , and five billion emojis are sent everyday on Facebook Messenger [7] . Along with the fast-growing popularity of emoticons, new emoticons still continue being demanded by users for various communication needs [17] . Meanwhile, the classic, static emoticons seem to no longer fully suffice people's various needs for expression. Users continue seeking further enriched, more expressive, and customized ways to communicate via emoticons. This trend could be well demonstrated by the recent emergence of multi-modal emoticons across major social applications: i.e., emoticons augmented by multi-modal effects such as animations or vibrotactile feedback. For example, Apple users could customize the appearance of emoticons (Memoji [2] ) and record animations for stickers (Animoji [49] ). Google Gboard [18] allows users' mashups of two emojis. Tencent WeChat [47] and Huawei MeeTime [24] enable both animation and haptic effects of specific emoticons. The rapid emergence of multi-modal emoticons reflects users' desires for a more customized, multi-modal experience in social interaction. Despite the burgeoning of multi-modal emoticons in consumer products, the HCI research field has accumulated little knowledge about how people create, share and experience multi-modal emoticons in daily communications and how to better support them by design. The exploration presented in this paper sets out to tackle this under-explored opportunity. 
Namely, we build and evaluate VibEmoji, a user-authoring multi-modal emoticon system that enables users to effortlessly author and communicate via multi-modal emoticons during everyday messaging (Figure 1). We employ VibEmoji as both a design to study about and a technology probe to study with, generating new empirical knowledge about the usage and user experiences of multi-modal emoticons, and extracting relevant implications for future research and design in this domain. Given these explorative research purposes, VibEmoji is designed with several features that extend beyond, or differentiate from, existing design cases. First, in current systems (e.g., MeeTime, Telegram, and iMessage), the emoticon sticker, animation effect, and vibrotactile feedback are often fixed combinations; users cannot re-combine or appropriate these elements to create new meanings. By contrast, in VibEmoji, users are allowed to freely select from and combine three rich sets of elements: pictograph emoticons, animation effects, and vibrotactile patterns (Figure 1). This is to better understand users' creative usage in social communications [53], and to support them in creating more nuanced, personalized expressions in a conversation. Second, emoticon recommendation in current systems often tends to automate users' selection (e.g., typing "happy" and a corresponding emoji will be suggested), which could result in irrelevant suggestions due to misinterpreting the user's intent. Differently, VibEmoji explores a non-aggressive recommendation strategy that eases users' authoring without limiting their options. Each time the user selects the first element (e.g., an emoticon), VibEmoji predicts the most likely elements to be selected in the remaining two categories (e.g., vibrations and animations), and updates the display order to prioritize these elements on the interface. The prediction is based on two sources of information: (1) emotional relevance between the selectable elements (based on the valence-arousal model [34, 41]), and (2) the frequency of combinations authored by the user. Third, current user-customization of multi-modal emoticons is often done in advance, separated from the moment of use. For instance, users need to pre-define the customizable parameters (e.g., Memoji), or pre-record an animation (e.g., Animoji), so that they can use them later. Differently, we design VibEmoji with the intention to support user-authoring on the fly, during a conversation, so that we could explore how users' creation of multi-modal emoticons can be improvised, based on their real-time feelings and context. To empower the design and implementation of VibEmoji, two questionnaire surveys were conducted to gather the perceived emotional properties of 15 animation effects (52 respondents) and 60 vibrotactile patterns (52 respondents). This is because adequate datasets remain to be established for vibrations and animations, besides the emotional properties of emoticons collected in [40]. For VibEmoji, we developed a mobile application and a backend system, which were deployed in a field evaluation with 20 participants in 10 pairs, over a period of four weeks. Each pair of participants consists of friends or partners who have pre-existing routines of mobile messaging. A mixed-method approach was adopted to gather both quantitative and qualitative data to uncover empirical knowledge about how the participants created, used, and experienced multi-modal emoticons in their daily communications.
Based on our exploration, we generalize a set of design implications for user-authoring multi-modal emoticon systems, to inform future design and research. This work thereby makes twofold contributions: (1) a novel mobile system that supports users' authoring and usage of multi-modal emoticons in online communication; (2) empirical knowledge about how people create, share and experience multi-modal emoticons in daily communication, and relevant implications for future designers and researchers. While the most famous early example of emoticons, :-), was proposed by Fahlman [16] in his email to colleagues in 1982, the earliest use of emoticons in computing systems dates back to the 1970s (the PLATO system [5]). While the terms emoticon and emoji are often used interchangeably (as this paper also does), the name emoji is originally rooted in the Japanese mobile market (first supported in a product by J-Phone in 1997), and literally translates as "picture letters/characters." As a universal, powerful pictographic language, emoticons/emojis have been widely used to concretely and conveniently convey emotional feelings that would otherwise take more words to articulate. For this reason, a great number of related studies analyzed the usage of emoticons among relatively large-scale samples to understand people's emotions in various contexts: e.g., to detect software developers' emotions in communication [10], to collect students' emotional states in learning [55], or to understand people's political attitudes [21], to name but a few. Besides analyzing affective patterns that can be commonly extracted from people's emoticon usage, another body of research focuses on the contextual and personal aspects, to uncover how people assign new meanings to emoticons: e.g., repurposing or appropriating emojis beyond their originally intended meanings [27, 53], expanding nonverbal expressions to reduce the dependency on text [59], or developing highly customized usage [20]. Studies encompassing emoticons and emotional communication often used theories and models of emotion as their basis for measures and analyses. Overall, two types of theories were most frequently referred to: discrete emotion theories, which generalize emotions into discrete categories [12], and dimensional emotion theories, which model emotions in continuous dimensions: e.g., the valence-arousal model proposed by Russell and Barrett [41]. For instance, Rodrigues et al. established the Lisbon Emoji and Emoticon Dataset (LEED) [40], in which they collected 505 participants' perceptions of 238 emoticons, including the dimensions of valence (positive-negative) and arousal (arousing-calm). These emotional properties of emoticons could inform both the analysis of emoji usage and the development of emotional interfaces. LEED was also used in this study in developing a recommendation algorithm for multi-modal emoticons. More importantly, current social applications have started to support (yet limited) multi-modal features in emoticons (see Table 1). These emerging features of popular social applications suggest that emoticons are evolving from simple, static visual pictograms into more dynamic compounds of multi-modal elements. Meanwhile, we could expect such multi-modal emoticons to afford more customization space for users to create richer and more engaging ways of expressing themselves and empathizing with others.
However, as Table 1 summarizes, current social applications have not fully leveraged the potential of user-customization in multi-modal emoticons, leaving several meaningful opportunities for new design explorations. First, while animation and haptic feedback are increasingly incorporated, current systems only enable pre-designed, fixed combinations of animation and vibration, either applied on specific emoticons (e.g., WeChat [47], MeeTime [24]), or provided as rigid options (e.g., iMessage Effects). None of the systems allow users to freely combine different animations, vibrotactile patterns, and emoticons to create new meanings. Second, current systems (e.g., Memoji [2], Gboard [18], WeChat) have only explored the recommendation of emojis based on users' text input. None of the existing cases has explored how to recommend multi-modal combinations for users. Third, few systems have been designed to facilitate users' customization on the fly. Memoji and Animoji, for example, both require users to pre-define the avatar or pre-record animations to prepare customized stickers, instead of spontaneously creating and sending new multi-modal emoticons according to the unfolding conversation. Although Gboard supports users' mash-up of two emojis during a conversation, it does not enable on-the-fly modification of animations or haptic feedback. The above opportunities have underlain the core design features of the VibEmoji system, which (1) supports users in freely combining various emoticons, vibrotactile patterns, and animations as multi-modal emoticons, (2) provides appropriate recommendations about relevant multi-modal elements, and (3) enables users' on-the-fly customization during a conversation. By embodying these design features that extend beyond existing cases, and evaluating the design in the wild, we aim to extract new empirical knowledge that could advance the development and user experience of multi-modal emoticon systems. To the best of our knowledge, prior work has rarely explored supporting users in authoring and communicating via multi-modal emoticons. However, rich novel interfaces have been created and studied in the HCI community to facilitate users' communications via (visual) emoticons or multi-modal signals (e.g., haptic experience). These design cases have served as inspirations for our exploration. A stream of research has focused on exploring new ways for users to communicate via (pictorial) emoticons or emojis. Opico [28] enables emoji-first communication: users could respond to each other using sequences of emojis, expressing feelings or simple concepts without text. MojiBoard [1] is a keyboard that eases users' entry of parametric emojis: a series of emojis that can convey emphasis or micro-stories. A number of cases targeted fully automating users' process of selecting or sending emojis. ReactionBot [33] captures users' facial expressions and accordingly adds emojis to their text messages on Slack. Another face-to-emoji idea was explored in [13]. Other explorations concerned automating the selection of emojis/emoticons based on emotion keywords [50], sentences [30], or speech signals [23]. Voicemoji [56] explored voice-based emoji entry for visually impaired users. A few other studies also tackled challenges related to the accessibility and inclusiveness of emoticons [31, 48].
Increased attention has been paid to user customization of emoticons [19], for example, in generating new emojis based on users' sketch and text input [35], or allowing two intimate users to co-customize their emoticon shortcuts (DearBoard) [20]. A few studies suggested the promise of integrating animated representations to enlarge the communication capacity of (static) emoticons. For instance, animated GIFs were found to afford rich interpretation and nuances in nonverbal communication [26]. However, in GIFs/short videos, the image and motion effects are fixed combinations, whereas in this study, we aim to enable users to freely combine static emoticons with different motion effects. Relatedly, the AniSAM study suggested that adding animated representations to static icons could more effectively visualize emotional states (e.g., arousal) [45]. This suggests that adding animated effects might enrich the nuances and expressiveness of static emoticons. Harrison et al. [22] proposed Kineticons, a rich library of kinetic behaviors that can be applied on static GUI elements such as icons, to enable extra communication affordances. Our study utilizes Kineticons to establish a versatile set of animation components for multi-modal emoticon authoring. Touch, or haptic experience, is an essential type of non-verbal cue for social communications. Besides the abundant haptic studies on supporting users to understand, monitor and operate technologies (e.g., [8, 14, 54]), prior work has extensively explored technology-mediated haptic experience in interpersonal communications. For instance, vibrotactile feedback was leveraged in both mobile and wearable devices for affective communication at a distance, by resembling the feeling of touch, cheek-poking, or handshaking between users: e.g., [4, 37, 38, 52, 58]. A recent case explored supporting mediated touch in face-to-face settings without breaking social distancing rules during COVID [57]. Several studies have also encompassed supporting haptic communication in text messaging [25, 36]. In addition to end-user interfaces, related research has contributed theories [29], tools [9, 44], or evaluation techniques [42] of haptic experience to better support related design and research. For example, VibViz [44] offers a large, diverse library of vibrotactile stimuli for haptic feedback design, based on which we developed the vibrotactile library for VibEmoji. In summary, the prior studies illustrated the value of the haptic modality in enabling affective communication and social connectedness [36, 43, 52]; and a few specifically suggested benefits of combining haptic signals with visual cues [58]. These conclusions supported our decision to leverage haptic feedback as one of the major components of multi-modal emoticons. In this section, we present the design considerations, a usage scenario, and detailed features of the VibEmoji interface. The design rationales behind VibEmoji are underlain by our explorative research objectives: to extract knowledge about how people create, share and experience multi-modal emoticons in daily communication, and how to better design user-authoring multi-modal emoticon systems. Therefore, as addressed in Section 2, the core features of VibEmoji are designed to extend beyond, or differentiate from, existing design cases, to better facilitate the generation of new understandings in this domain.
Here we briefly consolidate our design considerations for these core features, before going into the details of the interface design:

D1: Enabling users to freely combine multi-modal elements. Existing systems that support multi-modal emoticons only provide a limited number of fixed vibration-animation effects for users to customize their emoticons. To probe how to support deeper customization and richer expressions by users, we design VibEmoji as an open-ended authoring interface that allows for freely combining various emoticons, vibrotactile patterns, and animation effects.

D2: Providing recommendations without limiting users' options. Current emoticon systems have explored automating users' selection by recommending an emoticon based on their textual inputs (keywords). However, to enable users' nuanced, differing expressions, VibEmoji explores an alternative recommendation approach that facilitates, rather than fully automates, users' choices. Instead of recommending the best solution, the interface prioritizes the most relevant multi-modal options based on the user's currently selected element, while still keeping all other options available. This supports both the user's efficient authoring of classic combinations and open exploration of new expressions.

D3: Supporting on-the-fly authoring of multi-modal emoticons. The user-customization process of existing emoticon systems is often carried out separately from the actual use during online chatting. Users often need to pre-configure emoticon features or dynamic effects in order to use them later. VibEmoji is designed to support users' on-the-spot authoring of multi-modal emoticons during online communication. It streamlines users' authoring process into four steps: selecting the three multi-modal elements and pressing the "send" button (see Figure 1 (b)). With VibEmoji, we are able to probe how users may create ad-hoc, improvised expressions based on the unfolding conversation.

Here we present an example of how users could use the VibEmoji mobile application to combine emoticons, vibrotactile patterns, and animation effects to create multi-modal emoticons on the fly, during communication. Anna and Brad are chatting online. Brad has said something really funny which made Anna burst into laughter. Anna wants to fully express how much she enjoyed it, so she decides to send a "laughing tears" emoji and enhance its expression with animation and haptic effects. She unfolds the VibEmoji keyboard, and selects that emoji, as well as a vibrotactile pattern that feels like body shaking, and an animation of bouncing up and down. Anna is satisfied with this combination, so she presses the "send" button, and this emoticon, enhanced by the animation effect and vibrotactile pattern, is sent to Brad's device. Brad receives it and knows that Anna finds what he said funny. He wants to let Anna know that he feels the same and is still laughing about it. So he presses that emoticon from Anna, and its haptic feedback is rendered again with the animation on both of their devices. Anna feels this response from Brad.

As Figure 1 (a) shows, the VibEmoji interface looks similar to an emoji keyboard, and can be easily unfolded during messaging by pressing a small icon next to the text-input field. But different from a conventional emoji keyboard, which only shows emoticons, the VibEmoji keyboard has three segments to display three types of multi-modal elements: emoticons, vibrotactile patterns, and animation effects (D1).
As Figure 1 (b) shows, each segment could be scrolled horizontally to browse its all elements. Users could make selections from the three sets of elements regardless of the selection order (D3), e.g., they could start from an emoticon like Anna did in the scenario, but they could also start with selecting a vibrotactile pattern or animation effect first. Users could press an element to select it, and it will be highlighted on the interface (Figure 1 (b) ). When pressed again, that element will be deselected. In the animation segment, all animation effects loop continually so that users could easily preview them. When no emoticon element is selected, the animation previews will be applied on a default neutral icon; otherwise the animation previews will be rendered using the currently selected emoticon. In the haptic pattern segment, each vibrotactile element is displayed by a thumbnail preview that visualizes its waveform (based on its intensity and sharpness parameters) in a simplistic manner. These thumbnail previews are designed to help users quickly see the general trend and length of a vibrotactile pattern. Each time when selected, a vibrotactile pattern will be rendered by the haptic module of the mobile phone. If there is a selected animation, the vibration will be rendered in synchrony with the animation. This helps users directly preview what their current combination feels like and eases their quick experimentation with different options. To further assist users' seamless creation on the spot, VibEmoji uses an unobtrusive recommendation technique to predict and prioritize the options that are more relevant to the user's current selection (D2). Namely, each time when the user makes the first selection, be it an emoticon, vibrotactile pattern, or animation effect, the display order of the elements from the other two segments will be updated accordingly. The elements that are predicted as more likely to be combined with the current selection, will be put forward on the display, so that they can be more easily found. This recommendation follows two-fold principles. First, elements that were frequently combined with the current selection and sent by the user in the past, will be prioritized by the display. On top of that, elements that have more similar perceived emotional properties to the selection (based on their scores in the valence and arousal dimensions [41] ), will have higher priority on the display. For example, assume that the user's first selection is an emoticon scored relatively high in both valence and arousal, i.e., perceived to convey both "positive" and "exciting" emotions. Then, the elements in the vibrotactile and animation segments which convey similarly positive and exciting feelings will be given higher priority on the display, to better match the present communication intention of the user. This way, VibEmoji offers a non-aggressive recommendation approach: on the one hand, it eases users' selection by prioritizing options that are more often used or have more emotional relevance to the preceding selection; on the other hand, it still guarantees users abundant space to explore and experiment with new multi-modal combinations. The input data and algorithm of the recommendation technique will be detailed in the next section. 
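To make the selection-and-reordering flow described above more concrete, below is a minimal sketch in TypeScript (mirroring the project's React Native stack). All names and data structures are illustrative assumptions rather than the actual VibEmoji codebase; the ranking function passed in here is the one detailed in the next section.

```typescript
// Hypothetical data model for the three element sets and the authoring state.
type Modality = "sticker" | "animation" | "vibration";

interface Element {
  id: string;
  modality: Modality;
  valence: number; // mean perceived valence from the survey data
  arousal: number; // mean perceived arousal
}

// Elements picked so far, in any order; an emoticon is complete once all
// three modalities are selected and the "send" button is pressed.
type Selection = Partial<Record<Modality, Element>>;

// Re-sort the un-selected segments each time the selection changes.
function reorderSegments(
  selection: Selection,
  segments: Record<Modality, Element[]>,
  rank: (candidate: Element, selected: Element[]) => number // see next section
): Record<Modality, Element[]> {
  const selected = Object.values(selection).filter((e): e is Element => e !== undefined);
  if (selected.length === 0) return segments; // keep the fixed default order
  const reordered = { ...segments };
  for (const m of Object.keys(segments) as Modality[]) {
    if (!selection[m]) {
      // Higher-ranked elements are placed first; all elements stay reachable by scrolling.
      reordered[m] = [...segments[m]].sort((a, b) => rank(b, selected) - rank(a, selected));
    }
  }
  return reordered;
}
```

Note that this reordering only changes display priority; it never removes options, which is what keeps the recommendation non-aggressive.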
In this section, we report how VibEmoji is developed to support the aforementioned design features, including the preparation of the multi-modal elements and the data collection of their emotional qualities, as well as the scheme and implementation of the system. Each VibEmoji emoticon can be authored by a user by combining three multi-modal elements: a (static) sticker, an animation effect, and a vibrotactile pattern (D1). These sticker, animation, and vibration sets are intended to be open-ended, meaning that users could add new elements to expand each set (e.g., a user could add new stickers to the sticker set). We have prepared rich default elements for the three sets (50 stickers, 15 animations, and 60 vibrations), which makes a great number of combinations possible for creating multi-modal emoticons (also see the diverse examples gathered from the field deployment discussed in Section 6.2.1).

Stickers: The default set consists of 50 Apple emojis (iOS Version 10.0). We chose Apple emojis as our default stickers due to their frequent usage among mobile users. The chosen 50 stickers (see Figure 1) were selected based on an emoji usage survey by the Unicode Consortium 1, which classified all the Unicode emojis based on usage frequency. We chose the frequently used facial emoji stickers (e.g., not objects, stars, etc.). This set of emoticons has also been targeted in Rodrigues et al.'s survey [40], due to their usefulness and frequent usage for expressing a variety of emotions.

Animations: The default set consists of 15 animation effects from the work on Kineticons [22]. It provides a set of diverse, open-ended and multi-purpose kinetic effects to be combined with different graphical elements (e.g., icons) to support a wide range of communication purposes [22]. Hence, these animations are not strongly bound to certain semantic meanings or connotations. This open-endedness is in accordance with our design, which is to combine the animations with different stickers to afford rich expressions. Of the original 39 Kineticons, we excluded the animation effects involving two objects, which are not applicable in our case. We also filtered out those that could not be appropriately applied on circular emoji stickers, e.g., because they were designed for rectangular icons or menus.

Vibrations: The 60 default vibrotactile patterns were chosen from the work on VibViz [44], a diverse vibration library intended for designing information conveyed through the haptic channel of digital devices. VibViz was also chosen because of the potential for its patterns to be combined with different stickers and animations to convey various meanings. Of the original 120 vibrations, we excluded the ones that were longer than 10 seconds, because they are not suitable for short, emoticon-based instant-messaging scenarios. We further filtered out vibrations that had less emotional relevance to the default set of stickers in terms of valence and arousal (further explained in Section 4.2.2).

As mentioned, our design rationale is to grant great freedom for user authoring while keeping the authoring process intuitive and effortless (D1 and D3). To achieve this, we use a recommendation algorithm that predicts the most likely combinations based on the selected element, and accordingly updates the display order of the un-selected sets, to ease users' choices.
Such prediction is based on two kinds of data: the perceived emotional properties of each element (i.e., valence and arousal [41]), and the user's historical usage data. As the prerequisite of the recommendation algorithm, the datasets of perceived emotional properties need to be established beforehand. For the sticker set, the perceived emotional properties of the core collection of Apple emojis have already been measured by Rodrigues et al. [40] (with a sample size of 505), and this measured collection covers the frequently used emojis we chose based on the Unicode survey. We therefore utilized their open-sourced data for the algorithm. However, for the animations and vibrations, adequate datasets remain to be established. We thereby conducted questionnaire surveys to gather the data.

A web-based questionnaire survey was conducted with 52 respondents to gather the perceived emotional properties of the 15 Kineticon animations. Kineticons were designed with a rich vocabulary of kinetic behaviors to convey diverse intentions and meanings [22]. However, no prior work has assessed what emotional feelings these animations could trigger in users, which is crucial for emotion-based recommendation. The questionnaire was specifically developed for this data collection using JavaScript and Node.js; it renders each animation effect on a separate page and asks respondents to use 7-point Likert scales to rate the valence and arousal of each animation (Supplementary Material A). The animations are rendered using the neutral default icon used in the original user study of Kineticons [22]. The order of the animations was randomized for each respondent. The respondents were recruited from the Amazon Mechanical Turk (MTurk) platform (aged from 25 to 54; 27 males, 24 females, and 1 other). Each survey session took 10-20 minutes per respondent. The compensation for each respondent was $2.5. One invalid response was excluded due to incompleteness. In the end, 52 responses were used (each of the 15 animations was rated by 52 respondents) as the prerequisite data for the animation set (see Figure 2).

Vibrations: The original VibViz study results consist of the valence and arousal scores of each vibrotactile pattern rated by three researchers [44]. However, no survey had been conducted so far to collect more data about users' perceived emotional properties of the VibViz patterns, so this became part of our development tasks. Besides excluding long vibrations (>10s), we filtered out vibrations that were relatively far from the emoticon stickers of our sticker set in the valence-arousal space. These more distant vibrations had less emotional relevance to the chosen emoticon stickers, and were therefore deemed less likely to be combined with the chosen stickers. For each sticker in our sticker set, we marked the 5 closest vibrations in the valence-arousal space, and accumulated the mark frequency for each vibration. In the end, there were 60 vibrations marked more than once, which formed our final vibration element set. A second questionnaire survey was conducted with 52 respondents for the emotional perception of the vibrotactile patterns. Because respondents need access to a device with a haptic engine to experience the vibrotactile patterns, we developed an ad-hoc survey application on the iOS platform using SwiftUI (Supplementary Material B).
To avoid each survey session getting too long and causing fatigue or boredom in respondents, we split the 60 vibrotactile patterns into two surveys (30 each). Each survey took 15-20 minutes per respondent. In the survey application, to rate each vibrotactile pattern, the respondent was asked to first press a button to experience the vibration before its Likert scales appeared. This button remained available on-screen throughout the rating of that pattern, so that the respondent could replay the vibration anytime when needed. Due to this device requirement, respondents were recruited only through word of mouth, and were required to have access to an iPhone 8 or above (to make sure the haptic engine works smoothly). The survey app was distributed to the respondents through both local builds and TestFlight 2. In the end, 52 qualified respondents (aged from 18 to 59; 35 males and 17 females) completed the survey, and each vibrotactile pattern was rated by 26 respondents (see Figure 2). The results are shown in Figure 3. Regarding added values, the most frequent response was more fun in the communication (65 times), followed by a few other answers. These responses inspired the questionnaire developed for our field study (see Section 5.2), to see whether similar opinions would emerge from lived experiences with multi-modal emoticons. In terms of meaningful scenarios, the most frequent option was real-time messaging via mobile phone (76 times). This confirms that mobile messaging is a meaningful scenario in which to start exploring multi-modal emoticons.

As demonstrated in Section 3.2, to ease the authoring process of multi-modal emoticons, VibEmoji offers an unobtrusive recommendation to predict and prioritize the multi-modal emoticon elements (i.e., stickers, animations, and vibrotactile patterns) that are more relevant to the user's current selection (D2). Initially, before any user selection, all three element sets have default display orders on the VibEmoji interface. This mimics the convention of most emoji keyboards, where the location of each emoji is fixed to facilitate users' consistent access based on their memory. Once a selection is made by the user, the undecided elements are ranked based on their association with the element selected by the user. Given an undecided element $e_u$ in one modality $M_u$, and a selected element $e_s$ in another modality $M_s$, the recommendation algorithm considers the following two essential factors to obtain a ranking score for $e_u$. First, elements that have more similar perceived emotional properties to the selected element are prioritized. This is because it is less appropriate to combine elements that are quite different in their emotional properties, which may result in expressing contradictory emotions in a single VibEmoji emoticon. Based on the gathered data described in Section 4.2, we simply use the Euclidean distances between elements in the valence-arousal 2D plane, like those in previous studies [40, 44]. Specifically, the perceived emotional similarity is defined as
$$\mathrm{sim}(e_u, e_s) = -\sqrt{(v_u - v_s)^2 + (a_u - a_s)^2},$$
where $v_u, a_u$ and $v_s, a_s$ represent the valence and arousal values of $e_u$ and $e_s$ (a smaller distance yields a higher similarity). Second, elements that were frequently combined with the current selection and used in the past are prioritized. This is based on the assumption that recently visited items will likely be revisited again, which is widely used in a range of recommendation systems such as Google Search, Amazon, etc. However, simply counting the frequency of $e_u$ and $e_s$ being used in combination overlooks the overall usage patterns.
We thus employ a TF-IDF approach [39] that is commonly used in information retrieval. The TF (term frequency) reflects the frequency of $e_u$ and $e_s$ being used together with respect to $e_s$ being used with all the elements in the modality $M_u$. That is,
$$\mathrm{tf}(e_u, e_s) = \frac{n(e_u, e_s)}{\sum_{e' \in M_u} n(e', e_s)},$$
where $n(e_u, e_s)$ is the number of occurrences that $e_u$ and $e_s$ are combined in VibEmoji emoticons by the user in the past. Further, the IDF (inverse document frequency) reduces the bias when the element $e_u$ is used too often with all the elements in the other modality $M_s$, which promotes the diversity of the recommendation. Thus,
$$\mathrm{idf}(e_u) = \log \frac{|M_s|}{1 + |\{\, e' \in M_s : n(e_u, e') > 0 \,\}|},$$
where $|M_s|$ is the number of elements in $M_s$. Therefore, the final TF-IDF score is $\mathrm{tf}(e_u, e_s) \cdot \mathrm{idf}(e_u)$. We then use a weighted score to compute the ranking score for $e_u$ with respect to $e_s$ by combining the two factors:
$$\mathrm{score}(e_u \mid e_s) = \alpha \cdot \mathrm{sim}(e_u, e_s) + \beta \cdot \mathrm{tf}(e_u, e_s)\,\mathrm{idf}(e_u),$$
where $\alpha$ and $\beta$ are weights. In our development, we set $\alpha = 0.6$ and $\beta = 0.4$ based on our empirical observation. We compute this ranking score for every element $e_u \in M_u$ and sort the scores in descending order to obtain the display order of the elements on the VibEmoji interface. When there is only one selected element (e.g., a sticker), we reorder the elements in the other two modalities (e.g., vibration and animation) based on the above ranking method. When there are two selected elements (e.g., a sticker and a vibration), we simply use the average of the two ranking scores to obtain the final score of each element in the third modality (e.g., animation).

In this section, we introduce how we built the VibEmoji system in detail. As shown in Figure 2, the system contains a front-end mobile application to support on-the-fly authoring of multi-modal emoticons, and a back-end server to perform data processing, messaging, and communication. The front-end application was implemented based on the React Native framework, because it can be easily adapted to other platforms in the future (e.g., Android), in addition to the Apple iOS platform this project has been built upon. To have fine-grained control over how the vibrotactile patterns are played on the device, we developed a custom React Native bridge to expose the native iOS haptic API. Our implementation enabled the app to play and pause haptic patterns defined in AHAP (Apple Haptic and Audio Pattern) files, which were identical to the ones used in the surveys. For the animations, we recreated the selected Kineticons (see Section 4.1) using react-native-reanimated. We used react-native-gifted-chat to implement the front-end messaging application, which provides a comparable look and feel to other messaging applications (see Figure 1). The back-end server was developed using Node.js, and is responsible for relaying messages between users and delivering notifications. To increase the robustness of the system, the server stores a queue of undelivered messages when the receiving party is offline, and re-sends these messages once the users are re-connected. Specifically, notifications on iOS are managed using APNs (Apple Push Notification service). We used Socket.IO for sending and receiving messages between the front-end and the back-end for reliable real-time communication. To enable sending and receiving VibEmoji emoticons, we created a custom plain-text encoding behind the rendered multi-modal emoticons. For each emoticon, the front-end sends the encoded string to the back-end, which relays the message to the receiving party. Then, the front-end on the receiving side re-renders the message based on the encoded string.
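As a concrete illustration of the ranking scheme described above (emotional similarity combined with TF-IDF usage frequency), below is a minimal sketch in TypeScript, mirroring the project's React Native/Node.js stack. The function names, data structures, and the +1 smoothing term in the IDF are our own assumptions consistent with the description, not the authors' actual implementation.

```typescript
interface RatedElement {
  id: string;
  valence: number; // mean perceived valence of the element
  arousal: number; // mean perceived arousal of the element
}

// n[uId][sId] = how many times the user has combined element u (undecided
// modality) with element s (selected modality) in past VibEmoji emoticons.
type CombinationCounts = Record<string, Record<string, number>>;

// Emotional similarity: negated Euclidean distance in the valence-arousal plane.
function emotionalSimilarity(u: RatedElement, s: RatedElement): number {
  return -Math.hypot(u.valence - s.valence, u.arousal - s.arousal);
}

// TF-IDF style usage score for candidate u given the selected element s.
function usageScore(
  u: RatedElement,
  s: RatedElement,
  undecidedSet: RatedElement[], // all elements of u's modality (M_u)
  selectedSetSize: number,      // |M_s|: number of elements in s's modality
  n: CombinationCounts
): number {
  const count = (a: string, b: string) => n[a]?.[b] ?? 0;
  const totalWithS = undecidedSet.reduce((sum, e) => sum + count(e.id, s.id), 0);
  const tf = totalWithS > 0 ? count(u.id, s.id) / totalWithS : 0;
  // Number of elements in M_s that u has ever been combined with.
  const df = Object.values(n[u.id] ?? {}).filter((c) => c > 0).length;
  const idf = Math.log(selectedSetSize / (1 + df)); // +1 avoids division by zero
  return tf * idf;
}

// Weighted combination; candidates are displayed in descending order of this score.
function rankingScore(
  u: RatedElement,
  s: RatedElement,
  undecidedSet: RatedElement[],
  selectedSetSize: number,
  n: CombinationCounts,
  alpha = 0.6,
  beta = 0.4
): number {
  return alpha * emotionalSimilarity(u, s) + beta * usageScore(u, s, undecidedSet, selectedSetSize, n);
}
```

When two elements have already been selected, the final score of a candidate in the third modality would simply be the average of its two ranking scores, as described above.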
Using VibEmoji as both a design to study about, and a technology probe to study with, we conducted a four-week field evaluation to gather empirical knowledge about how people create, share and experience multi-modal emoticons in daily online communications, and thereby surface relevant design implications for future research and development. Twenty participants (in 10 pairs) were recruited for a four-week field evaluation. The participants were aged 19-36 (md=29, iqr = 10.5), with 12 females and 8 males. They are residents from three countries: Canada (10 participants), China (8 participants), and the Netherlands (2 participants). Each pair of participants had already established a social relationship with each other; namely, they had known each other for 10 months to 22 years (md=4 years, iqr = 2). Participants are referred to as P# in the following text, and consecutive numbers indicate the pairs (e.g., P1 & P2; P3 & P4). Among the 10 pairs, nine of them were friends, and one pair was significant others (P15 and P16). More details about the participants could be found in Supplementary Material C. Participants were recruited via the word of mouth based on two basic requirements: (a) they already use an iPhone 8 or above as the primary mobile phone so that the built-in haptic engine ensures the same sensorial quality of the designed vibrotactile elements; and (b) they can be paired with another person that they already know and have regular communications via online applications. In the beginning of the study, each participant installed the VibEmoji mobile application on their mobile phone through Apple TestFlight. To ensure that they could experience the system in a naturalistic manner, they were asked to use the application in their daily lives to communicate with each other as they already did prior to the study. No structured sessions of chatting were required by the researchers. Neither were the participants asked to use the system exclusively: they were encouraged to use VibEmoji, while they could still use their existing online communication tools at the same time. It was reminded that some risky information might need to be avoided when using the application to chat, such as bank accounts or passwords. Some common and safe topics were provided as examples: including updates about each other's recent life, recent news from the public media, or planning future activities together. Meanwhile, it was made clear to the participants that these were just examples, and they could discuss any topics they desired. Each participant was compensated $15 after their participation. Given our highly explorative goal, we adopted a mix-method approach to gather both quantitative and qualitative data in various forms during the field evaluation period. The details are as follows. Unstructured feedback. During their whole period of participation, the participants were encouraged to share their experience and thoughts to the researchers anytime as desired, through multiple types of communication channels, such as voice calls, voice messages, or textual messages. Along with such feedback, they also used screenshots or screen recordings to share some usage examples of multi-modal emoticons that they found meaningful or interesting. These quick and unstructured feedback sessions were transcribed and annotated for analysis. System logs. Participants' usage logs stored in the back-end were partially used for generally understanding how they interacted with the interface. 
These retrieved system logs included the number of messages or VibEmoji's they had sent, as well as their operations on the VibEmoji interface: e.g., when and what they pressed on the interface while selecting the multi-modal elements to construct a VibEmoji. To be noticed, the textual contents of participants' chat history were not directly retrieved by researchers for analysis, unless the participants proactively shared a certain part of their chats in unstructured feedback or in final interviews. Questionnaire. Towards the end of their participation, each participant was provided with an online questionnaire via Google Forms, asking about their general experiences of using VibEmoji. In the questionnaire, we asked about their general styles of constructing multi-modal emoticons: e.g., in which order they made selections across the three types of multi-modal elements, and whether they had discovered several classic combinations and reused them in chat. Moreover, we asked about their general experience of VibEmoji in comparison with the conventional emoticons, in terms of conversation engagement, fun, expressiveness, etc. (see Figure 4 ). These data were meant to gather their high-level experience and general opinions which could serve as supplementary or triangulation to the in-depth interview data. Semi-structured interview. A semi-structured interview was conducted for each pair, to gather vivid, in-depth empirical data about how they used VibEmoji to create and share multi-modal emoticons, and what their detailed experiences were. Each interview took 30-45 minutes and was structured in three sections. • Section 1 (8 minutes): The pair was asked to describe their general experience of using VibEmoji. They were also asked to give more explanation about the general experiences they rated in the final questionnaire. • Section 2 (15 minutes): The pair was asked to provide concrete examples about the multi-modal emoticon they created and considered meaningful in the specific context. In addition, they were asked to reflect on their detailed workflows, such as how they selected different elements to construct new multi-modal emoticons, and how they experience the various design features. • Section 3 (7 minutes): To probe their latent needs about multimodal emoticon systems, the pair were asked to envision what important features the next version of VibEmoji should have. To probe future design opportunities, the pair was asked to envisage what future scenarios can benefit from multi-modal emoticons, in addition to instant messaging. In the end, an open discussion is encouraged for participants to share additional thoughts. All interviews were audio-recorded and transcribed verbatim for a thematic analysis along with the unstructured participant feedback data. 6 RESULTS AND ANALYSIS 6.1 Quantitative Results 6.1.1 Questionnaire Responses. RS1 to RS6 in Figure 4 show the 20 participants' overall experiences with the VibEmoji emoticons, in comparison with the static emoticons they had been using before (on a 7-point Likert scale). In general, they considered that the multi-modal emoticons had contributed to the engagement (RS1), fun (RS2), and expressivity (RS3) of the conversation, and helped them to express more accurate (RS4), and a wider range of feelings (RS5), and increased their mental closeness (RS6) to each other: with all medians residing on 5 ( Figure 4) . 
As shown by the detailed rating distribution in the figure, it seems that users gave higher ratings (i.e., 6 or 7) to multi-modal emoticons' contributions to the engagement, fun, expressivity, and mental closeness, in comparison with their contributions to the accuracy or the range of feelings. Moreover, VibEmoji affords great freedom for users to combine various multi-modal components. To generally probe whether such freedom of exploration was meaningful, the questionnaire asked about their general style of usage: i.e., RS7 and RS8 in Figure 4. As the figure shows, overall, the participants tended to try something different (variations of combinations) when authoring multi-modal emoticons in chatting (RS7: md=5.5). Meanwhile, they also tended to reuse a few combinations (RS8: md=5.5). This suggests that while being explorative in use, the participants had also formulated a few frequently used expressions (see examples in Section 6). The open-ended design of VibEmoji allows users to flexibly decide the order of selecting multi-modal elements. When asked about their habituated order, 19 participants preferred stickers as the first element to choose, while only one preferred starting from a vibration. Divergence emerged among the users in terms of the second element they preferred to select: while 13 participants would go for animations, six would select a vibration, and one would go for stickers. More analysis of the users' selection order and decision-making in the authoring process is discussed in Section 6.2.2. During the whole period of the field evaluation, the participants sent 1,824 textual messages and 581 multi-modal emoticons using VibEmoji: approximately every three text messages were accompanied by one VibEmoji emoticon. Per pair of participants, the median number of textual messages sent is 86 (iqr = 162, min=21, max=954), and for multi-modal emoticons, the median is 47.5 (iqr = 49, min=15, max=182). The logs also provided a rough insight into the time users spent authoring a multi-modal emoticon. We defined an interaction timeframe that started from the user's first operation on the authoring interface and ended at the moment of pressing the "send" button. The median of the interaction timeframes is 7.09 seconds (iqr = 13.01). A slight difference could be seen when comparing the timeframes of the first 10% of VibEmoji emoticons sent (md=9.6 seconds, iqr = 18.1) with the later 90% (md=7.1 seconds, iqr = 12.7) across all pairs. This may indicate a learning stage in the beginning, with users becoming faster after familiarizing themselves with the interface. Moreover, given that users had been actively exploring different variations throughout the study (RS7 in Figure 4), these timeframes should also reflect their time spent on exploration and experimentation. A thematic analysis was conducted to analyze the qualitative data gathered from the semi-structured interviews with the participants, as well as their unstructured feedback provided to us throughout the field study. This inductive analysis method was chosen due to our purpose of establishing a set of structured, systematic meanings (themes) [3] about how the participants created, shared, and experienced multi-modal emoticons in the wild with the VibEmoji system, in order to generate rich empirical knowledge for future research and design.
Our analysis followed the six-phase procedure detailed by Braun and Clarke [3], which included familiarization with the data by organizing notes and annotations across the dataset, generating initial codes based on research objectives, searching for themes by establishing connections among codes, reviewing and finalizing themes, and re-contextualizing the themes to formulate findings. In this section, we address these qualitative findings, in light of some quantitative results reported in the prior section. For instance, as Table 2 shows, P3 once asked P4 if she had gone to sleep already. P4 said not yet, but she would sleep soon, and then she sent a multi-modal emoticon with the sleepy face emoji. To enhance this emoji, she used an animation and a vibration that both could be associated with the behavior of sleeping: the animation spins slowly and slightly, and the vibration "felt like snoring." Both P3 and P4 thought that this combination was expressive and made the conversation more fun. The next example is from P15: when chatting with P16, she augmented the face blowing a kiss emoji with vibrations that felt like the sounds of kissing but had different durations. She further explained that she would use a shorter vibration to represent a light kiss, and a longer vibration for a long kiss. As another example, P7 sent P8 a VibEmoji based on the smiling face with heart-eyes emoji, to express the excitement of seeing a celebrity. To enhance the expression, she used an animation of rapid shrinking and expanding, and a vibration that felt like heartbeats. Scenario Type (ii): Creating New Meanings beyond Original Emoticons. In these examples, the participants flexibly combined the multi-modal elements to convey new meanings that were additional to, or different from, the original meaning of the used emoticons. Oftentimes, such new messages or expressions cannot simply be conveyed by the base emoticon alone. One example was reported by P12, in which he asked P11 to play an online game but did not get a response. He then sent a multi-modal emoticon combining an animation of rapidly waving from side to side with the grinning squinting face emoji, and a vibration that matched the animation. He selected this "vibrant" motion in order to "urge him [P11] to reply", whereas the emoji was not selected for specific reasons but just to afford a positive connotation. Another creative case was shared by P16 (Table 2). He appropriated the meaning of the zipper-mouth face emoji by combining it with a "naughty" and rapid motion, and an intense vibration, which both felt like "struggling," to express that he did not want to stop talking. As he explained, "This emoji [zipper-mouth face] originally means 'shut up', but [with the selected animation,] now it has become 'I don't want to shut up'. It then had extra meaning." More such examples were offered by the participants. For instance, P17 combined the hugging face emoji with a motion of bouncing up and down, to convey a "creepy" impression when joking with P18. P2 combined the grimacing face emoji with an animation and a vibration that felt like "trembling with cold" when discussing the weather with P1. The above examples demonstrate how users could creatively construct new expressions and assign new meanings in daily communications when they can freely combine multi-modal elements, which confirmed our D1. Scenario Type (iii): Setting Atmospheres for Unfolding Conversation.
Participants found they could construct multi-modal emoticons to (re-)set the atmosphere for a conversation session. In these examples, previously sent multi-modal emoticons, which remained on the chat display, continued contributing to the ambience of the communication and reminding conversation partners of certain vibes during a chat session. A concrete example was from P15. She had a particular multi-modal emoticon for starting a joke, which included the winking face with tongue emoji, an animation of zooming in and out, and an intense vibration (Table 2). As she described, "Every time I want to make fun of [P16], or joke with him, I will send this. It's like an opening, so that he won't get mad at me. It sets the tone for the chat [...] a reminder that I'm joking." P9 also mentioned that he used the multi-modal emoticons to create relaxing vibes: "I sent VibEmoji [emoticons] because I don't want the conversation to be formal or distant. I hope to have some humor, or closeness in the atmosphere." Such continuous rendering of a certain atmosphere was partially attributed to the animation components of the multi-modal emoticons, which were continuously visible to users as long as the emoticon was displayed. As P7 reflected, "the animation itself created a sort of atmosphere [...] It's playful [...] hard to describe, but it's contagious from person to person. We both [P7 and P8] could get it." Scenario Type (iv): Constructing Attentive, Empathetic Responses. This last type regards constructing multi-modal emoticons to formulate attentive, empathetic responses. In these examples, the participants created multi-modal emoticons as spontaneous responses to their partners in the unfolding conversation. As experienced by their conversation partners, these responses constructed on the spot could convey more attentive and empathetic feelings than responses via standard or preset emoticons. P8, for example, was once listening to P7 talking about her work. While listening, she also responded to P7 by sending two multi-modal emoticons that both included the thinking face emoji, but had different animations and vibrations, which all felt relatively slight and calm (Table 2). As she explained, compared with using a static emoticon, or sending the same multi-modal emoticon twice, the spontaneous variations she created on the fly could better convey the meaning "I am listening with care", since "it shows that I have received your messages and I give you different response each time, rather than simply copying [the same responses]." And this meaning was indeed conveyed to P7: "[P8] sent me the stickers of thinking, with slowly flipping or spinning [...] felt like she was still thinking and thinking about different things I said [...] She was expressing 'I am listening' and she did not want her expression to be interrupting [...] compared with static stickers, this response is more interactive, instead of repeating. It creates richer nuances and feels more friendly." Another example was from P3 and P4 discussing the surgery of P4's father. They both sent a VibEmoji emoticon built upon the same emoji. What made them feel empathy with each other was that they also used a very similar vibration to express sadness: "It felt nice that we used the same vibration [...] we shared the same feeling [...] we both thought this vibration could express sadness." Such empathetic experiences were also reported by other pairs.
For example, P1 and P2 appreciated the "tacit understanding" (P2) or the "sense of empathy" (P1) between them, when they used similar animation or vibration elements to express similar feelings in the conversation: "we could associate with same emotions [...] don't know why but we just can [...] it's not fully conscious." In similar cases, P5 also experienced "resonance" with P6. As put by P8, when responding to each other using multi-modal emoticons with certain similar components, "it feels like we are in sync [...] a sense of being connected." The above examples suggest the great value of multi-modal emoticons in enriching users' spontaneous non-verbal responses to each other, and enhancing their sense of connectedness and interactiveness in the conversation. As shown in the examples, these benefits could only be afforded when users are able to conveniently and flexibly construct multi-modal emoticons on the fly, during the unfolding conversation, which contextually confirmed D3. 6.2.2 Theme 2: Users' Authoring Process: How They Select and Interpret Multi-modal Elements. This theme elaborates on participants' process of creating multi-modal emoticons. Namely, we summarize patterns from participants' rich explanations concerning their preferred orders and strategies of making selections, as well as their (common and differentiated) interpretations of the multi-modal elements. To the best of our knowledge, no prior knowledge has been gathered regarding how users value and prioritize the different types of components when constructing a multi-modal emoticon. To probe a better understanding of this, we designed the VibEmoji interface to be open-ended: users could make selections across the three types of components in any order. Which Element to Select First (and Why): According to the users' reports (from both the interview and questionnaire data), a recurrent pattern in their selection order can be seen: in many cases, users would select an emoticon first, and then an animation and a vibration (although meaningful exceptions were also reported). Detailed accounts were provided by them. First, as agreed by the participants, emoticons often served as the base to set the tone for their multi-modal expression. As P5 stated, "emoticons decide the theme of the emotion to express", or, as P18 put it, "emojis are the major carrier of what I want to express." In contrast, animations and vibrations were often used as nuanced modifications or further enhancements to the base meaning afforded by the emoticons. As they reasoned, this could be because emoticons are more direct and explicit, while animations and vibrations are more open to interpretation: e.g., "when texting, I already associated my feeling with some emoji" (P2) but "when selecting from vibrations, I do not have a particular one in my mind" (P20). Compared with animations, vibrations are even less concrete, since they are "more subtle" (P19) and "not so visual" (P20). However, according to the participants, sometimes animations could make the primary contribution to the expressed meaning. For instance, P9 gave an example in which he first decided to use the animation moving left and right, which felt like "nudging him [P10] to wake him up", to ask for P10's response. Then he chose the emoticon only because it was fun. A few similar examples were also shared by other participants. In addition, P3 reported a different pattern: oftentimes, her second step was selecting a vibration, before looking for an animation.
An extra criterion mentioned by several participants was that they sometimes tended to select animations and vibrations that matched each other's temporal properties, such as rhythm or periodic duration. For example, P15: "I would make the vibration match the motion." [P16 added:] "it's like dubbing a movie." Vibrations were Associated with Everyday Sounds, while Animations were Associated with Body Language: Another noticeable pattern that emerged from the users' descriptions is how they used associations to explain their decision-making when selecting vibrations and animations. Namely, when reasoning about why they chose certain vibrations, they often associated their choices with certain everyday sounds; whereas, when explaining their choices of animations, they often used associations with body language. For instance, P3 explained that when constructing a multi-modal emoticon to convey greetings to P4, she selected a vibration that felt like the sound of "knocking on the door." P14 used a vibration to express a strong sad feeling because it felt like an "electrical noise." As another example, P15 explained that she chose a vibration that felt like the "tick-tock" sound of a clock, to match the slow circular animation she selected. P9 also mentioned an example in which his selected vibration reminded him of a funny background music that both he and P10 were familiar with. Other examples also included associations between vibrations and the sounds of snores, heartbeats, or human voices. Besides these, rich examples were gathered about how the animations were interpreted as body language. For example, P19 explained that she combined an animation of swinging left and right with a not-so-happy face to express disagreement, because it felt like "shaking head left and right, disagreeing." P17 added an animation of slight and quick shaking to the emoji grinning face with sweat to express the meaning of embarrassment, since the motion resembled the "face twitching when someone is embarrassed." As both P17 and P18 agreed, the animations could make the expressions "specific and concrete [...] more like adding body movements." P9 similarly confirmed that when looking for animations, he did take "body languages" into consideration, either consciously or subconsciously. Diverse Individual Interpretations: In addition to the common pattern of associating vibrations and animations respectively with sounds and bodily movements, individual differences also widely existed in the participants' interpretations. For instance, while P4 mentioned that she preferred to choose longer, more intense vibrations, which felt richer and more interesting to her, P20 thought that "long vibrations feel angry." For another example, while P12 considered that bigger motions could be used to represent happier feelings, P19 was concerned that bigger motions sometimes could appear too exaggerated and less honest. As P5 pointed out, such differences "could have something to do with [a user's] personality, as well as the topic of the conversation." These individual differences in users' decision-making again confirmed that it is meaningful to ensure a certain amount of open-endedness in design and allow users to freely combine the multi-modal elements based on their personal and contextual interpretations (D1). This theme summarizes users' experiences of and comments on the recommendation approach of the VibEmoji system, including the two-fold recommendation principles and the non-aggressive recommendation strategy.
Why the Two-fold Principles were Meaningful: The VibEmoji recommendation algorithm follows two-fold principles. One principle is to prioritize the multi-modal options that are more emotionally relevant (in terms of valence and arousal levels) to the user's selected element. The participants considered this principle to be intuitive and oftentimes meaningful. As confirmed by P15, "it's pretty accurate, and it's similar to what I thought [...] for active and happy emojis, I tend to add faster and bigger motions." P8 also thought that "stronger emotions match stronger vibrations, weaker emotions match weaker vibrations." They considered this principle to be especially helpful when users did not want to spend much time browsing the options. As stated by P19: "some users are lazy, like me. I would use the recommendation. The options on the first screen are already sufficient." The other principle is to prioritize the elements that have been more frequently combined with the current selection by the user in the past. The participants perceived this to be practical and efficient. For instance, to P17, "historical frequency is pretty helpful, it's quite often that I reused some combinations." And P3 found the frequency-based principle "pragmatic" and "smart": "it's personalized, so it's about my own feelings." In our design, the initial recommendations are based on emotional relevance, since no user data have been gathered yet. Over time, however, usage frequency shapes the interface more and more, so that it adapts to each user's personal preferences. This design decision was explicitly confirmed by the participants. As P5 pointed out, the emotional relevance could provide meaningful guidance to users especially when "in the beginning, no pattern [of use] has been formed", and after a certain time, it is reasonable to "meet each user's different needs, and be more personalized." A very similar comment was also given by P16. Advantages of Non-aggressive Recommendation: Following D2, the VibEmoji interface features a non-aggressive recommendation strategy that prioritizes the potentially relevant options, without automating users' decisions. In the interviews, the participants commented on this feature in comparison with some relatively more "assertive" recommendations they had experienced from similar systems, for example, an emoji keyboard that predicts the "best" result to automate users' selection based on their text input. Overall, the participants recognized two major advantages of the non-aggressive recommendation strategy over the more assertive ones in terms of multi-modal emoticons. First, the non-aggressive recommendation preserves users' personal expression. As P18 put it, "if everyone is using this [recommended option] then why do I have to use it? I want to express my own emotions [...] I won't send the same one each time. I could try something new." By prioritizing relevant options, users could be facilitated while still preserving sufficient variation for creating their own expressions. Second, an algorithm might not be successful all the time, due to its limited access to the whole conversation context. As P18 argued, a recommendation "might not always be precise in predicting what you really want to express." As telling examples of this, in Section 6.2.1, the participants reported a series of cases about how they combined multi-modal elements creatively to express meanings that are quite subtle, or different from the original meaning of the selected emojis. In this sense, a non-aggressive recommendation strategy could avoid giving arbitrary or irrelevant suggestions to users, while leaving space for users to make modifications. These experiences thereby contextually verify our D2.
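To make the two-fold ranking more concrete, below is a minimal sketch of how such a prioritization could be computed. It is an illustration rather than the deployed implementation: the data structures (an Element type carrying annotated valence and arousal values, and a per-user Counter of past co-selections) and the equal weighting of the two signals are assumptions made for the example.

```python
from dataclasses import dataclass
from collections import Counter
from math import hypot


@dataclass
class Element:
    """A selectable building block: an emoticon, animation, or vibration."""
    element_id: str
    valence: float  # annotated pleasantness, e.g., normalized to [-1, 1]
    arousal: float  # annotated intensity, e.g., normalized to [-1, 1]


def rank_candidates(selected: Element,
                    candidates: list[Element],
                    pair_counts: Counter,
                    w_frequency: float = 0.5) -> list[Element]:
    """Order the candidate elements of one remaining category for display.

    Combines (1) emotional relevance, i.e., closeness to the selected
    element in valence-arousal space, and (2) how often this user has
    previously combined each candidate with the selected element.
    The 50/50 weighting is an illustrative assumption.
    """
    max_count = max([pair_counts[(selected.element_id, c.element_id)]
                     for c in candidates] + [1])

    def score(c: Element) -> float:
        # Emotional relevance: inverse of the Euclidean distance in VA space.
        distance = hypot(selected.valence - c.valence,
                         selected.arousal - c.arousal)
        relevance = 1.0 / (1.0 + distance)
        # Personal history: normalized co-selection frequency for this user.
        frequency = pair_counts[(selected.element_id, c.element_id)] / max_count
        return (1.0 - w_frequency) * relevance + w_frequency * frequency

    # Higher-scoring elements are displayed earlier; none are hidden,
    # which keeps the recommendation non-aggressive.
    return sorted(candidates, key=score, reverse=True)
```

With an empty pair_counts, the ordering is driven purely by emotional relevance; as a user's authoring history accumulates, the frequency term increasingly personalizes the display order, matching the behavior the participants described above.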
Beyond Mobile Messaging. This theme summarizes participants' visions of desirable future scenarios or new application domains in which multi-modal emoticons (emoticons augmented by animation effects and vibrotactile patterns) could be meaningfully leveraged, in addition to the mobile messaging context. Interestingly, a number of their envisaged applications went beyond human-human communication. For instance, P11 envisaged that in an AI-enhanced smart home system, a voice agent (e.g., Cortana by Microsoft, or Alexa by Amazon) could also utilize multi-modal emoticons to communicate with human users: "apart from the voice interactions, if it could send me emojis with animation and vibrations [e.g., on mobile devices], it would feel different, it would feel more human, and richer information could be communicated." P3 also envisaged a smart home context, in which Internet-of-Things products could more expressively interact with users via multi-modal emoticons, and she thought a smartwatch could be a meaningful medium to receive vibrotactile signals. P4 from the same pair talked about a public setting: "In some shopping malls, there are now AI robots for searching foods and places. If these robots could display emoticons with animations and vibrations, they would be more attractive to younger users I guess." Moreover, P15 suggested using a multi-modal emoticon to summarize people's attitudes from internet reviews, e.g., of a restaurant: "instead of reading all the reviews, you could then first feel the general reaction of people who went to eat, from an emoji with animation and vibration." P7 and P16 similarly envisaged the usage of multi-modal emoticons in combination with the "barrage captions" in live-streaming or live videos (a type of dynamic caption created by viewers in real time that moves across the video, enabling momentary communication between the live-streamer and viewers or among viewers). As P16 stated, "current live barrages are mainly texts, if dynamic emoticons and vibrations are added, it might be more fun and interactive." In addition, both P5 and P12 envisaged that future authoring tools for multi-modal emoticons might support users in tapping the touchscreen to define new vibrotactile patterns, or drawing a trajectory to create a new animation, to further unleash users' creativity. This paper focuses on multi-modal emoticons: a burgeoning trend in recent social applications that enhances traditional static emoticons with animations or vibrotactile feedback. Although multi-modal emoticons are increasingly introduced in current social applications, little empirical knowledge has been accumulated concerning how people create, share, and experience them in daily communications. Addressing this opportunity, we have designed and evaluated the VibEmoji system to gather empirical understandings from real-world usage and experience, and surface design implications for supporting users' creativity and communication of multi-modal emoticons. As shown in our field deployment, VibEmoji enabled a user-driven exploration.
In various scenarios, the participants creatively combined stickers, animations, and vibrations to author multi-modal emoticons on the fly, which helped their spontaneous affective communication in the unfolding conversation. As indicated by the results, using multi-modal emoticons could make online chatting more fun, engaging, and expressive, increase users' sense of closeness, and potentially help them communicate more specific, or richer, emotional feelings. These experiences were further supported and concretized by the participants' detailed descriptions in the qualitative findings. Beyond presenting a novel design case and contextually confirming the benefits of multi-modal emoticons, the rich empirical data also enabled an in-depth analysis of the participants' creation and usage of multi-modal emoticons. We discuss the extracted underlying patterns and implications for future research and design below. Implication 1: Purposefully leveraging the extra affordance of multi-modal emoticons. The variety of examples collected in the findings illustrates that by freely combining multi-modal elements, users were not only able to enhance or augment the expression of static emoticons, but also to create new meanings that are additional to, or different from, the intended meaning of static emoticons. This suggests that multi-modal emoticons can offer extra communication affordance to existing emoticon systems, implying new research and design opportunities. Studies have been conducted to explore the idea of emoji affordance, i.e., how users could build upon the visual or contextual properties of an emoji to enable richer meanings [1, 28, 53]. However, little has been done to understand and leverage the communication affordance of multi-modal emoticons. While serving as an early exploration, our work has surfaced that multi-modal elements like animations and vibrotactile signals can facilitate users' communication beyond acting simply as subsidiary feedback to specific emoticons (which is how they have often been designed in current messaging apps). Instead, just as new meanings could emerge from the combination of multiple emoticons [28], our results suggest that new expressions could be created by combining static emoticons with different animations and vibrations. We thereby argue that future designs could purposefully help users leverage this new affordance to meaningfully expand their nonverbal communication vocabulary. To do so, a user-led approach is needed: researchers and designers could continually curate and analyze examples of users' creative expressions via multi-modal emoticons (e.g., Section 6.2.1), and intentionally facilitate similar use cases in design. Implication 2: Designing for immersive and empathetic experience in online nonverbal communication. As uncovered by our empirical data, another promising opportunity of multi-modal emoticons is that they could create a certain atmosphere for a chat session, or make users' responses feel more empathetic and attentive to each other. These new experiences in nonverbal communication were not supported by static emoticons. For instance, P15 always used a specific multi-modal emoticon to set the mood for joking, and P9 used multi-modal emoticons to render friendly, relaxing vibes. Such an immersive, "contagious" ambience was, to a large extent, enabled by the animation effect, which was continuously rendered as long as the emoticon was displayed on the chat screen.
Future designs could build upon this experiential quality and make online communication even more immersive: e.g., when a new multi-modal emoticon is sent, the ambient visual elements on the chat screen (e.g., chat bubbles or background) could react to it and adjust to the same atmosphere (e.g., a happy multi-modal emoticon would make the ambient elements look brighter), until the next emoticon shapes the ambience again. As reasoned by the participants, the attentive, empathetic responses were enabled partially because they could create small variations of animations or vibrations in the multi-modal emoticons, instead of sending repetitive or preset emoticons; responding to others with the same repeated static emoticon might feel impolite and inattentive. The empathetic experience was reported to be further enhanced when the two conversation partners used similar animations or vibrations to express similar feelings, which made them feel more "connected" and "in sync". Inspired by this, future designs could explore enhancing such empathetic experiences, e.g., by creating a resonance or echo effect on the screen when two users use similar multi-modal elements in a chat. Implication 3: Designing new animations and vibrotactile patterns based on body language and everyday sounds. The great creativity exhibited by our participants in authoring multi-modal emoticons was supported by the building blocks: the sets of animations and vibrations prepared upon the open libraries of Kineticons [22] and VibViz [44]. Despite being comprehensive and well-designed, these libraries were intended to support a wide range of design purposes in graphical or haptic interfaces, rather than multi-modal emoticons. For this reason, as mentioned in Section 4.1, not all of these libraries could be meaningfully used in VibEmoji. Hence, a meaningful question in this emerging domain is: how can we design new building blocks, such as libraries of animation effects or vibrotactile patterns, to better nourish the creativity of both designers and everyday users of multi-modal emoticons? Our findings from Section 6.2.2 could shed light on this question. Participants often associated their choice of vibrations with certain everyday sounds (e.g., knocking on the door, snoring, or tick-tock). Moreover, their choice of animations was often associated with bodily movements or body language (e.g., nodding, pushing, face-twitching). When there was a match between their communication intention and the association of an element, they would choose that element (e.g., P19 used a "nodding animation" to express agreement or hospitality; P3 used a "snore vibration" to express going to sleep). This interesting pattern suggests that new animations and vibrotactile signals could be designed to enable users' richer associations with these two categories. For instance, future work could design new vibration elements based on a typical set of everyday sound effects. Or, designers could refer to a taxonomy of (bodily) nonverbal signals [51] to design new animations that communicate more of users' intentions (e.g., resembling certain eye behaviors or arm gestures, in addition to the head poses or full-body motions that participants often mentioned). Implication 4: Balancing the design trade-off of automation versus autonomy. To support user-authoring of multi-modal emoticons, VibEmoji features a non-aggressive recommendation approach, which prioritizes relevant options for users without limiting their free experimentation.
As appreciated by the participants, such an approach leaves enough agency for them to create personalized expressions, and avoids giving arbitrary or irrelevant suggestions due to misinterpreting the user's intention or the context. However, we do recognize that a design tension exists concerning the trade-off between automation and autonomy. For instance, a number of prior works explored fully automating users' entry of emoticons based on facial expressions [13, 33] or speech [23]. While automation often means low effort and high efficiency, our study revealed that in certain social communication scenarios, users value the autonomy and richness of self-expression more than efficiency. As reported earlier, while appreciating the recommendation principles, the users still enjoyed being explorative and trying out different combinations. Therefore, an important consideration for future work is to better understand users' different needs for autonomy and automation in various social communication scenarios, and to carefully balance the two in design. Limitations. Our work is not without limitations. First, VibEmoji included the same 50 (frequently used) emojis for each user to ensure that they had the same building blocks to begin with. Although abundant examples of their creative usage have been found, users could have more freedom if they could import new stickers into the system. A future field study could be conducted to enable each pair to co-design their libraries, in the form of co-customization [20] for multi-modal emoticons. Second, participants in this field study were aged from 19 to 35; hence, how different age groups will use multi-modal emoticons, and to what extent the findings might be applicable beyond the current user segment, remain to be answered. It would be interesting in the future to involve older age groups or even elderly users, to understand their experiences. Additionally, the VibEmoji user interface might be further polished: its size could be made a bit smaller to display more historical chats above, and a search bar could be added to enable element selection by keyword. Lastly, VibEmoji currently employs a recommendation algorithm that only considers the usage frequency and patterns of one single user (i.e., the benefiting user). With more data collected, advanced recommendation techniques, such as collaborative filtering [32], could be applied by mining common patterns across all users and suggesting multi-modal emoticon combinations frequently used by the crowd of users similar to the benefiting user.
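As one illustration of this future direction, below is a minimal sketch of a user-based collaborative-filtering step under assumed data structures (a per-user Counter of authored combinations, keyed by emoticon-animation-vibration identifier triples); it is not part of the current VibEmoji system.

```python
from collections import Counter
from math import sqrt

# Hypothetical data: per-user counts of authored combinations, where each
# combination is an (emoticon_id, animation_id, vibration_id) triple.
UserHistory = Counter  # Counter over tuple[str, str, str]


def cosine_similarity(a: UserHistory, b: UserHistory) -> float:
    """Cosine similarity between two users' combination-frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_combinations(target: UserHistory,
                           others: list,
                           top_k: int = 5) -> list:
    """Suggest combinations popular among users similar to the target user.

    Each other user's combination counts are weighted by that user's
    similarity to the target user; the top-scoring combinations are
    returned as candidate suggestions.
    """
    scores: Counter = Counter()
    for other in others:
        sim = cosine_similarity(target, other)
        for combo, count in other.items():
            scores[combo] += sim * count
    return [combo for combo, _ in scores.most_common(top_k)]
```

In practice, such crowd-based suggestions would still need to be surfaced non-aggressively, i.e., used to reorder rather than restrict the options, to stay consistent with D2.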
In this study, we set out to empirically understand multi-modal emoticons: an emerging phenomenon in current messaging applications that enhances static emoticons with multi-modal effects such as animations and vibrations. In doing so, we built VibEmoji, a user-authoring multi-modal emoticon system for mobile messaging. The design of VibEmoji extends beyond current systems and allowed us to extract new understandings about how users create, share, and experience multi-modal emoticons in daily communications, and how we could better support them by design. We evaluated VibEmoji with 20 participants in naturalistic settings over a period of four weeks. Rich findings were gathered to contextualize the creative usage and meaningful scenarios of multi-modal emoticons, as well as the users' detailed process of authoring. Based on the rich data, we generalize relevant design implications to inform future work in better supporting users' creation and communication of multi-modal emoticons.
ACKNOWLEDGMENTS
We thank Dr. Wei Li from the Human-Machine Interaction Lab, Huawei Canada, for his support and contribution to this project since its very early stage. We thank all our participants for their time and valuable inputs. We would also like to thank our reviewers whose generous and insightful comments have led to a great improvement of this paper.
REFERENCES
MojiBoard: Generating Parametric Emojis with Gesture Keyboards
Use Memoji on your iPhone or iPad Pro
Using thematic analysis in psychology
InTouch: A Medium for Haptic Interpersonal Communication
Emoji Use At All-Time High
5 Billion Emojis Sent Daily on Messenger
ActiVibe: Design and Evaluation of Vibrations for Progress Monitoring
Kirigami Haptic Swatches: Design Methods for Cut-and-Fold Haptic Feedback Mechanisms
Emoji-Powered Sentiment and Emotion Detection from Software Developers' Communication Data
The semiotics of emoji: The rise of visual language in the age of the internet
Are there basic emotions?
Face2Emoji: Using Facial Emotional Expressions to Filter Emojis
The hapticon editor: A tool in support of haptic communication research
New Emoji Requests from Twitter Users: When, Where, Why, and What We Can Do About Them
Customizations and Expression Breakdowns in Ecosystems of Communication Apps
Mediating Intimacy with DearBoard: A Co-Customizable Keyboard for Everyday Messaging
Emoji Use in Twitter White Nationalism Communication
Kineticons: Using Iconographic Motion in Graphical User Interface Design
Emojilization: An Automated Method For Speech to Emoji-Labeled Text
Exploring Embedded Haptics for Social Networking and Interactions
Understanding Diverse Interpretations of Animated GIFs
Characterising the inventive appropriation of emoji as relationally meaningful in mediated close personal relationships. Experiences of Technology Appropriation: Unanticipated Users, Usage, Circumstances
Opico: A Study of Emoji-First Communication in a Mobile Social App
Defining Haptic Experience: Foundations for Understanding, Communicating, and Evaluating HX
Pictogram Generator from Korean Sentences Using Emoticon and Saliency Map
Examining the "Global" Language of Emojis: Designing for Cultural Representation
Advances in collaborative filtering
ReactionBot: Exploring the Effects of Expression-Triggered Emoji in Text Messages
Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies
Photo-Realistic Emoticon Generation Using Multi-Modal Input
Exploring Affective Communication through Variable-Friction Surface Haptics
Remote Handshaking: Touch Enhances Video-Mediated Social Telepresence
POKE: A New Way of Sharing Emotional Touches during Phone Conversations
Data Mining. Cambridge University Press
Lisbon Emoji and Emoticon Database (LEED): Norms for emoji and emoticons in seven evaluative dimensions
Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant
HapTurk: Crowdsourcing Affective Ratings of Vibrotactile Icons
Toward Affective Handles for Tuning Vibrations
VibViz: Organizing, visualizing and navigating vibration libraries
AniSAM & AniAvatar: Animated Visualizations of Affective States
Emoji Accessibility for Visually Impaired People
What are Animoji? How to create and use Apple's animated emoji
Emoticon Recommendation System for Effective Communication
Social signal processing: Survey of an emerging domain
Keep in Touch: Channel, Expectation and Experience
Repurposing Emoji for Personalised Communication: Why Means "I Love You"
SemFeel: A User Interface with Semantic Tactile Feedback for Mobile Touch-Screen Devices (UIST '09)
Using Student Annotated Hashtags and Emojis to Collect Nuanced Affective States
Voicemoji: Emoji Entry Using Voice for Visually Impaired People
Touch without Touching: Overcoming Social Distancing in Semi-Intimate Relationships with SansTouch
VisualTouch: Enhancing Affective Touch Communication with Multi-Modality Stimulation
Hello Emoji: Mobile Communication on WeChat in China. Association for Computing Machinery