Title: An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
Authors: Wang, Xingyao; Jurgens, David
Date: 2021-09-24

Abstract: Online conversations include more than just text. Increasingly, image-based responses such as memes and animated gifs serve as culturally recognized and often humorous responses in conversation. However, while NLP has broadened to multimodal models, conversational dialog systems have largely focused only on generating text replies. Here, we introduce a new dataset of 1.56M text-gif conversation turns and a new multimodal conversational model, Pepe the King Prawn, for selecting gif-based replies. We demonstrate that our model produces relevant and high-quality gif responses and, in a large randomized controlled trial of multiple models replying to real users, we show that our model replies with gifs that are significantly better received by the community.

Conversations are central to many online social platforms. While most conversations are text-based, computer-mediated dialog also affords alternative forms of communication, such as emoji or stickers like bitmoji, that allow users to express themselves (Tang and Hew, 2019; Konrad et al., 2020). Increasingly, these visual forms of communication have become common in social media (Bourlai and Herring, 2014; Highfield and Leaver, 2016), with a notable use of the reaction gif (Bakhshi et al., 2016; Miltner and Highfield, 2017). These gifs are short video sequences that depict a particular scene and sometimes contain text that acts as a meta-commentary (Eppink, 2014). As a result, conversations become multimodal, with individuals replying to one another using combinations of text and gifs (Figure 1). While conversational AI systems have been developed in a purely text-based setting, such systems do not capture the full multimodal behavior seen online. Here, we study multimodal conversation by introducing new dialog models for selecting gif replies in conversation.

Figure 1: An example exchange in which PizzaMagic writes "Ahhhhh!!! The EMNLP deadline is in 24 hours!!" and CasualModel replies with a gif. Gif responses in conversation like this one are embodied dialog that uses visual imagery to convey reactions and emotions. This paper develops a system to select the appropriate gif response to messages. (PDF best viewed with Adobe Acrobat.)

Conversation analysis is central to NLP, and multiple approaches have analyzed this dialog structure (Jurafsky et al., 1998; Pareti and Lando, 2018; Cohn et al., 2019) and developed conversational agents to engage with people (e.g., Fang et al., 2018; Xu et al., 2020; Hong et al., 2020). Recent work has focused on generating open-domain social chatbots that engage in sustained conversations in a natural way (Ram et al., 2018). Because many of these systems are designed to support voice-based dialog, they overlook non-textual forms of interaction used in social media conversations. In parallel, multimodal NLP systems have been developed for image data, often focusing on image-to-text tasks such as image captioning (Melas-Kyriazi et al., 2018; Sharma et al., 2018) and visual question answering (Antol et al., 2015; Huang et al., 2019; Khademi, 2020). More recent work has focused on the reverse text-to-image direction, such as generating an image from a description (Niu et al., 2020; Ramesh et al., 2021).
Our work unites these two strands of research by integrating image-based communication into conversational agents. Our paper offers three main contributions. First, we propose the new task of selecting gif responses in multimodal conversation analysis and introduce a new dataset of 1,562,701 real-world conversation turns with gif replies. Second, we introduce a new model, PEPE THE KING PRAWN, that fuses image and text-based features to select a relevant gif response. In in-house experiments, we show that our model substantially outperforms strong baseline models at selecting the exact gif used in real data and, in a manual test of the quality of the best responses, achieves an nDCG of 0.8145 on the annotated test set. Third, in a real-world test, we deploy our model as part of a large-scale randomized controlled trial and show that the gif replies produced by our model are more highly voted by the community. Data, code, and models are available at https://github.com/xingyaoww/gif-reply.

Gifs have been widely adopted in communication as a natural form of embodied speech in which the visual imagery conveys emotions or a reaction as a response (Bakhshi et al., 2016; Tolins and Samermit, 2016). These gifs commonly come from widely-known cultural products, such as movies or television shows, which provide common knowledge for how they could be interpreted (Eppink, 2014; Miltner and Highfield, 2017). However, a single gif may have multiple interpretations, depending on the context, cultural knowledge of its content, and the viewer (Jiang et al., 2017). As a result, a single gif can serve multiple functions in communication (Tolins and Samermit, 2016). Gifs have grown in their use through increasing affordances by platforms like Tumblr, Reddit, Imgur, and Twitter that allow gifs to be natively displayed like text in conversation threads (Jiang et al., 2018). Further, gif-based keyboards have been introduced that allow users to search for gifs that have been tagged with keywords or other metadata (Griggio et al., 2019). Yet, these technologies require that gif data be prepared with sufficient tags to be searchable or have sufficient data to use collaborative filtering techniques for recommendations (Jiang et al., 2018, p. 9). As a result, there is a clear gap in identifying appropriate response gifs directly from the text, which this work fills.

Despite the widespread use of gifs, no standard dataset exists for text and gif replies. Further, although platforms like Twitter support gif replies, these gifs are not canonicalized to identify which responses correspond to the same gif. Therefore, we construct a new dataset for this task by collecting responses, matching their images, and augmenting this data with metadata about the gif, where possible. A visual description of the whole procedure can be found in Appendix Figure 7.

Gifs have many uses (Miltner and Highfield, 2017), and so we use a two-step approach to collect data that focuses specifically on gifs likely to be used in conversation. First, gif responses are collected from Twitter by identifying all replies to English-language tweets containing animated_gif as embedded media. Tweets were collected from a ∼10% sample of Twitter from March 13th, 2019 to Jan 24th, 2020, totaling 42,096,566 tweets with a gif that we were able to retrieve.
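As a rough illustration of this first collection step, the sketch below filters a JSONL dump of the tweet sample for replies that carry an animated gif. The field names assume Twitter's v1.1-style tweet objects, and matching each reply to its English-language parent tweet is omitted, so treat this as a simplified sketch rather than the authors' actual pipeline.

```python
import json

def is_gif_reply(tweet: dict) -> bool:
    """True if a v1.1-style tweet object is a reply whose attached media
    includes an animated gif (the parent tweet's language is checked separately)."""
    if tweet.get("in_reply_to_status_id") is None:
        return False
    media = tweet.get("extended_entities", {}).get("media", [])
    return any(m.get("type") == "animated_gif" for m in media)

def collect_gif_replies(jsonl_path: str):
    """Stream a decahose-style JSONL dump and yield gif replies."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            tweet = json.loads(line)
            if is_gif_reply(tweet):
                yield tweet
```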
Twitter does not canonicalize its gifs, so two separate gif files may actually contain the same imagery. Further, these files may not be identical due to small differences such as color variations or aspect ratios. To identify uses of the reference gifs, we use Average Hash from the imagehash library to create low-dimensional representations of each gif where hash distance corresponds to perceptual distance. Since gifs are animated and may contain varying scenes, we compute the hash for the first, middle, and final frames, concatenating these into a single hash. Two gifs are considered the same if (i) they have identical hashes or (ii) their Hamming distance is < 10 and gifs with that hash have been used more than 500 times on Twitter. This latter condition was selected after manually evaluating thresholds to trade off between increasing the size of the training data and reducing potential noise caused by matching errors. A visual example of this process can be found in Appendix Figure 8.
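A minimal sketch of this frame-hashing and matching step is shown below, assuming the imagehash and Pillow libraries. Representing the concatenated hash as a list of three per-frame hashes (and summing their Hamming distances) is our simplification, and the usage-count bookkeeping is omitted.

```python
import imagehash
from PIL import Image, ImageSequence

def gif_hash(path, hash_size=8):
    """Average Hash of the first, middle, and final frames of a gif."""
    frames = [f.convert("RGB") for f in ImageSequence.Iterator(Image.open(path))]
    picks = [frames[0], frames[len(frames) // 2], frames[-1]]
    return [imagehash.average_hash(f, hash_size=hash_size) for f in picks]

def same_gif(hash_a, hash_b, reference_use_count,
             max_distance=10, min_uses=500):
    """Mirror the paper's heuristic: identical hashes always match; near
    matches are accepted only for frequently used reference gifs."""
    distance = sum(a - b for a, b in zip(hash_a, hash_b))  # '-' gives Hamming distance
    if distance == 0:
        return True
    return distance < max_distance and reference_use_count > min_uses
```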
Not all gif responses in the Twitter data are conversational or appropriate for wider re-use. Therefore, we filter these responses to only those gifs whose imagery matches gifs hosted by the Giphy website, which is the backend for many gif-based keyboards. Giphy contains a wide collection of gifs that are curated to remove content inappropriate for general use (e.g., violent or sexual imagery). Gifs on the platform are categorized (e.g., "reaction" or "celebrities"), and we identify 28 categories containing 972 keywords likely to contain gifs used in conversation. A total of 2,095,993 gifs linked to those keywords were ultimately retrieved and stored as image hashes. Additional details of categories and keywords are in Appendix B.

After matching image hashes to filter replies, we identify 115,586 unique gifs, referred to as reference gifs, and 1,562,701 tweet replies using one of these gifs, which forms our official dataset. Figure 2 shows these gifs' frequency in the data; much like words, a handful of gifs receive widespread use, while a long tail of gifs are rarely used.

Figure 2: The frequency distribution of gifs in our data roughly follows a log-normal distribution, with a few gifs used often, while a long tail of gifs are used relatively infrequently.

We augment our gif data with information about their content. First, some gifs have text that transcribes what a person is saying in the gif's scene or is a meta-commentary on the content. This text is extracted using paddleOCR (Du et al., 2020). Since some gifs are long enough to contain multiple utterances, we run OCR on four frames sampled from each quartile of the gif's length. Roughly 50% (58,020) of gifs contain at least one extracted word from the selected frames, with a mean of 5.5 extracted words per gif across the dataset.

Second, some gif repositories like Giphy allow users to tag gifs with information on their content or theme, e.g., "face palm" or "movie." We collect tags for the 115K reference gifs used in Twitter, obtaining 39,651 unique tags. These user-generated tags were moderately noisy due to orthographic variations like spelling, capitalization, and spacing. Therefore, we merge tags by (i) lower-casing the text and (ii) performing a manual merge of similar word forms (e.g., "excited" and "exciting"). To minimize noise, we retain only tags that have been used with at least five gifs and where those gifs have been used at least 1,000 times in total; this process removes many low-frequency tags that are either overly specific or idiosyncratic in their use. Finally, we performed a manual inspection of all remaining tags to remove tags that are too general (e.g., "emotion") and retain only noun, adjective, and verb tags (words or multi-word expressions) that describe specific emotions or actions. A total of 241 unique tags were retained (Appendix C). Only 6.0% of gifs have at least one tag associated with them (mean 1.9 tags). However, these tagged gifs account for 38.7% of the replies in our dataset, suggesting tags are only available for more-popular gifs. Our dataset represents roughly an order of magnitude more data and more tags than the closest related dataset of Chen et al. (2017), which contained 23K gifs with 17 manually-curated emotions.

We introduce a series of models for producing a gif response in conversation. Each model selects a gif from the 115K gifs in our dataset as a response to a text-based message. This task is related to but distinct from work on image-text matching (Lee et al., 2018), which aims to find an image describing a piece of text, or text-to-image generation (e.g., Wen et al., 2015; Xu et al., 2018), which generates an image from a text description. Here, we aim to select gifs that reflect natural continuations or reactions to a message in a dialog, akin to how gifs are used in social media. For all models, additional details on the training procedures and hyperparameters are provided in Appendix A. The three models that follow use varying degrees of information about the gifs and text to select a response.

The first model uses tags as a shared representation for characterizing gifs and text. Analogous to how object tags are used as anchor points for image-text matching (Li et al., 2020) and pivot languages are used in machine translation (Cheng et al., 2017), we use tags to bridge information between the text in a tweet and the visual content of a gif. Here, each gif becomes associated with a set of tags describing its conversational functions and, for each text, we predict the set of tags for gif responses to it, in essence predicting what types of responses are most appropriate. We describe both of these processes next and how gifs are ultimately selected.

Estimating Gif Tags. Only 6.0% of the gifs in our data have associated tags. Therefore, we train a neural model to predict tags, using the known tags as training data. To capture any changes in emotion or imagery across the gif, we make separate predictions for four frames sampled across the gif (the same frames used in §3.2). Each frame is passed through an EfficientNet-based (Tan and Le, 2019) GIF encoder, shown in Figure 3, to extract a low-dimensional feature vector from each frame. These frame embeddings are fused using the attention mechanism from a transformer encoder layer. The output of the transformer feeds into a fully connected layer, which is trained as a multi-label classifier using binary cross-entropy to predict which tags should be present.

Predicting Response Tags for Text. For each message, we predict the k-hot distribution of tags for a gif response by training a BERTweet model (Nguyen et al., 2020), which has been pre-trained on a large corpus of Twitter data (shown as "Tweet Encoder" in Figure 3). The model, with an additional fully connected layer, is trained as a multi-label classifier using binary cross-entropy, using the tags of the gifs used in reply (if known).
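A minimal sketch of this text-to-tag classifier is shown below, assuming the public vinai/bertweet-base checkpoint, simple first-token pooling, and PyTorch; the class name and pooling choice are ours rather than details stated in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TweetToTagClassifier(nn.Module):
    """BERTweet encoder plus a fully connected multi-label head over the
    241 retained tags, trained with binary cross-entropy."""

    def __init__(self, num_tags=241, model_name="vinai/bertweet-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # first-token ([CLS]-style) pooling
        return self.head(pooled)               # raw logits, one per tag

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = TweetToTagClassifier()
loss_fn = nn.BCEWithLogitsLoss()               # multi-label objective

batch = tokenizer(["the EMNLP deadline is in 24 hours!!"],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
target = torch.zeros_like(logits)              # toy k-hot tag vector
loss = loss_fn(logits, target)
```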
Tag-based Gif Selection. At inference time, given a message, we use the text-to-tag model to predict a k-hot distribution over tags. Then, we select the gif whose estimated tag distribution is closest in Euclidean distance.

The second model uses an end-to-end training approach based on the architecture of OpenAI CLIP (Radford et al., 2021). The architecture features two encoders, one for text and one for images. During training, the encoders are updated using a contrastive loss that maximizes the cosine similarity of paired image-text representations and minimizes the cosine similarity of random pairs of images and texts. We replicate the CLIP architecture and training procedure, using BERTweet to encode text and EfficientNet (Tan and Le, 2019) to encode a composite image of four frames from the gif (compared with BERT and ResNet in their implementation). While CLIP was originally designed to select an image for a text description, our model is trained to select a gif reply for a text message, a more challenging task than the image retrieval task used in the original CLIP setup, as the message may not contain words describing elements of the gif. At inference time, given a tweet, we use the trained tweet encoder to extract its representation and compute its cosine similarity with each encoded representation for our gifs. The gif with the highest cosine similarity is returned as the best response.

Our final model, PEPE THE KING PRAWN (referred to as PEPE), selects gif responses by using a richer set of multimodal features to create a gif representation. Rather than encode the gif solely from its image content, we use a multimodal encoder that captures (i) any text it might have, (ii) the types of objects present in the gif, and (iii) object regions as visual features. We encode these gif aspects using an OSCAR transformer (Li et al., 2020) to create a unified representation, shown in Figure 3 (bottom). Object names and region-of-interest feature vectors are extracted using a pre-trained bottom-up attention model (Anderson et al., 2018). As input to the OSCAR encoder, the captions of each of the gif's four frames are concatenated together with an "[INTER_FRAME_SEP]" separator token. We filter the object regions detected by the bottom-up attention model and keep all objects with probability >0.5. We then concatenate object names together, with the same inter-frame separator between names from different frames. Together, the caption text, object names, and image-region features are fed into the OSCAR transformer encoder to generate a GIF feature vector; the transformer is initialized with the default OSCAR weights. We use BERTweet to encode text. The entire PEPE model is trained end-to-end using a contrastive loss, similar to the CLIP model.
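A minimal PyTorch sketch of this CLIP-style contrastive objective, which is shared by the CLIP variant and PEPE, is shown below; the temperature value and function name are illustrative rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(tweet_emb, gif_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired tweet/gif embeddings:
    matched pairs lie on the diagonal of the cosine-similarity matrix and are
    pushed above all in-batch mismatched pairs."""
    tweet_emb = F.normalize(tweet_emb, dim=-1)
    gif_emb = F.normalize(gif_emb, dim=-1)
    logits = tweet_emb @ gif_emb.t() / temperature   # (B, B) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_t = F.cross_entropy(logits, labels)         # tweet -> gif direction
    loss_g = F.cross_entropy(logits.t(), labels)     # gif -> tweet direction
    return (loss_t + loss_g) / 2
```

At inference time, only the tweet needs to be encoded; the gif embeddings can be precomputed once and ranked by cosine similarity.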
We initially evaluate the methods in two ways. First, we use a traditional classification-based evaluation, testing whether the models can reproduce the observed gif replies. However, some messages could have multiple valid gif responses. Therefore, as a second test, we evaluate the models in a retrieval setting, measuring whether their most-probable responses are of good quality for a message.

Experimental Setup. Models are trained and tested on a dataset containing 1,562,701 tweet-gif pairs associated with 115,586 unique gifs, where 605,063 tweet-gif pairs are associated with at least one tag. Using the finalized 241 unique tags as classes for multi-label classification, we split the dataset by stratifying on tags using the iterative train-test split method provided by the scikit-multilearn library (Sechidis et al., 2011; Szymański and Kajdanowicz, 2017) to create an 80:10:10 train, dev, and test split used to train the models described in §4. Following BERTweet (Nguyen et al., 2020), we preprocess tweets in our dataset using the NLTK TweetTokenizer for tokenization and the emoji package to translate emotion icons, and we convert mentions and links to special "@USER" and "HTTPURL" tokens.

To test whether each model's predictions are valid responses, we annotate the ten most-probable gif predictions for a subset of the tweets in our test data. Many tweets in our test set require substantial context to understand because they have few tokens, link to URLs that provide extra knowledge, or mention other users in directed communication. These factors suggest that social context or general knowledge aids the recipient's understanding of the gif's intentions. While the models can still benefit from training on such examples, judging the appropriateness of a response is difficult without access to the social context. Therefore, to reduce interpretation ambiguity, we annotate only tweets without URLs or user mentions and having at least 10 tokens. This process selects tweets with sufficient content to judge appropriateness independent of the larger social context.

Two annotators (the authors) were shown a list of potential gif responses for a tweet and asked to judge whether each was an appropriate gif response (a binary rating). Gifs were selected from the ten most-probable replies for each system and collectively shown in random order to prevent the annotators from knowing which system generated each reply. A total of 2,500 gif-tweet pairings were annotated. Annotators attained a Krippendorff's α of 0.462; while this is moderate agreement, it is expected given known differences in how people interpret and value gif responses based on their familiarity with the content, message interpretation, and life experience (Jiang et al., 2018). We follow the evaluation setup from other retrieval-based dialog systems (e.g., Yu et al., 2021; Kumar and Callan, 2020) and use normalized Discounted Cumulative Gain (nDCG), which measures whether more appropriate gif responses are ranked higher. A gif's appropriateness score is the sum of the annotators' ratings.
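For concreteness, the standard nDCG computation over one message's ranked gif list looks roughly like the following (our own sketch; the paper does not specify its exact tie handling or cutoff).

```python
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(relevances):
    """nDCG for one message: `relevances` are the summed annotator ratings of
    the returned gifs, in the order the model ranked them."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# e.g., two annotators, binary ratings, ten ranked gifs for one tweet:
print(ndcg([2, 1, 0, 2, 0, 0, 1, 0, 0, 0]))
```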
Results. The PEPE model was able to identify relevant and good-quality gif responses, as shown by its performance on the test data (Table 1) and the annotated data (Table 2). Performance on the test set is expected to be low, given the challenge of identifying the exact gif used for a tweet when multiple possible gifs are likely to be equally valid. However, the PEPE model is still able to identify the exact gif (out of 115K) in its top 10 predictions for 3% of the data, substantially outperforming all other models.

Performance on the annotated data (Table 2) provides a more realistic assessment of whether models can generate high-quality replies, as it measures whether the models' replies themselves were good. The PEPE model attains substantially higher performance (p<0.01) than the other models. While the CLIP variant model performs well, the content-agnostic Distribution sampling baseline performs nearly as well. This baseline's high performance speaks to the multiple interpretations of gifs and the ease with which readers can make connections between a gif and a message. Indeed, even the random-gif model has a non-zero nDCG, highlighting the ability of an arbitrary gif to still be considered appropriate. We speculate that popular gifs may be popular because of this ease of multiple interpretations. Table 4 shows the top predictions for models and baselines for two example messages, illustrating the variety of relevant gifs; the PEPE and random baseline replies for the second message exemplify the type of gif that can be widely applied to many messages, often to humorous effect.

Table 4: Top gif predictions from the Tag-based, CLIP variant, PEPE, Distribution sampling, and Random models for two example messages: "That wonderful feeling you get when you arrive to a business dinner that you're supposedly paying for...and realize you've forgotten your credit card" and "I'm convinced some of y'all don't get laid".

Ablation study. PEPE fuses multiple types of input, which may each uniquely contribute to the model's ability to select gif replies. To understand how these inputs contribute, we performed an ablation study on the annotated test set by removing one input from the Oscar GIF Encoder shown in Figure 3 (i.e., a gif's caption, object names, or objects' visual features) and evaluating the model's resulting gifs on the same test instances. The ablated model performances, shown in Table 3, reveal that each input is useful for selecting gifs. Object features capture visual information about what specifically is present in the gif (beyond the discrete names of what is present, e.g., "person" or "building") and show that multimodality is important for high performance: predicting replies just from a gif's caption and categorized content is insufficient. Similarly, the caption of a gif (if present) is important, as the text can help make explicit the intended interpretation of a gif.

Table 3: Results for ablated versions of PEPE where a specific input is removed (cf. Table 2) show that all input forms contribute to the ability to select replies.

To test the generalizability of our models and the quality of their responses, we conduct a large-scale randomized controlled trial (RCT) in which the models respond to real users and we measure those users' perception of reply quality. Gifs were posted to the Imgur platform, which is a highly active social media community that supports both image and text-based interactions. On Imgur, users may create posts, which contain one or more images with optional commentary, or comment on posts or replies. Similar to pre-2018 Twitter, comments are limited to 140 characters. Imgur conversations are threaded and frequently contain both image and text comments. Like Reddit, users may upvote and downvote content, providing a score of how well it was received by the community; we use this score in our experiments to evaluate quality.

Our experiment focuses on generating gif-based replies to top-level text comments (comments made directly to the post). This setup mirrors the conversational data our models were trained on. Imgur supports several ways of filtering its stream of posts. To ensure that our replies have sufficient visibility, we select posts that have already received 10 comments and appear in the "most viral" sorting. From these posts, we reply to the top-rated text comment. The RCT runs from 8 AM to 8 PM (local time), making at most 10 replies per hour.

Not all topics or comments are suitable for automated responses, and great care was taken to prevent potential harm to the community. Through multiple rounds of testing which replies would be responded to, we curated a list of keywords that could lead to potentially controversial replies, such as terms about religion or race (full list in Appendix D). Any comment containing a token or lemma matching a word on this list is excluded and not replied to.
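A sketch of this safety filter is shown below. The use of spaCy (with the en_core_web_sm model installed) for lemmatization is our assumption, since the paper only states that tokens and lemmas are matched, and the blocklist shows just the excerpt reproduced in Appendix D.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Excerpt of the Appendix D blocklist; the deployed list was longer.
BLOCKLIST = {"depression", "depressing", "mental", "health"}

def is_safe_to_reply(comment: str) -> bool:
    """Exclude any comment whose surface tokens or lemmas hit the blocklist."""
    doc = nlp(comment)
    return not any(tok.lower_ in BLOCKLIST or tok.lemma_.lower() in BLOCKLIST
                   for tok in doc)
```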
As a further safeguard, experimenters monitored all replies to remove any that were deemed inappropriate. See the Ethics Section (§9) for a longer discussion of safeguards.

The field experiment consists of five arms, corresponding to the three trained models and the two baseline models. During each trial, one model is selected and generates a response; the trained model replies with its most probable gif. Not all models are equally likely to perform well, so to make the most of our trial budget we use Thompson sampling (Russo et al., 2018) to randomly select which arm of the trial to use. Thompson sampling builds a probability model for the estimated reward of each arm (here, the score a reply receives) and samples from the model such that higher-rewarding arms are sampled more frequently. As a result, this method can provide tighter estimates of the reward for the most useful arms. Scores on Imgur have a skewed distribution, with few comments receiving very high scores and most receiving close to the default score (1). Therefore, we use Poisson Thompson sampling. Some comments may be downvoted to receive scores below zero, so for simplicity, we truncate these scores to 0. We initialize the reward estimates for our experiment by selecting one of the five models in a round-robin manner to reply to an Imgur comment for 3 days. These initial scores act as priors for Thompson sampling to update the Poisson distributions for each model. In the trial, we choose a model by sampling from the updated distributions, using all previous days' scores as the prior.
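As an illustration, Poisson-reward Thompson sampling with a conjugate Gamma prior can be sketched as follows; the prior hyperparameters and the toy counts are ours, not values from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_arm(totals, counts, alpha0=1.0, beta0=1.0):
    """Poisson Thompson sampling with a conjugate Gamma prior on each arm's
    reward rate. `totals[i]` is the summed (truncated) score of arm i's past
    replies and `counts[i]` how many replies it has made."""
    sampled_rates = rng.gamma(shape=alpha0 + totals, scale=1.0 / (beta0 + counts))
    return int(np.argmax(sampled_rates))

# Five arms: tag-based, CLIP variant, PEPE, distribution sampling, random.
totals = np.array([40.0, 22.0, 55.0, 30.0, 28.0])   # toy running score sums
counts = np.array([30.0, 30.0, 30.0, 30.0, 30.0])   # toy reply counts
arm = choose_arm(totals, counts)
```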
The experiment ran from April 15th, 2021 to August 30th, 2021, and the models generated a total of 8,369 replies. To evaluate the results of the RCT, we construct a Negative Binomial regression on the dependent variable of the score received by a model's reply, truncating negative scores to zero. The Negative Binomial was chosen instead of Poisson due to over-dispersion in the score variable. The models are treated as a categorical variable, using the random model as a reference. Since the score will depend, in part, on the attention received by the parent post and comment (higher-rated comments are displayed first), we include linear effects for the post and parent comment. Finally, we include five text-related variables to control for the content of the parent comment: the topic distribution (Appendix Table 9) from a 10-topic model (dropping one topic due to collinearity), the sentiment and subjectivity of the message estimated using the TextBlob library, the length of the comment, and whether the comment contained a question.

The field experiment demonstrates that the PEPE model is able to generate significantly higher-scoring responses. Figure 4 shows the Negative Binomial regression coefficients for the three models and the empirical-distribution baseline, with the random-gif model as a reference; full regression results are shown in Appendix Table 6. The PEPE model substantially outperforms all other models (p<0.01) in this real-world setting. Surprisingly, despite performing second-best in our annotated evaluations, the CLIP model performs worst, with its replies receiving fewer upvotes than the two baselines that randomly select gifs. We investigate potential explanations for these performances next.

The Random and Distributional-sampling baseline models perform surprisingly well relative to the models that take the text and gif content into account, with only the PEPE model outperforming them. The performance of the random baselines matches prior work showing people are still able to draw some connection between their interpretation and the reply (Madden, 2018, p. 29). Further, we observed that, when the model's reply truly seemed random, some users replied saying they upvoted solely because they enjoyed the gif.

As a follow-up experiment, we tested whether models could be getting higher (or lower) scores by repeatedly picking the same gifs that are skewed towards a positive or negative reaction. Figure 5 shows the score distribution for the ten most frequently used gifs (visual examples in Appendix Table 7) for each of the three trained models and reveals surprisingly divergent behavior in how the community reacts. Each model had a different set of most-used gifs, indicating the models did not converge to a universal set of common replies. Indeed, a gif's frequency of use and mean reply score were uncorrelated in all three models (r ≈ -0.01, p>0.73 for all models). The most-used gifs for each model had average scores that were positive, but the distributions for each gif show that some uses were occasionally downvoted. This high variance in scores indicates that a gif's intrinsic qualities are not solely responsible for the received score; instead, appropriate use in context plays a significant part in community reception.

We also examined whether models relied on the same set of gifs. Figure 6 shows the distribution of gif uses by each model, indicating that the tag-based model relied frequently on a small set of gifs. However, the PEPE and CLIP variant models were substantially more varied, indicating they draw from the long tail of possible gifs.

Do any of our models spark more subsequent conversation? We fit a separate Negative Binomial regression on the total number of comments made to our reply, using the same IVs as the score regression and including the reply's score itself as another IV. This model (Appendix Table 8) shows that both the distributional-sampling baseline and the PEPE model produced replies that led to fewer subsequent comments (p<0.01), despite the PEPE model having the most-upvoted replies. However, the score of the gif reply was positively associated with the number of subsequent comments (p<0.01), indicating that more appropriate replies do receive more subsequent conversation. We speculate that the random models may have led to more conversation due to users replying to express confusion about why a particular gif was used. This result points to a need to understand which textual and visual factors in gifs influence the volume of subsequent dialog and an opportunity to optimize gif models for both quality and number of conversation turns.
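A sketch of this kind of regression in statsmodels is shown below. The column names are hypothetical stand-ins for the covariates listed above, and the GLM Negative Binomial family fixes the dispersion parameter, whereas a full analysis would typically estimate it (e.g., with smf.negativebinomial).

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical column names standing in for the covariates described above
# (model arm, post/parent-comment scores, sentiment, subjectivity, length,
# question indicator); the topic proportions are omitted for brevity.
df = pd.read_csv("rct_replies.csv")
df["score"] = df["score"].clip(lower=0)      # truncate downvoted replies to 0

fit = smf.glm(
    "score ~ C(model, Treatment(reference='random')) + post_score "
    "+ parent_score + sentiment + subjectivity + comment_length + is_question",
    data=df,
    family=sm.families.NegativeBinomial(),
).fit()
print(fit.summary())
```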
This work draws upon two strands of research from dialog systems and multimodal NLP. Conversational dialog systems have traditionally been built upon large-scale dialog corpora from social media platforms (Bessho et al., 2012) such as Twitter. Our approaches are fundamentally information-retrieval-based systems that mirror the approach of text-based conversational systems that retrieve existing messages from a large social media corpus as potential replies and rank these to select a response. Our work mirrors models that use neural networks for ranking (e.g., Yan et al., 2016; Inaba and Takahashi, 2016; Penha and Hauff, 2021); however, we note that many recent knowledge-grounded and open-domain models use encoder-decoder methods to improve versatility and applicability (e.g., Ghazvininejad et al., 2018; Gao et al., 2019; Zhou et al., 2020). Generative approaches are likely inappropriate for gif-based conversation, as gifs are more akin to mimetic artifacts that build on cultural knowledge (Eppink, 2014), making synthesizing a new gif from scratch likely less effective.

All three models used here rely on joint embedding spaces for gif and text. Multiple approaches in NLP have been proposed to align these representations (Kiros et al., 2014; Wang et al., 2016), often for particular applications such as visual question answering (Antol et al., 2015). Recent work has focused on embedding these media with a single encoder that takes both text and images as input (e.g., Wang et al., 2019; Chen et al., 2020), in contrast to our model, which uses separate image and text encoders (Figure 3); such multimodal encoders are prohibitively computationally expensive to use in our setting at inference time, as the model would need to be run on each gif (and message) to rank replies, compared with our model, which only needs to encode the text. However, performance and efficiency improvements in aligning image and text representations would likely benefit our task.

People like using gifs in online conversations: gifs are a fun and playful way to communicate. However, modern NLP conversational agents operate only through text. Here, we introduce a new dataset of 1.56M conversation turns using gifs, including captions and metadata, and develop a new conversational model, PEPE THE KING PRAWN, that selects appropriate gif responses for messages by comparing encoded gif and text representations. In two evaluations, we show that PEPE is able to generate highly relevant gif responses, and in a large-scale RCT, we show that the gif replies from the PEPE model received significantly higher scores from the general public. Our work demonstrates the opportunity for using NLP methods to successfully engage in multimodal conversations.

The interactive nature of the RCT necessitated a close consideration of ethical issues (Thieltges et al., 2016). Prior to beginning the RCT, the study team obtained IRB approval to interact with users. While necessary in the legal sense, IRB approval is not sufficient to justify the ethical grounds of the study. The primary risks of the study are if the automated models respond with an inappropriate gif or respond to a message that is not suitable for an automated response (e.g., one discussing the death of a loved one or making an offensive statement). These risks were mitigated in multiple ways throughout the dataset construction and field experiment. First, the selection criteria for which comments we reply to were designed to only reply to content that was already deemed appropriate by the community.
By selecting only posts that had received sufficient upvotes to be called "viral" and were already receiving comments, we mitigate the risk of engaging in topics or conversations that are inappropriate according to the norms of the Imgur community, as such posts would be removed by moderators or would have received sufficient downvotes to stay in obscurity. Second, by focusing on the top-voted comment on these posts, we again reply to content that has already been deemed high-quality by the community. This comment-level criterion substantially lowers the risk of our models commenting on inappropriate comments (e.g., a comment insulting another user), as these comments are readily downvoted by the community prior to our intervention. Third, we employed extensive filtering to avoid replying to any comment containing a potentially sensitive topic, e.g., a discussion of race or trauma (keywords are listed in Appendix D). The initial set of keywords was developed by examining potentially sensitive topics and then iteratively extended by simulating which messages our RCT would reply to and examining whether replying would be appropriate. During the field RCT, experimenters continuously monitored the comments to ensure no harm was being done. Ultimately, only three comments were removed during the initial two days, due to a bug in the lemmatization; these comments should have been filtered out by our earlier criteria, were removed quickly, and we did not observe any notable response from the community. Fourth, one risk is replying with an inappropriate gif, which is mitigated by the use of Giphy to seed our initial gifs. As this platform is curated and does not host objectively offensive gifs (e.g., overly violent content), our initial gif set is relatively free of objectionable gifs. Because our model learns directly from gifs' frequency of use, unless objectively offensive gifs are widely used, they are unlikely to be deployed in our RCT; we speculate that few objectively offensive gifs are widely used and, in practice, we have not identified any during the study period or when examining hundreds of random gifs in our data (or used in the RCT). Finally, one risk is that, by learning gif responses from observed data, our models may reinforce cultural stereotypes that are encoded in the gifs themselves (Erinn, 2019), e.g., the association of African American individuals with strong emotions. While our gif data is relatively clean of overtly offensive gifs, we acknowledge that our model likely does inadvertently perpetuate some of these latent biases in the data. However, the success of our model suggests a future mitigation strategy for platforms suggesting gifs: as biases become known, our approach can be used to suggest less-biased gifs as potential responses to mitigate future harm.

In the following, we provide additional details on how each of the three models was trained.

EfficientNet-based Tag Classifier. Gifs are reshaped to 224 by 224 pixels, keeping the aspect ratio via padding, and normalized to a mean of 0.5 and standard deviation of 0.5 for each channel before being fed into the EfficientNet-based model. We selected the unique GIFs from the finalized dataset that have at least one associated tag and used the iterative train-test split on the k-hot tag representations to hold out 5% of those GIFs for validation. The EfficientNet tag classifier was trained for 100 epochs with a batch size of 32, using the AdamW optimizer with learning rate 1e-5 and weight decay 1e-3. The best validation performance was achieved at the 40th epoch with a macro-F1 of 0.30 in predicting the 241 multi-label classes. Early experiments showed that a transformer encoder layer (macro-F1 of 0.30) outperforms a linear layer (macro-F1 of 0.19) in fusing multi-frame gif features on the development set; therefore, a transformer encoder layer is used to fuse the features of different frames in our implementation.
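A minimal PyTorch sketch of this frame-fusion architecture is shown below. The choice of torchvision's efficientnet_b0 as the backbone, the projection size, and mean-pooling over frames are our assumptions, since the paper does not specify the EfficientNet variant or the pooling used after the transformer layer.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class GifTagClassifier(nn.Module):
    """Per-frame EfficientNet features fused with a transformer encoder layer,
    followed by a multi-label tag head over the 241 tags."""

    def __init__(self, num_tags=241, d_model=512):
        super().__init__()
        backbone = efficientnet_b0(weights="DEFAULT")      # downloads pretrained weights
        self.frame_encoder = nn.Sequential(backbone.features,
                                           nn.AdaptiveAvgPool2d(1),
                                           nn.Flatten())   # -> 1280-d per frame
        self.proj = nn.Linear(1280, d_model)
        self.fuse = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                               batch_first=True)
        self.head = nn.Linear(d_model, num_tags)

    def forward(self, frames):              # frames: (B, n_frames, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1))    # (B*t, 1280)
        feats = self.proj(feats).view(b, t, -1)             # (B, t, d_model)
        fused = self.fuse(feats).mean(dim=1)                # pool over frames
        return self.head(fused)                             # tag logits

model = GifTagClassifier()
loss_fn = nn.BCEWithLogitsLoss()            # multi-label objective
logits = model(torch.randn(2, 4, 3, 224, 224))
loss = loss_fn(logits, torch.zeros_like(logits))
```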
Tweet-to-tag classifier. Using the finalized dataset mentioned in §3, we use the tweet as input and the k-hot tag representation of that tweet instance as the ground-truth label to train the multi-label classifier, along with the tweet encoder, for the 241 classes. Additionally, we filter out tweets from the finalized dataset that do not have corresponding tags before training. The model with the best validation performance is selected for the subsequent evaluation and field experiments. The tweet encoder was trained for 100 epochs with a batch size of 32. The learning rate was set to 1e-5 with 1e-3 weight decay using the AdamW optimizer. The best validation macro-F1 of 0.07 was achieved at the 70th epoch. The evaluation performance for model selection is measured by nDCG. For every tweet-gif pair in the validation set, we take the top 30 predicted GIFs from the model using the tweet as input. The relevance of an occurring ground-truth gif in the top 30 predictions for a tweet is set to 1 for the nDCG calculation.

The CLIP variant is trained on the same finalized dataset using contrastive loss. It was trained for 16 epochs with a batch size of 16 using the AdamW optimizer with learning rate 1e-5 and weight decay 1e-3. The best validation performance was achieved at epoch 6 with an nDCG value of 0.015. We replace the transformer encoder layer with a linear layer in the EfficientNet GIF Encoder from Figure 3 and use this as the GIF encoder for the CLIP variant. Image inputs to the GIF encoder are normalized following the official CLIP implementation.

The PEPE model follows most configurations from the CLIP variant model but replaces the EfficientNet GIF encoder with an Oscar GIF encoder based on the pre-trained Oscar multimodal transformer (Li et al., 2020). Extra metadata are extracted from the GIFs in the finalized dataset for further training. Captions within the GIF are extracted using PaddleOCR (Du et al., 2020), and only extracted text with probability greater than 0.9 is kept as caption metadata. Object tags and their corresponding features are extracted with bottom-up attention (Anderson et al., 2018) using the py-bottom-up-attention package. Object instances are filtered to keep only instances with a score higher than 0.5, and object tags and their corresponding features are then extracted from these instances. Final object features of dimension 2054 are obtained by concatenating the 2048-dimensional feature output from Faster R-CNN with the scaled box position coordinates of the object, following Li et al. (2020). The PEPE model is trained on the finalized dataset with the extracted caption and object metadata. It was trained for 16 epochs with a batch size of 8 using the AdamW optimizer with learning rate 1e-6 and weight decay 1e-3. Preprocessing for GIFs is […]
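As an illustration of the 2054-dimensional object features described above, one common convention for the appended box geometry (which matches the reported dimensionality) is shown below; the paper does not spell out the exact six values, so treat this as an assumption.

```python
import numpy as np

def oscar_region_feature(fc_feat, box, img_w, img_h):
    """Append scaled box geometry to a 2048-d Faster R-CNN region feature.
    The six values here are (x1/W, y1/H, x2/W, y2/H, w/W, h/H), an assumed
    convention consistent with a 2054-d output."""
    x1, y1, x2, y2 = box
    pos = np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                    (x2 - x1) / img_w, (y2 - y1) / img_h], dtype=np.float32)
    return np.concatenate([fc_feat, pos])    # shape (2054,)

feat = oscar_region_feature(np.zeros(2048, dtype=np.float32),
                            box=(10, 20, 110, 220), img_w=480, img_h=360)
assert feat.shape == (2054,)
```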
C List of retained tags (partial)

congratulations, cool, crazy, creeping, cringe, crushing, cry, curtsy, cute, damn, dance, dancing, deadpan stare, debate, depressed, dickhead, disagree, disappointed, disapprove, disbelief, disgust, dislike, diss, divertente, dont care, doubt, doubtful, drink, drinking, drunk, dubious, dying, eating, eating popcorn, embarassed, engrossed, ennui, excited, face palm, faint, fingers crossed, flirt, flushed, freaking out, frustrated, fuck, fun, funny, gagging, get well, glare, good luck, gossip, grateful, gratitude, great, great job, grin, hahahah, happy, happy dance, head shake, hide, high five, hilarious, honestly, hope, horror, hugs, hugs love, hysterical, ill, impressed, incredulous, insult, interested, interesting, judge, judging you, just, keep going, kiss, laugh, leaving, lets go, lies, like, looking, looking around, love you, lovely, luv u, luv you, mad, mind blown, mock, motivational, moved, muah, much appreciated, nah, nasty, need, nervous, nice one, no, nod, not amused, not funny, not interested, oh shit, overwhelmed, panic, partying, perfect, pissed, please, pleased, pointing, praise, pray, pregnant, proud, pumped, questioning, raises hand, realization, relief, respect, reunited, roast, roll eyes, sad, sadness, salute, sarcastic, savage, scared, scary, screaming, secret, seriously, sexy, shame, shock, shook, shrug, shut up, shy, sigh, sips tea, sitting, sleepy, sloth, smart, smile, smug, sobbing, sorpren, sorry, spit, stoked, stressed, stunned, success, sudden realization, surprise, suspicious, sweating, swoon, swooning, take notes, tantrum, tears, thank, think, thirsty, thumbs down, thumbs up, tired, too funny, touched, unamused, unbelievable, uncomfortable, unhappy, unimpressed, unsure, upset, vomit, waiting, wave, weary, weird, whatever, will, wince, wink, wrestling, yawn, yell, yes, yum

D List of filtering keywords on Imgur experiment

depression, depressing, mental, health

Acknowledgments

We thank the reviewers, area chairs, and senior area chairs for their thoughtful comments and feedback. We also thank the Blablablab for helpful feedback, for letting us deploy PEPE to the group's Slack, and for putting up with the ridiculous gif replies, and Imgur for being a wonderful community. This material is based upon work supported by the National Science Foundation under Grant No. 2007251.