key: cord-020835-n9v5ln2i
authors: Jangra, Anubhav; Jatowt, Adam; Hasanuzzaman, Mohammad; Saha, Sriparna
title: Text-Image-Video Summary Generation Using Joint Integer Linear Programming
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_24
sha:
doc_id: 20835
cord_uid: n9v5ln2i

Automatically generating a summary for asynchronous data can help users to keep up with the rapid growth of multi-modal information on the Internet. However, current multi-modal systems usually generate summaries composed of text and images. In this paper, we propose a novel research problem of text-image-video summary generation (TIVS). We first develop a multi-modal dataset containing text documents, images and videos. We then propose a novel joint integer linear programming multi-modal summarization (JILP-MMS) framework. We report the performance of our model on the developed dataset.

Advancement in technology has led to rapid growth of multimedia data on the Internet, which prevents users from obtaining important information efficiently. Summarization can help tackle this problem by distilling the most significant information from the plethora of available content. Recent research in summarization [2, 11, 31] has shown that multi-modal data can improve the quality of a summary in comparison to uni-modal summaries. Multi-modal information can help users gain deeper insights. Including supportive representations of text can reach a larger set of people, including those who have reading disabilities, users who are less proficient in the language of the text, and skilled readers who want to skim the information quickly [26]. Although visual representation of information is more expressive and comprehensive than a textual description of the same information, it is still not a thorough model of representation. Encoding abstract concepts like guilt or freedom [11], geographical locations, or environmental features like temperature and humidity via images is impractical. Images are also a static medium and cannot represent dynamic and sequential information efficiently. Including videos could help overcome these barriers, since video contains both visual and verbal information.

To the best of our knowledge, all previous works have focused on creating text or text-image summaries, and the task of generating an extractive multi-modal output containing text, images and videos from a multi-modal input has not been addressed before. We thus focus on a novel research problem of text-image-video summary generation (TIVS).

To tackle the TIVS task, we design a novel Integer Linear Programming (ILP) framework that extracts the most relevant information from the multi-modal input. We set up three objectives for this task: (1) salience within modality, (2) diversity within modality and (3) correspondence across modalities. For preprocessing the input, we convert the audio into text using an Automatic Speech Recognition (ASR) system, and we extract the key-frames from the videos. The most relevant images and videos are then selected in accordance with the output generated by our ILP model.

To sum up, we make the following contributions: (1) We present a novel multi-modal summarization task which takes news with images and videos as input, and outputs text, images and video as summary. (2) We create an extension of the multi-modal summarization dataset [12] by constructing multi-modal references containing text, images and video for each topic.
(3) We design a joint ILP framework to address the proposed multi-modal summarization task.

Text summarization techniques are used to extract important information from textual data. A lot of research has been done in the area of extractive [10, 21] and abstractive [3, 4, 19, 23] summarization. Various techniques such as graph-based methods [6, 15, 16], artificial neural networks [22] and deep learning based approaches [18, 20, 29] have been developed for text summarization. Integer linear programming (ILP) has also shown promising results in extractive document summarization [1, 9]. Duan et al. [5] proposed a joint-ILP framework that produces summaries from temporally separate text documents. Recent years have shown great promise in the emerging field of multi-modal summarization. Multi-modal summarization has various applications, ranging from meeting recordings summarization [7], sports video summarization [25] and movie summarization [8] to tutorial summarization [13]. Video summarization [17, 28, 30] is also a major sub-domain of multi-modal summarization. A few deep learning frameworks [2, 11, 31] show promising results, too. Li et al. [12] use an asynchronous dataset containing text, images and videos to generate a textual summary. Although some work on document summarization has been done using ILP, to the best of our knowledge no one has used an ILP framework in the area of multi-modal summarization.

Our objective is to generate a multi-modal summary S = {X_sum, I_sum, V_sum}, where X_sum, I_sum and V_sum denote the textual, image and video components of the summary, such that S covers all the important information in the original data while minimizing the length of the summary.

Each topic in our dataset comprises text documents, images, audio and videos. As shown in Fig. 1, we first extract key-frames from the videos [32]. These key-frames, together with the images from the original data, form the image-set. The audio is transcribed into text (using the IBM Watson Speech-to-Text Service: www.ibm.com/watson/developercloud/speech-to-text.html), which contributes to the text-set together with the sentences from the text documents. The images from the image-set are encoded by the VGG model [24], and the 4,096-dimensional vector from the pre-softmax layer is used as the image representation. Every sentence from the text-set is encoded using the Hybrid Gaussian-Laplacian Mixture Model (HGLMM) into a 6,000-dimensional vector. For text-image matching, these image and sentence vectors are fed into a two-branch neural network [27] to obtain 512-dimensional vectors for images and sentences in a shared space.

ILP is a global optimization technique used to maximize or minimize an objective function subject to some constraints. In this paper, we propose a joint-ILP technique to optimize the output to have high salience, diversity and cross-modal correlation. The idea of joint-ILP is similar to the one applied in the field of across-time comparative summarization [5]. However, to the best of our knowledge, an ILP framework has not been used before to solve multi-modal summarization (the Gurobi optimizer is used for ILP optimization: https://www.gurobi.com/).

Decision Variables. M^txt is an n × n binary matrix such that m^txt_{i,i} indicates whether sentence s_i is selected as an exemplar, and m^txt_{i,j} (j ≠ i) indicates whether sentence s_i votes for s_j as its representative. Similarly, M^img is a p × p binary matrix that indicates the exemplars chosen in the image-set. M^c is an n × p binary matrix that indicates the cross-modal correlation: m^c_{i,j} is set to 1 when there is some correlation between sentence s_i and image I_j. In the following, ⟨mod, t, item⟩ ∈ {⟨txt, n, s⟩, ⟨img, p, I⟩} is used to represent both modalities in a compact way.

We maximize the objective function in Eq. 1, which combines the salience of text, the salience of images and the cross-modal correlation. Similar to the joint-ILP formulation in [5], the diversity objective is implicit in this model. Equation 4 generates the set of entities that are part of the cluster whose exemplar is item_i. The salience is calculated by Eqs. 2 and 3 by taking the cosine similarity of each exemplar with the items belonging to its cluster, separately for each modality. The cross-modal correlation score is calculated in Eq. 5. Equation 7 ensures that exactly k_txt and k_img clusters are formed in their respective uni-modal vector spaces. Equation 8 guarantees that an entity is either an exemplar or part of exactly one cluster. According to Eq. 9, a sentence or image must be an exemplar in its respective vector space to be included in the sentence-image summary pairs. The values of m, k_txt and k_img are set to 10, the same as in [5].
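To make the formulation more concrete, the sketch below sets up a joint ILP of this kind with gurobipy (the solver named above): binary matrices M^txt, M^img and M^c, salience terms built from precomputed cosine-similarity matrices, and cluster/exemplar constraints corresponding to the roles of Eqs. 7-9. Since Eqs. 1-9 are only described in prose here, the objective weights, the constraint that exactly m cross-modal pairs are kept, and all function and variable names are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal joint-ILP sketch with gurobipy; sim_* are precomputed cosine-similarity
# matrices (numpy arrays). This is an illustration, not the paper's exact model.
import gurobipy as gp
from gurobipy import GRB

def joint_ilp(sim_txt, sim_img, sim_cross, k_txt=10, k_img=10, m=10):
    """sim_txt: n x n, sim_img: p x p, sim_cross: n x p cosine similarities."""
    n, p = sim_cross.shape
    model = gp.Model("joint_mm_summarization")

    # Binary decision matrices: diagonal entries mark exemplars,
    # off-diagonal entries mark cluster membership (votes).
    M_txt = model.addVars(n, n, vtype=GRB.BINARY, name="M_txt")
    M_img = model.addVars(p, p, vtype=GRB.BINARY, name="M_img")
    M_c = model.addVars(n, p, vtype=GRB.BINARY, name="M_c")

    # Salience within each modality plus cross-modal correlation (cf. Eqs. 1-5);
    # equal weighting of the three terms is an assumption.
    salience_txt = gp.quicksum(float(sim_txt[i, j]) * M_txt[i, j]
                               for i in range(n) for j in range(n) if i != j)
    salience_img = gp.quicksum(float(sim_img[i, j]) * M_img[i, j]
                               for i in range(p) for j in range(p) if i != j)
    cross = gp.quicksum(float(sim_cross[i, j]) * M_c[i, j]
                        for i in range(n) for j in range(p))
    model.setObjective(salience_txt + salience_img + cross, GRB.MAXIMIZE)

    # Exactly k_txt / k_img exemplars (clusters) per modality (cf. Eq. 7).
    model.addConstr(gp.quicksum(M_txt[i, i] for i in range(n)) == k_txt)
    model.addConstr(gp.quicksum(M_img[i, i] for i in range(p)) == k_img)

    # Each item is either an exemplar or votes for exactly one exemplar (cf. Eq. 8),
    # and it may only vote for an item that is itself an exemplar.
    for i in range(n):
        model.addConstr(gp.quicksum(M_txt[i, j] for j in range(n)) == 1)
        for j in range(n):
            if i != j:
                model.addConstr(M_txt[i, j] <= M_txt[j, j])
    for i in range(p):
        model.addConstr(gp.quicksum(M_img[i, j] for j in range(p)) == 1)
        for j in range(p):
            if i != j:
                model.addConstr(M_img[i, j] <= M_img[j, j])

    # Cross-modal pairs may only link exemplars (cf. Eq. 9); keeping exactly
    # m such pairs is an assumption (the paper sets m = 10).
    for i in range(n):
        for j in range(p):
            model.addConstr(M_c[i, j] <= M_txt[i, i])
            model.addConstr(M_c[i, j] <= M_img[j, j])
    model.addConstr(gp.quicksum(M_c[i, j] for i in range(n)
                                for j in range(p)) == m)

    model.optimize()
    exemplar_sents = [i for i in range(n) if M_txt[i, i].X > 0.5]
    exemplar_imgs = [j for j in range(p) if M_img[j, j].X > 0.5]
    return exemplar_sents, exemplar_imgs
```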
The Joint-ILP framework outputs the text summary (X_sum) and the top-m images from the image-set. This output is used to prepare the image and video summaries. Equation 11 selects those images among the top-10 images that are not key-frames, forming I_sum1. Assuming that images which look similar would have similar annotation scores and would help users gain more insight, images that are relevant to the images in I_sum1 (at least α cosine similarity) but not too similar to them (at most β cosine similarity, to avoid redundancy) are also selected to be part of the final image summary I_sum (Eq. 12). α is set to 0.4 and β to 0.8 in our experiments.

Extracting Video. For each video, a weighted sum of its visual (Eq. 13) and verbal (Eq. 14) scores is computed, where KF is the set of all key-frames and ST is the set of speech transcriptions. The video with the highest score is selected as the video summary.
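Since Eqs. 11-14 are referenced above but not reproduced, the following numpy sketch shows one plausible reading of this post-processing step: filtering the top-ranked images with the α/β cosine-similarity thresholds, and scoring each video as a weighted sum of visual and verbal relevance. The helper names, the use of max-similarity aggregation and the equal weights are assumptions for illustration.

```python
# Illustrative post-ILP selection sketch (numpy); the concrete forms of
# Eqs. 11-14 are not shown here, so the scoring below is an assumption.
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_image_summary(top_imgs, key_frames, all_imgs, vec, alpha=0.4, beta=0.8):
    """top_imgs: ids chosen by the ILP; key_frames: ids extracted from videos;
    all_imgs: all candidate image ids; vec: id -> joint-space embedding."""
    # Keep only the top-ranked images that are not video key-frames (cf. Eq. 11).
    i_sum1 = [i for i in top_imgs if i not in key_frames]
    i_sum = list(i_sum1)
    # Add images that are related (>= alpha) but not redundant (<= beta) (cf. Eq. 12).
    for cand in all_imgs:
        if cand in i_sum:
            continue
        sims = [cos(vec[cand], vec[ref]) for ref in i_sum1]
        if sims and alpha <= max(sims) <= beta:
            i_sum.append(cand)
    return i_sum

def score_video(video_key_frames, video_transcript_sents, summary_imgs,
                summary_sents, vec, w_visual=0.5, w_verbal=0.5):
    """Weighted sum of visual and verbal relevance to the generated summary
    (cf. Eqs. 13-14); equal weights and max-similarity are assumptions."""
    visual = 0.0
    if summary_imgs and video_key_frames:
        visual = float(np.mean([max(cos(vec[kf], vec[i]) for i in summary_imgs)
                                for kf in video_key_frames]))
    verbal = 0.0
    if summary_sents and video_transcript_sents:
        verbal = float(np.mean([max(cos(vec[st], vec[s]) for s in summary_sents)
                                for st in video_transcript_sents]))
    return w_visual * visual + w_verbal * verbal
```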
Dataset. There is no benchmark dataset for the TIVS task. Therefore, we created our own text-image-video dataset by extending and manually annotating the multi-modal summarization dataset introduced by Li et al. [12]. Their dataset comprised 25 news topics. Each topic was composed of 20 text documents, 3 to 9 images, and 3 to 8 videos. The final summary, however, was uni-modal, that is, in the form of only a textual summary containing around 300 words. We extended it by selecting some images and a video for each topic that summarize the topic well. Three undergraduate students were employed to score the images and videos with respect to the benchmark text references. All annotators scored each image and video on a scale of 1 to 5 on the basis of the similarity between the image/video and the text references (1 indicating no similarity and 5 denoting the highest level of similarity). Average annotation scores (AAS) were calculated for each image and video. The minimum average annotation score for images is kept as a hyper-parameter to evaluate the performance of our model in various settings. The video with the highest score is chosen as the video component of the multi-modal summary.

Experiments. We evaluate the performance of our model on the dataset described above. We use ROUGE scores [14] to evaluate the textual summary, and based on them we compare our results with those of three baselines, all built on the multi-document summarization model proposed in [1]. For Baseline-1, we feed the model with the embedded sentences from all the original documents together; the central vector is calculated as the average of all the sentence vectors. For the other baselines, the model is given the vectors for sentences from the text-set and images from the image-set in the joint space. For Baseline-2, the average of all these vectors is taken as the central vector. For Baseline-3, the central vector is calculated as a weighted average of all the sentence and image vectors; we give equal weights to text, speech and images for simplicity.
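As an illustration of how the baselines' central vectors could be computed, a small numpy sketch follows. The function names and the interpretation of "equal weights" as averaging the per-modality means are assumptions, and the inputs are assumed to be matrices of embeddings (one vector per row) in the relevant space.

```python
# Sketch of the central vectors used by the baselines; shapes and weighting
# scheme are assumptions (vectors are assumed to lie in a shared space where
# they are combined).
import numpy as np

def central_vector_baseline1(sent_vecs):
    # Baseline-1: average over the embedded sentences of the original documents.
    return np.mean(sent_vecs, axis=0)

def central_vector_baseline2(sent_vecs, img_vecs):
    # Baseline-2: plain average over all sentence and image vectors in the joint space.
    return np.mean(np.vstack([sent_vecs, img_vecs]), axis=0)

def central_vector_baseline3(doc_sent_vecs, speech_sent_vecs, img_vecs):
    # Baseline-3: weighted average with equal weights for text, speech and images
    # (here interpreted as each modality mean contributing 1/3).
    modality_means = [np.mean(v, axis=0)
                      for v in (doc_sent_vecs, speech_sent_vecs, img_vecs)]
    return np.mean(modality_means, axis=0)
```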
As shown in Table 1, our model produces better results than the prepared baselines in terms of ROUGE-2 and ROUGE-L scores. Table 2 shows the average precision and recall scores as well as the variance. We set various threshold values for the annotation scores to generate multiple image test sets in order to evaluate the performance of our model. We get a higher precision score for low AAS values, because the number of images in the final solution increases as the threshold value decreases. The proposed model achieves 44% accuracy in extracting the most appropriate video (whereas random selection over 10 different iterations gives an average accuracy of 16%).

Unlike other work that focuses on text-image summarization, we propose to generate a truly multi-modal summary comprising text, images and video. We also develop a dataset for this task, and propose a novel joint ILP framework to tackle this problem.

References
1. Multi-document summarization model based on integer linear programming
2. Abstractive text-image summarization using multi-modal attentional hierarchical RNN
3. Fast abstractive summarization with reinforce-selected sentence rewriting
4. Abstractive sentence summarization with attentive recurrent neural networks
5. Across-time comparative summarization of news articles
6. LexRank: graph-based lexical centrality as salience in text summarization
7. Multimodal summarization of meeting recordings
8. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention
9. Extractive multi-document summarization with integer linear programming and support vector regression
10. A trainable document summarizer
11. Multi-modal sentence summarization with modality attention and image filtering
12. Multi-modal summarization for asynchronous collection of text, image, audio and video
13. Multimodal abstractive summarization for open-domain videos
14. ROUGE: a package for automatic evaluation of summaries
15. Graph-based ranking algorithms for sentence extraction, applied to text summarization
16. TextRank: bringing order into text
17. Streaming non-monotone submodular maximization: personalized video summarization on the fly
18. SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents
19. Abstractive text summarization using sequence-to-sequence RNNs and beyond
20. Classify or select: neural architectures for extractive document summarization
21. Constructing literature abstracts by computer: techniques and prospects
22. Extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowl.-Based Syst.
23. Get to the point: summarization with pointer-generator networks
24. Very deep convolutional networks for large-scale image recognition
25. Multi-modal summarization of key events and top players in sports tournament videos
26. Multimodal summarization of complex sentences
27. Learning deep structure-preserving image-text embeddings
28. Video summarization via semantic attended networks
29. Multiview convolutional neural networks for multidocument extractive summarization
30. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward
31. MSMO: multimodal summarization with multimodal output
32. Adaptive key frame extraction using unsupervised clustering