title: Attend and Select: A Segment Attention based Selection Mechanism for Microblog Hashtag Generation
authors: Mao, Qianren; Li, Xi; Peng, Hao; Liu, Bang; Guo, Shu; Li, Jianxin; Wang, Lihong; Yu, Philip S.
date: 2021-06-06

Abstract: Automatic microblog hashtag generation can help us understand and process the critical content of microblog posts better and faster. Conventional sequence-to-sequence generation methods can produce phrase-level hashtags and have achieved remarkable performance on this task. However, they are incapable of filtering out secondary information and are not good at capturing the discontinuous semantics among crucial tokens. A hashtag is formed by tokens or phrases that may originate from various fragmentary segments of the original text. In this work, we propose an end-to-end Transformer-based generation model which consists of three phases: encoding, segments-selection, and decoding. The model transforms discontinuous semantic segments from the source text into a sequence of hashtags. Specifically, we introduce a novel Segments Selection Mechanism (SSM) for Transformer to obtain segmental representations tailored to phrase-level hashtag generation. Besides, we introduce two large-scale hashtag generation datasets, newly collected from Chinese Weibo and English Twitter. Extensive evaluations on the two datasets reveal our approach's superiority, with significant improvements over extraction and generation baselines. The code and datasets are available at https://github.com/OpenSUM/HashtagGen.

Microblogging has become one of the most popular media, with hundreds of millions of user-generated posts through which users present, spread, and obtain information. However, low-quality, irregular hashtags hamper a social platform's acquisition and management of information. Moreover, many posts in the massive stream of microblogs lack user-provided hashtags; for example, fewer than 15% of tweets contain at least one hashtag (Wang et al., 2011; Wang et al., 2019b). There is an increasing demand for managing such large-scale microblog content, and topical hashtagging is an effective means of information retrieval and content management. Existing hashtag systems mainly rely on manual editing and have many problems. Firstly, human annotation is time-consuming and costly. Secondly, manually constructed tags may be intentionally misleading or inconsistent with the semantics that the post's text conveys.

Hashtag generation aims to summarize the main ideas of a microblog post and annotate the post with short, informal topical tags. Most previous works focus on keyphrase or keyword extraction methods (Godin et al., 2013; Gong et al., 2015) and hashtag classification methods (Weston et al., 2014; Zhang et al., 2017) over given tag catalogs. However, extraction-based approaches fail to generate keyphrases that do not appear in the source document, even though such keyphrases are frequently produced by human annotators, as the two cases shown in Figure 1 illustrate. It is also difficult for keyword extraction methods to compose readable phrase-level hashtags: they may lose semantic coherence when the keywords appear in the post in a slightly different sequential order. Hashtag classification methods (Gong et al., 2015; Zhang et al., 2017) cannot produce a hashtag that is not in the candidate catalog list.
In reality, a vast variety of hashtags is created daily, making it impossible for a fixed candidate list to cover them. Another line of prior research employs sequence-to-sequence generation for phrase generation (Meng et al., 2017; Ye and Wang, 2018; Chen et al., 2018; Chen et al., 2019).

Figure 1: Illustration of two Twitter posts with their hashtags. Keywords or keyphrases (in bold blue) are distributed across several segments. Each segment is highlighted with a dashed line, and the segment length is fixed to 5 tokens in the two cases.

It is worth mentioning that the latest works (Wang et al., 2019b; Wang et al., 2019a) have obtained state-of-the-art performance on existing small microblog hashtag generation datasets. Such methods suffer from the long-term semantic vanishing inherent in recurrent neural networks. Hence, they are incapable of capturing the discontinuous semantics among crucial tokens, and they are also not good at distilling critical information. It can be observed from Figure 1, where the segment length is fixed to 5 tokens in both cases, that critical tokens from different segments are usually discontinuous. Specifically,

• Different keywords arranged in a hashtag may originate from various segments. Those segments highlighted with dashed lines reflect the primary semantics of their hashtags.

• There usually exist discontinuous semantic dependencies among these crucial phrases. For example, in Case B, 'This new advancements' refers to '5G'.

To solve the issues mentioned above, we propose an end-to-end generative method, Segments Selection based deep Transformer (SEGTRM), which selects sequential tokens of a fixed length (a segment) and then uses these segmental tokens for generation. In particular, we introduce a novel segment attention based selection mechanism (SSM) to attend to and select key contextual content. We prepend an [S] token at the front of the text and use it to obtain global textual representations. We insert multiple [SEG] tokens into the sequential text to split it into segments of a fixed length and use these [SEG] tokens to obtain local segmental representations (sketched below). The SSM first computes a similarity score between the global textual representation of [S] and the multiple local segmental representations of [SEG]. Then, the top-k sorted segmental representations are selected as inputs to the decoder. We propose two kinds of selection mechanisms (i.e., soft-based and hard-based) to select dominant textual representations for further feeding into the decoder. In the soft-based SSM, the selected targets are the multiple segmental [SEG] tokens together with their collateral textual tokens; in the hard-based SSM, the selected targets are the multiple [SEG] tokens themselves. The hard-based SSM is ultimately a hierarchical way (more hierarchical than the soft-based SSM) to model the compositionality of segmental representations. Both mechanisms address the problem of discontinuous semantic dependencies among dominant information.

To summarize, our main contributions include:

• We propose a segments-selection mechanism based on the Transformer generation architecture. The method benefits from contextual segment modeling for the selection of crucial semantic segments.

• We propose a soft-based and a hard-based selection mechanism modeling different textual granularities to attend to and select crucial tokens. The method filters out secondary information under different granularity interactions among tokens, segments, and text.

• Our proposed model achieves superior performance over strong baselines on two newly constructed large-scale datasets. Notably, we obtain absolute improvements for both Chinese Weibo and English Twitter hashtag generation.
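To make this construction concrete, the following minimal Python sketch (our own illustration, not the authors' released code) splits a whitespace-tokenized post into fixed-length segments with the special tokens [S] and [SEG]; a real pipeline would operate on tokenizer output instead.

def segment_post(tokens, seg_len=5):
    # Prepend the global [S] token, then insert a [SEG] marker in front
    # of every fixed-length segment of the post (the last may be shorter).
    pieces = ["[S]"]
    for i in range(0, len(tokens), seg_len):
        pieces.append("[SEG]")
        pieces.extend(tokens[i:i + seg_len])
    return pieces

print(segment_post("This new advancements will ensure 5G is useful".split()))
# ['[S]', '[SEG]', 'This', 'new', 'advancements', 'will', 'ensure',
#  '[SEG]', '5G', 'is', 'useful']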
2 Related Works for Automatic Hashtag Generation

The hashtag generation task is a branch of the keyphrase extraction task (Meng et al., 2017; Çano and Bojar, 2019; Chen et al., 2019; Swaminathan et al., 2020). There are also differences between the two tasks: the former summarizes short microblogs in Social Networking Services (SNS), whereas keyphrase generation selects phrases from a news document. Keyphrase generation mainly produces multiple discontinuous words or phrases. In contrast, a hashtag can be a keyword or keyphrase, and it can also be a phrase-level short text that describes the main ideas of a short microblog. Thus, a hashtag generation system should rephrase or paraphrase tokens during generation.

Most early works on hashtag generation focus on extracting phrases from source texts or selecting pre-defined candidates (Zhang et al., 2017; Javari et al., 2020). However, hashtags usually appear in neither the target posts nor the given candidate list. Wang et al. (2019b; 2019a) are the first to approach hashtag generation with a generation framework. In doing so, phrase-level hashtags beyond the target posts or the given candidates can be created. Wang et al. (2019b) realize hashtag generation with a topic-aware generation model that leverages latent topics to enhance valuable features, and Wang et al. (2019a) propose to jointly model the target posts and the conversation contexts with bidirectional attention. However, their works require massive external conversation snippets or relevant tweets for modeling, and the generated results are directly affected by noisy conversations or other tweets. In reality, such external text does not necessarily exist, and annotating it is costly. In addition, the datasets they released have disadvantages such as small scale, sparse distribution, and insufficient domain coverage.

Problem Formulation. We define microblog hashtag generation as follows: given a microblog post, automatically generate a sequence consisting of condensed topical hashtags, with hashtags separated by the separator '#'. The task can be regarded as a subclass of text generation; however, the hashtag generation system must learn the mapping from one post to multiple target hashtags.

Model Architecture. As shown in Figure 2, the segments-selector (introduced in detail in Section 3.2) selects multiple segments and recombines them into a new sequence as the decoder's input to generate hashtags in an end-to-end way. To ensure batch processing and dimension alignment, we fix each segment's length. We also initialize segment embeddings to differentiate segments. To simultaneously predict multiple hashtags and determine a suitable number of hashtags in the generator, we follow the settings of Yuan et al. (2020) by adopting a sequential decoding method that generates one sequence consisting of multiple targets and separators: we insert multiple '#' tokens as separators, and during generation the decoder stops predicting when it encounters the terminator [SEP]. A sketch of this target format is given below.
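As a concrete illustration of this target format, the hypothetical helpers below (our own sketch, not the released code) join gold hashtags into a single training target and split a decoded sequence back into hashtags at inference time.

def build_target(hashtags):
    # e.g. ['G20 Italy', 'U20'] -> '# G20 Italy # U20 [SEP]'
    return " ".join("# " + h for h in hashtags) + " [SEP]"

def split_prediction(sequence):
    # Strip the terminator, then split on the '#' separators.
    body = sequence.replace("[SEP]", "")
    return [h.strip() for h in body.split("#") if h.strip()]

assert build_target(["G20 Italy", "U20"]) == "# G20 Italy # U20 [SEP]"
assert split_prediction("# G20 Italy # U20 [SEP]") == ["G20 Italy", "U20"]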
Segmentation for representing text at different granularities has been successfully used in language models (LMs) such as BERT (Devlin et al., 2019; Clark et al., 2019). However, the segment embeddings of such LMs are used to distinguish different sentences in natural language inference tasks. Unlike segmentation in BERT, we aim to represent different segmental sequences by inserting multiple special [SEG] tokens. A visualization of this construction can be seen in Figure 2. We assign interval segment embeddings [E_A, E_B, E_C, ..., E_K] to differentiate multiple segments. Each token's embedding is the sum of its initial token embedding, position embedding, and segment embedding, so the input I_X of a post text can be written (reconstructed from these definitions) as I_X = E_tok(X) + E_pos(X) + E_seg(X).

Our SEGTRM encoder (termed SgT) is equipped with three kinds of attention mechanisms over different textual granularities: text, segment, and token, as shown in Figure 3. In SgT, to let [SEG] learn local semantic representations, we assign a mask vector to each token based on the fixed segment length. Taking Figure 3(b) as an example, the mask vector of the first segment is [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]. After obtaining the mask vector for each token in a specific segment, we stack the vectors to form an n × n mask matrix M_sgi and calculate the local i-th segment attention with the equation below, written in one-head form for simplicity (reconstructed as standard masked self-attention, with positions where the mask is 0 suppressed before the softmax):

Attn_sgi(Q, K, V) = softmax(QK^T / √d_k + M̃_sgi) V,

where Q refers to the query matrix, K the key matrix, V the value matrix, √d_k a scaling factor, and M̃_sgi is derived from M_sgi by setting masked-out positions (mask value 0) to −∞ and in-segment positions (mask value 1) to 0. The text attention and token attention are the same as the multi-head self-attention in the vanilla Transformer. In the encoder SgT, textual representations are learned hierarchically, H_X = SgT(E_X). In other words, the model is aware of the hierarchical structure among different textual granularities: lower SgT layers represent adjacent segments, while higher layers obtain contextual multi-segment representations.

The upper layer of the encoder is a segments-selection block used to select discontinuous tokens. Given the global textual representation H_[S] and the i-th local segmental representation H^i_[SEG], the similarity weight s_i is calculated by a similarity score function f. There are multiple choices for the score function f in our model; we adopt four commonly used similarity metrics that can serve as f (referred to as #CS, #ES, #MhtS, and #MasS in the experiments). After calculating the similarity scores with f, the top-k segmental representations H^i_[SEG] with the highest similarity scores are selected:

τ_k = top-k([s_i]_1^K),

where τ_k is the set of top-k results over the sequence of segmental representations and [s_i]_1^K is the set of all similarity scores. Then, the corresponding segmental tokens are collected to form a new sequence of hidden representations H_X^s.

As the two aforementioned selection methods model segmental compositionality differently, the soft segment selection method selects the top-k [SEG] representations and their collateral textual tokens: as Figure 4(a) shows, the selected sequence appends to each selected H^i_[SEG] its collateral token representations H_sgi, i.e., the sequence of hidden states H^i_x1, H^i_x2, ..., H^i_xn. The hard segment selection method selects only the top-k [SEG] representations themselves, as shown in Figure 4(b). The hard-based SSM does not select any original word as the decoder's input and purely models segmental compositionality. The selected pieces are then fed into the decoder for hashtag generation. A simplified sketch of the masking and selection steps is given below.
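The following numpy sketch is our own simplified reading of the masking and hard selection steps: it ignores the special [S]/[SEG] positions inside the sequence, uses cosine similarity as one example choice of f, and substitutes -1e9 for -infinity in the mask; shapes and details are assumptions rather than the released implementation.

import numpy as np

def segment_mask(n_tokens, seg_len):
    # M[i, j] = 1 iff tokens i and j fall inside the same fixed-length segment.
    seg_ids = np.arange(n_tokens) // seg_len
    return (seg_ids[:, None] == seg_ids[None, :]).astype(float)

def masked_attention(Q, K, V, M):
    # One-head scaled dot-product attention restricted by the 0/1 mask M:
    # positions with M == 0 are pushed to -1e9 (a stand-in for -inf).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(M > 0, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def hard_select(h_s, h_segs, k):
    # Hard-based SSM with cosine similarity as f: score each [SEG] vector
    # against the global [S] vector and keep the k best, in original order.
    sims = h_segs @ h_s / (np.linalg.norm(h_segs, axis=1) * np.linalg.norm(h_s) + 1e-9)
    top = np.sort(np.argsort(-sims)[:k])
    return h_segs[top]

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 8))                       # 10 tokens, hidden size 8
out = masked_attention(H, H, H, segment_mask(10, seg_len=5))
sel = hard_select(rng.normal(size=8), rng.normal(size=(4, 8)), k=2)
print(out.shape, sel.shape)                        # (10, 8) (2, 8)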
Inspired by the sequential decoding method (Yuan et al., 2020), we insert separators '#' between the targets and a [SEP] at the end of the sequence. In doing so, the method can simultaneously predict hashtags and determine a suitable number of hashtags; the multiple hashtags are obtained after splitting the sequence by the separators.

We implement 12 deep layers in both the encoder and decoder. The embedding size and hidden size of both encoder and decoder are set to 768, and the number of self-attention heads is 12. We use the cross-entropy loss to train the models. The optimizer is Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, ε = 1e-6, and L2 weight decay. The dropout probability is set to 0.1 in all layers. Following OpenAI GPT and BERT, we use a GELU activation, which performs better than the standard ReLU. Gradient clipping is applied with range [-1, 1] in the encoder and decoder. We implement a linear warmup with a ratio (Howard and Ruder, 2018) of 32; the ratio specifies how much smaller the lowest learning rate is compared with the maximum one. The proportion of warmup steps is 0.04 on the WHG and THG datasets; a sketch of this schedule is given below. We use the LTP Tokenizer and RoBERTa's FullTokenizer (Devlin et al., 2019) for preprocessing Chinese Weibo characters and English Twitter words, respectively. The maximum lengths of the input and output are set to the CovSourceLen and CovTargetLen values in Table 1. All models are trained on 4 GPUs. We select the best 3 checkpoints on the validation set and report the average results on the test set. The hyperparameter for the number of top-k selected segments is discussed in the experimental analysis (Section 5.2).
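As a hedged sketch of this schedule: the learning rate warms up linearly from a floor that is ratio times smaller than the peak over the first 4% of steps. The linear decay back to the floor afterwards is our assumption; the released code may handle the post-warmup phase differently.

def lr_at(step, total_steps, peak_lr=1e-4, warmup_prop=0.04, ratio=32):
    # Floor learning rate is `ratio` times smaller than the peak.
    warmup_steps = max(1, int(total_steps * warmup_prop))
    floor = peak_lr / ratio
    if step < warmup_steps:                    # linear warmup
        return floor + (peak_lr - floor) * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr - (peak_lr - floor) * frac  # assumed linear decay

print(lr_at(0, 100000), lr_at(4000, 100000), lr_at(100000, 100000))
# 3.125e-06 0.0001 3.125e-06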
The existing Twitter dataset (Wang et al., 2019b) is built on the TREC 2011 microblog track, and most of the tweets obtained by that tool are invalid. The existing Chinese dataset (Wang et al., 2019b) also has some obvious shortcomings. Firstly, it contains only 40,000 posts for hashtag generation, with a long-tail distribution in which 19.74% of the hashtags are composed of one word. Secondly, the lengths of 16.88% of the posts are less than 10 words, and 84.90% of the posts are shorter than 60 words, which is not consistent with Weibo's real-world data. Thirdly, we find that such a small dataset hardly lets a Transformer converge. In terms of content, entertainment-related text accounts for the majority of their dataset, which makes fine-tuning (whether based on language models or on our SEGTRM) hardly practical: there is an apparent semantic bias between the pre-training data (multiple domains) and their corpus (entertainment domain).

We construct two new large-scale datasets: the English Twitter hashtag generation (THG) dataset and the Chinese Weibo hashtag generation (WHG) dataset. The construction details are introduced in the Appendix. We use 312,762 post-hashtag pairs for training, 20,000 pairs for validation, and 20,000 pairs for testing. As shown in Table 1, the average number of hashtags per post is 4.1 and their total sequence length is about 10.1 tokens, so on average there are 2.5 tokens per hashtag. Twitter hashtags are much shorter than Weibo's hashtags.

The implemented baselines can be classified into three types: keyword extraction methods, neural selective encoding generators, and Transformer-based generators. Ext.TFIDF is an extraction method: we extract 3 keywords for WHG (2 keywords for THG), according to the average number of tokens in a hashtag, and concatenate them following the tokens' order in the text. We also introduce an exemplary keyphrase generation method, ExHiRD, which is augmented with a GRU-based hierarchical decoding framework. Selective encoding models are neural generation methods based on the selection of key information pieces. We choose two content selectors: SEASS (Zhou et al., 2017) and BOTTOMUP (Gehrmann et al., 2018). The content selector of SEASS is based on selective attention, and its framework is an LSTM-based sequence-to-sequence model. The selector of BOTTOMUP is a Transformer-based model augmented with bottom-up selective attention; BOTTOMUP determines which phrases in the source text should be part of the output before generation.

We use the official ROUGE script (version 0.3.1) as our evaluation metric. We report ROUGE F1 to measure the degree of overlap between the generated sequence of hashtags and the reference sequence, including unigram, bigram, and longest common subsequence. We choose ROUGE as our evaluation for two reasons. Firstly, the task aims to generate sequential hashtags, and ROUGE is a prevailing evaluation metric for generation tasks. Secondly, we find that multiple hashtags help reflect the relevance of the target post to the hashtag: although hashtags such as '#farmers', '#market', and '#organic farmers' are not the same as the reference one ('#organic farmers market'), they are usable. The n-gram overlaps of ROUGE will not miss such highly usable hashtags, but F1@K will, since it can only credit hashtags that are identical to the reference. To measure correct tokens which are not identical to reference tokens but are copied from the source text, we also test the n-gram overlaps between the generated text and the source text; this evaluation identifies the extraction ability of the models. In addition, we use the F1@k evaluation (a popular information retrieval metric) to verify the ability of our model to normalize a single hashtag. F1@k compares all the predicted hashtags with the ground-truth hashtags; a minimal sketch of this computation is given below. Beam search is utilized for inference, and the top-k hashtag sequences are leveraged for producing the final hashtags. Here we use a beam size of 20 and k = 10. Since our model can generate multiple hashtags (separated by '#') for a document, the final F1@k is tested over multiple hashtags.
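For clarity, here is a minimal per-post sketch of F1@k as we read it, using exact string match after lowercasing; the paper does not spell out its matching and corpus-level averaging rules, so these details are assumptions.

def f1_at_k(predicted, gold, k=10):
    # Compare the top-k predicted hashtags against the ground-truth set.
    preds = [p.lower().strip() for p in predicted[:k]]
    golds = {g.lower().strip() for g in gold}
    correct = sum(1 for p in preds if p in golds)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(preds), correct / len(golds)
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(["farmers market", "organic farmers market"],
              ["organic farmers market"]))        # 0.666...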
As shown in Table 2, we report results on the Weibo and Twitter hashtag generation datasets separately. Our SEGTRM (soft) consistently achieves the best performance on both datasets, and its hard-based version is superior to the Softbase model. However, hard-based selection models are invalidated on the Twitter dataset. The reason may be that Twitter posts are too short: the average length of the original text is 23 tokens, far less than the 75 of Weibo posts, so in a Twitter post each segment can attend to less content, which is not conducive to generating multiple hashtags. According to the F1@k scores, we find it is difficult to normalize hashtags on Twitter, probably because hashtags are often low-frequency words or abbreviations; those rare hashtags (less frequent in the distribution) are not fully trained, resulting in low F1 scores for English Twitter. Moreover, the models' ROUGE-2 performance on the THG dataset is worse than on the WHG dataset. This may be because the English data contains massive numbers of abbreviated and single-word hashtags, which causes insufficient training of those low-frequency hashtags.

Comparison with keyword generation methods. Our SEGTRM (soft) obtains significant improvements on most of the metrics (paired t-test, p < 0.05) compared with the keyword extraction method TFIDF and the keyphrase generation method ExHiRD. We conclude that keyword extraction methods hardly adapt to large-scale datasets since they cannot reorganize words appropriately. Besides, ExHiRD has inherent defects, such as insufficient long-term sequence dependency. Another serious drawback is that these models struggle to generate phrase-level Weibo hashtags.

Comparison with selective encoding systems. Our SEGTRM, with hard or soft SSM, is superior to the two selective encoding models, SEASS and BOTTOMUP. Whether in ROUGE or F1@K, the performance of our selective model is consistently better than SEASS and BOTTOMUP by a certain margin. SEASS fails because of its long-term semantic dependence, and BOTTOMUP fails because of its complex joint optimization of the two objectives of word selection and generation.

Comparison with salient Transformer generators. Among the Transformer-based models, our SEGTRM is superior to the selective encoding model BOTTOMUP and the salient TRANSABS. The superiority of our model can be attributed to the explicit selection of dominant pieces and the modeling of segmental compositionality.

Performance of SSMs. Firstly, the ablation results (without any SSM) in Table 2 compare the base model (SEGTRM Softbase) with soft SEGTRM and indicate the superiority of using SSM. Secondly, the results of different SSMs can be seen in Table 3. To simplify the description, we use '#SSM' to denote a method. Among the results of hashtag generation on the WHG dataset, #MhtS, #CS, and #MasS consistently obtain superior performance. #ES always gets the worst performance, whether for hard or soft segment selection, which indicates that a poor segment selection produces adverse input to the decoder (e.g., it does not pick out critical information), leading to worse generation. For Twitter hashtag generation, the #CS, #MasS, and #MhtS models outperform the corresponding Softbase or Hardbase. #MhtS obtains the best performance among hard-based SSM models, and #MasS is the best among soft-based SSM models. The loss curves in Figure 5(e) and Figure 5(f), together with the ROUGE F1 results, also indicate that hashtags are mostly assembled from scattered semantic pieces, and that attending to those key segments filters out unnecessary information and stabilizes performance.

To test the extraction ability of our systems, we compare the n-gram overlaps of our models; a sketch of this overlap computation is given below. In Figure 7, the hashtags generated by #MhtS overlap the post text more often than those of the other selection methods on the WHG dataset: 59.42% of generated 1-grams are duplicated from the posts' 1-grams, as shown in Figure 7(a). For the proportion of 2-gram overlaps, shown in Figure 7(b), #MhtS is almost identical to the golden hashtags, with only tiny differences. The results indicate that our model can duplicate tokens from the source text while retaining accuracy; our segment selection mechanism makes the system more reliable at reorganizing key details correctly. Almost all soft SSMs, except #ES, rewrite substantially more abstractive hashtags than our base model, which has no segment selection mechanism. Our segment selection model allows the network to copy words from the source text while simultaneously consulting the language model to draw words from the vocabulary, enabling operations like truncation and stitching to be performed accurately.
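The overlap statistic can be sketched as follows (our own illustration; whitespace tokenization is an assumption): the fraction of the generated sequence's n-grams that also occur in the source post.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(generated, source, n=1):
    # Fraction of the generated text's n-grams found in the source post.
    gen = ngrams(generated.split(), n)
    src = ngrams(source.split(), n)
    return len(gen & src) / len(gen) if gen else 0.0

post = "the farmers market features organic and farm fresh fruits"
print(overlap_ratio("organic farmers market", post, n=1))   # 1.0
print(overlap_ratio("organic farmers market", post, n=2))   # 0.5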
To generate microblog hashtags automatically and effectively on large-scale data, we propose a semantic-fragmentation-based selection mechanism within a deep Transformer architecture. The experimental results on two newly constructed large-scale datasets indicate that our model achieves state-of-the-art results with significant improvements. There also exist some known limitations of our framework for future improvement. SEGTRM is an end-to-end method relying on a large-scale corpus; such a corpus makes hashtag classification inapplicable, since it is challenging to unify the classification labels. Due to the foreseeable workload (e.g., indefinite clause-based generation), we will also explore a variable-length segmentation scheme in future work.

Seed accounts such as People's Daily, People.cn, Economic Observer Press, Xinlang Sports, and other accounts with more than 5 million followers come from different domains of politics, economics, military, sports, etc. The post text and hashtag pairs are filtered, cleaned, and extracted with artificial rules. We remove pairs whose text is too short (less than 60 characters); these account for only a small part of all the data. Statistics of the WHG dataset are shown in Table 4: about 10.32% of the hashtags contain new words that do not appear in the post text, and 61.63% of the hashtags have words that appear in three or more different segments. At most, 15 segments contain words from a hashtag.

THG construction: We use TweetDeck to obtain and filter tweets. We collect 200 seed accounts, such as organizations, media, and other official users, to obtain high-quality tweets, and the Twitter post-hashtag pairs are then crawled from these seed users. The tokenization process is integrated into the training, evaluation, and testing steps. We use RoBERTa's FullTokenizer and vocabulary (Devlin et al., 2019). Table 4 shows that about 10.20% of the hashtags contain new words that do not appear in the posts, and about 28.41% of the hashtags consist of a single word or abbreviation. At most, 4 segments contain words from a hashtag. We employ 204,039 post-hashtag pairs for training in the THG dataset, with 11,335 and 11,336 pairs for validation and test, respectively.

We illustrate some hashtags generated by our implemented models for Chinese Weibo and English Twitter. As indicated by the examples in Table 5, all generated hashtags pinpoint the core meaning of the posts fluently. Hashtags are truncated to form shorter versions and are composed of discontinuous tokens. Comparing the generations of our models with the golden hashtags, we find that the two base models and hard-based SEGTRM generate usable hashtags (e.g., '#farmers', '#market', and '#organic farmers') whose tokens are duplicated from the source text. Although these hashtags are not the same as the golden ones (resulting in a low F1 value), they are usable. This case also illustrates why we choose ROUGE for evaluation: n-gram overlaps will not miss such highly usable hashtags. The hashtags generated by SEGTRM are almost entirely consistent with the golden ones. For English Twitter, it is not easy to generate hashtags that are abbreviations. For example, in case 2 there are two hashtags, where 'U20' refers to 'Urban 20' in the original text. Our SEGTRM directly selects the phrase 'Urban 20' from the original text as the generation result. This is an apparently correct hashtag, but it results in a comparatively low ROUGE score. This case again shows the necessity of using n-gram overlaps for evaluation, which do not penalize choosing a correct phrase from the original text as a hashtag. In the third case, a Weibo hashtag generation case, the Hardbase model generates 'BRICS in Yaolu island'. Although the two generated results are not identical to the golden one, they are all correct facts; for example, 'Yaolu', an island of 'Xiamen', is the specific location of the 'BRICS conference'. There is a wrong generation in the fourth case: the hashtags generated by the Hardbase of SEGTRM contain the unrelated term 'Juventus', which is a club in Italy.
This may be attributed to some retained low-frequency words that are hard to train adequately and to differentiate semantically.

The Twitter post for hashtag generation: We're re-opening our Helen Albert Certified Farmers' Market on Monday, September 14 from 9 AM to 2 PM with new safety measures in effect. The Farmers' Market features organic and farm fresh fruits and vegetables, baked goods, fresh fish, and more.
Golden: # organic farmers market
SEGTRM Softbase: # farmers market
SEGTRM Hardbase: # farmers # market
SEGTRM (hard): # organic farmers # market
SEGTRM (soft): # organic farmers market

The Twitter post for hashtag generation: An event organized by the Italian Presidency of G20, UNDP and UNEP, with the contribution of Urban 20 focused on multi-level governance aspects of Nature-based solutions in cities.
Golden: # G20 Italy # U20
SEGTRM Softbase: # G20 Italy # G20
SEGTRM Hardbase: # G20
SEGTRM (hard): # G20 Italy # Urban 20
SEGTRM (soft): # G20 Italy # Urban 20

The Weibo post for hashtag generation: 9月3日下午在厦门召开的金砖国家工商论坛开幕式上,国家主席习近平发表题为《共同开创金砖合作第二个"金色十年"》的主旨演讲。 (On the afternoon of September 3rd, at the opening ceremony of the BRICS Business Forum held in Xiamen, President Xi Jinping delivered a keynote speech entitled "Jointly Creating the Second 'Golden Decade' of BRICS Cooperation.")

The Weibo post for hashtag generation: 全场比赛结束,巴塞罗那主场5:0战胜西班牙人赢得本赛季首场同城德比,梅西上演帽子戏法,皮克、苏亚雷斯锦上添花,拉基蒂奇和阿尔巴分别贡献两次助攻,登贝莱首秀助攻苏亚雷斯。 (At the end of the match, Barcelona defeated Espanyol 5-0 at home to win the first same-city derby of this season. Messi staged a hat-trick, Pique and Suarez added the icing on the cake, Rakitic and Alba each contributed two assists, and Dembele assisted Suarez in his first match.)

Table 5: Four cases of generated hashtags for Weibo and Twitter posts. The cases in Chinese are carefully translated (shown in brackets) for the convenience of reading and comparison.

References

Keyphrase generation: A text summarization struggle
Keyphrase generation with correlation constraints
Title-guided encoding for keyphrase generation
Exclusive hierarchical decoding for deep keyphrase generation
What does BERT look at? An analysis of BERT's attention
BERT: Pre-training of deep bidirectional transformers for language understanding
Bottom-up abstractive summarization
Evaluating the descriptive power of Instagram hashtags
Using topic models for Twitter hashtag recommendation
Hashtag recommendation using attention-based convolutional neural network
Hashtag recommendation using Dirichlet process mixture models incorporating types of hashtags
Universal language model fine-tuning for text classification
Hashtag recommendation using end-to-end memory networks with hierarchical attention
Weakly supervised attention for hashtag recommendation using graph data
Hierarchical transformers for multi-document summarization
Deep keyphrase generation
Rajiv Ratn Shah, and Amanda Stent. 2020. A preliminary exploration of GANs for keyphrase generation
Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach
Topic-aware neural keyphrase generation for social media language
Microblog hashtag generation via encoding conversation contexts
#TagSpace: Semantic embeddings from hashtags
Semi-supervised learning for neural keyphrase generation
One size does not fit all: Generating and evaluating variable number of keyphrases
Keyphrase extraction using deep recurrent neural networks on Twitter
Hashtag recommendation for multimodal microblog using co-attention network
Encoding conversation context for neural keyphrase extraction from microblog posts
Selective encoding for abstractive sentence summarization

Appendix

Most posts on Social Networking Service (SNS) platforms come with informative hashtags. Existing works (Giannoulakis and Tsapatsoulis, 2016) have shown that naturally annotated hashtags (such as Instagram hashtags) can be used as training examples for machine learning algorithms consistent with the posts. In Figure 1, the microblog user has posted the hashtag '#5G Bring New Value', and we treat such natural user-provided hashtags as ground truth for training, validation, and test. Besides, post hashtags are selected by users such as official media and influencers, whose labeled hashtags are of high quality. These premises make it reasonable to directly use the user-annotated hashtags in microblogs as the ground-truth hashtags. We take the user-annotated hashtags appearing at the beginning or end of a post as the reference, as prior works (Wang et al., 2019b) did.

WHG construction: We collect the post-hashtag pairs by crawling the microblogs of seed accounts involving multiple areas from Weibo.