key: cord-0533453-zr5hpxax
authors: Goyal, Navita; Paneri, Roodram; Agarwal, Ayush; Kalani, Udit; Sancheti, Abhilasha; Chhaya, Niyati
title: CaM-Gen:Causally-aware Metric-guided Text Generation
date: 2020-10-24
journal: nan
DOI: nan
sha: 465bae97532466f36dfe91eaab01aee826f253e0
doc_id: 533453
cord_uid: zr5hpxax

Content is created for a well-defined purpose, often described by a metric or signal represented in the form of structured information. The relationship between the goal (metrics) of target content and the content itself is non-trivial. While large-scale language models show promising text generation capabilities, guiding the generated text with external metrics is challenging. These metrics and content tend to have inherent relationships and not all of them may be of consequence. We introduce CaM-Gen: Causally aware Generative Networks guided by user-defined target metrics incorporating the causal relationships between the metric and content features. We leverage causal inference techniques to identify causally significant aspects of a text that lead to the target metric and then explicitly guide generative models towards these by a feedback mechanism. We propose this mechanism for variational autoencoder and Transformer-based generative models. The proposed models beat baselines in terms of the target metric control while maintaining fluency and language quality of the generated text. To the best of our knowledge, this is one of the early attempts at controlled generation incorporating a metric guide using causal inference.

Most content is created for a well-defined goal. For example, a blog writer often publishes articles to gain popularity and trigger conversations, and a columnist may write an opinionated piece to gather feedback. In marketing applications, these goals are business objectives that need to be optimized using the content shared with the customers.
Whether the goal was met is validated by tracking metrics that capture reader behavior. In social media, these metrics include the number of comments, likes, or shares, whereas for a publishing house they are the number of views and readers. These engagement metrics (hereafter, metrics) are proxies for target goals. Based on historical content, the textual characteristics that successfully achieve the desired metrics can be assessed (Tan et al., 2019; Verma et al., 2020). Guiding text generation models by these signals is important for meeting the required goals.

* Work done while at Adobe Research.

While recent neural language models have shown tremendous success in fluent text generation (Radford et al., 2018; Devlin et al., 2019), achieving controlled, goal-specific generation is challenging. There has been work on text generation controlling for style, topic, or size (Keskar et al., 2019). These methods are able to leverage content characteristics that are common between the definition of the goal (i.e., the control) and the text. However, for metrics that are not explicit in the text, controlled generation is non-trivial to codify. The challenge arises because, for external metrics, one must first identify the relationship between the content characteristics and the metric, and then explicitly introduce a guide or constraint that enables the generator to learn the desired content properties. Unlike style, these choices might be difficult for a layman to identify manually and supply to the generative models.

Textual content is an amalgam of various linguistic features: lexical, pertaining to word choices; semantic, concerned with the meaning; syntactic, relating to part-of-speech tags; and surface-level features, comprising punctuation, word count, sentence count, etc.
To avoid misinformation (or clickbait-y) generation, automated tools should be able to alter the syntactic and surface-level characteristics of text to meet the desired outcome. Explicitly identifying the features of interest that result in the intended outcome can enable finer control. In this paper, we first discuss a method, derived from the causality literature (Funk et al., 2011), to identify the subset of these features that have a direct and significant impact on the outcome metric. A causally significant relationship helps encode the 'if this, then that' logic; adding such a guide for the generator can help ensure on-metric generation.

In this paper, we propose a causal guidance mechanism for two modeling frameworks that are used for metric-guided generation: conditional variational autoencoders (Sohn et al., 2015) and Transformer-based language models (Vaswani et al., 2017). For conditional variational autoencoders (CVAE), we modify the VAE graph to introduce causal guidance. In Transformer-based language models, we introduce causal guidance by adding causal losses for explicit feedback on causal features.

Our key contributions are causal guidance frameworks for metric-guided, controlled text generation in CVAE and Transformer-based generative models. We experiment with a new dataset of news articles related to COVID-19 along with the NYT-comments dataset, showing improved performance against baseline methods. To the best of our knowledge, this is one of the first attempts towards controlled generation on engagement metrics and the inclusion of causal guidance for controlled generation in generative models.

The literature on text generation spans various generative models, including variational autoencoders (VAEs), generative adversarial networks (GANs), and sequential models. VAEs have been used for unconditional (Bowman et al., 2016) as well as constrained text generation (Zhang et al., 2016; Pagnoni et al., 2018). Pagnoni et al.
(2018) generate a sentence sequence y conditioned on the input sentence for machine translation, thus mimicking a sequence-to-sequence model. Hu et al. (2017) control sentiment and tense in text generation using discriminators with VAEs. Zhao et al. (2017) introduce an additional reconstruction network in CVAEs for controlling linguistic features in dialog generation. As we show in our experiments, this does not adapt well to controlled generation where the relationship with the target goal is not as explicit in text. We identify these nuanced relationships between the text and the underlying goal and enable explicit control over the text features influencing the target outcome by modifying the VAE graph.

While VAEs enable controlled generation, they do not generate fluent language with limited data. Large Transformer-based language models (Radford et al., 2018; Devlin et al., 2019) have shown efficacy in generating fluent language, allowing for fine-tuning for specific tasks on a smaller dataset while maintaining good language quality. Keskar et al. (2019) introduce style control, such as domain (books, Wikipedia, etc.), by conditioning the generated distribution on the style token y, i.e. $p(x \mid y) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}, y)$.

[…] 0.92 for all treatment features, and the potential outcome model performs well for Upvote and Comment count. We use these as target metrics in the generative models for the NYT dataset. A similar analysis on the Webhose data yields Participation and Replies count as target metrics. Fig. 4 shows the Average Treatment Effect (ATE) of various text features on these outcome metrics. We empirically choose a significance level of 0.1 and consider features with an ATE greater than 0.1 (in magnitude) as 'causally significant' features. We include these as causal features in the generative models.

Causal Analysis. We note that the fastText classifiers used for metric evaluation have relatively low accuracy (although much better than a random 33%).
We attribute this to high variability in the text and the unpredictability of the resulting engagement. As discussed previously, a causal analysis of historical text accounts for semantic and topical variation. Similarly, a causal analysis of generated data, and a subsequent comparison with historical trends, could compensate for any potential inadequacies of classifier-based evaluation. To this end, we perform a causal analysis of the text generated by the baseline and by our proposed model. We generate text with high, medium, and low participation count (pcount) as the target and record the average value of various treatment features (Fig. 5). Here, the word and sentence counts are normalized, and the POS features are the fraction of words with a certain POS tag over the total number of words in the generated text.

We test the adoption of 'causally significant' features in the causal model by analyzing the feature distributions of text generated by the causal model and by the baseline Transformer model across classes (high/medium/low). For instance, word count has a negative ATE on pcount (Fig. 4a). Thus, we would expect a text with a higher word count to have a lower pcount. As seen in Fig. 5a, our causal model with a 'high' target pcount generated articles with a lower word count on average than the causal model with a 'low' target (red and blue bars, respectively, in the first group in Fig. 5a). Similar trends are observed across the other 'causally significant' treatment features. In contrast, the text generated by the baseline model (Fig. 5b) either does not show significant variation in these features across the high, medium, and low targets, or the difference is inconsistent, reflecting the lack of control over aspects of text in baseline models, where generation is guided only by the target metric. As these features, by definition, significantly impact the outcome, this analysis adds further confidence in the stronger adherence to the target metric of our proposed causal approach over the baseline.
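The feature-selection rule and the direction check discussed in this analysis can be illustrated with a minimal sketch. It assumes the ATE estimates are already available as a plain dictionary, and it omits the significance test entirely; the function names and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def select_causal_features(ate: Dict[str, float], threshold: float = 0.1) -> List[str]:
    """Keep features whose ATE magnitude exceeds the threshold, largest first."""
    selected = [name for name, effect in ate.items() if abs(effect) > threshold]
    return sorted(selected, key=lambda name: abs(ate[name]), reverse=True)


def follows_ate_direction(ate_value: float, mean_high: float, mean_low: float) -> bool:
    """Check that the high-vs-low difference of a feature has the same sign as its ATE.

    A negative ATE means text generated for a 'high' target should show a
    lower average feature value than text generated for a 'low' target.
    """
    return (mean_high - mean_low) * ate_value > 0


# Illustrative values only (assumption): not results reported in the paper.
example_ate = {"word_count": -0.25, "sentence_count": 0.04, "noun_fraction": 0.12}
causal_features = select_causal_features(example_ate)
print(causal_features)
print(follows_ate_direction(example_ate["word_count"], mean_high=0.40, mean_low=0.55))
```

In this toy setting, a generator that respects the negative effect of word count would pass the direction check, while one whose 'high' outputs are longer than its 'low' outputs would fail it.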
We present a framework for causally aware metric-guided generation in VAE and Transformer-based models. We successfully identify causally significant text features using causal analysis and incorporate them into the generative model. We show that integrating causal guidance in guided generation enables better control over the target metric, while maintaining language quality. Our proposed causally guided Transformer model shows improved performance across datasets. Moreover, we show that the generated text adheres to these causal features, in line with their observed effect in historic data. This exploration opens up avenues for leveraging causality for controlled generation.

Ethics Statement. We recognize and acknowledge that our work carries a possibility of misuse for fake news generation, the same as any text generation system. We strongly recommend coupling any such technology with a fake news detection and review system before deployment. We do not believe that our method exacerbates fake news generation, as it aims to optimize syntactic and surface-level features, and not topical or semantic features. On the contrary, causal guidance towards these specific factors may guide models to focus on these features and deter them from other non-desirable optimization of content. The data and approaches for generating text that optimizes for clicks exist already. Our proposed approach adds a nuanced control on the linguistic features to optimize for generating desirable content, rather than unconstrained optimization for clicks.

The graph for non-causal conditional generation using a variational autoencoder is shown in Fig. 1 (left). As discussed in section 4.1, we approximate the intractable posterior distribution $p_\theta(z \mid x, c, y)$ with the recognition network $q_\phi(z \mid x, c, y)$, where […] The variational parameters $\phi$ are chosen such that the approximate posterior distribution $q_\phi(z \mid x, c, y)$ is as close to the true posterior distribution $p_\theta(z \mid x, c, y)$ as possible.
This is done by minimizing the KL divergence between the two distributions. Thus, […] where the KL divergence is given by […] Rearranging equation 12 gives […] We want to minimize the KL divergence term on the R.H.S. of equation 13. Since the KL divergence is ≥ 0, the variational lower bound on the log likelihood $\log p_\theta(x)$ is given by […] Using equation 10, we get […] Replacing in equation 14, we get the variational lower bound for the non-causal CVAE as

$$L(\theta, \phi; x, c, y) = \mathbb{E}_{q_\phi(z, y \mid x, c)} \log p_\theta(x \mid c, z, y) \; […]$$

As discussed in section 4.2, we add causal guidance in the CVAE framework by adding the treatment vector t for aligning the latent space of the variational autoencoder. The posterior distribution for the causal-CVAE graph in Fig. 1 (right) is approximated by $q_\phi(z \mid x, c, y)$. Similar to equation 14, we get the variational lower bound for the causal CVAE as […] The conditional posterior $q_\phi(z, y \mid t, x, c)$ is given by […] Thus,

$$\mathrm{KL}\big(q_\phi(z, y \mid t, x, c) \,\|\, p_\theta(z, y \mid c)\big) = \mathbb{E}_{q_\phi(y \mid t, x, c)} \mathrm{KL}\big(q_\phi(z \mid t, x, c, y) \,\|\, p_\theta(z \mid c, y)\big)$$

Using this in equation 17 gives us the variational lower bound for the causal CVAE as

$$L(\theta, \phi; t, x, c, y) = \mathbb{E}_{q_\phi(z, y \mid t, x, c)} \log p_\theta(t \mid x, c, z, y) \; […]$$

As discussed in section 4.3, we modify the attention and normalization layers in a transformer architecture to add the metric as a guide. Inspired by Zeng et al. (2020), we introduce the metric as follows:

(1) Input embedding: The metric control y is directly added to the token and position embeddings of the input to the first transformer layer. This enables control by slanting the input representation towards the target metric.

(2) Self-attention: In the self-attention mechanism of transformers, each input token is weighted with respect to other positions in the input. For each token $x_t$, the query $q_t$, key $k_t$ and value $v_t$ are calculated using learned weight matrices $W^Q$, $W^K$ and $W^V$ respectively.
The attention score for token $x_t$ is computed by a compatibility function of the corresponding query $q_t$ with the keys $k_i$ of other tokens, and the attention vector is computed as the weighted average of the value vectors $v_i$ under these attention scores. In the standard scaled dot-product form, this can be written as

$$\mathrm{Attention}(q_t, K, V) = \mathrm{softmax}\!\left(\frac{q_t K^\top}{\sqrt{d_k}}\right) V$$

where K and V stack the keys $k_i$ and values $v_i$ of all positions and $d_k$ is the dimension of the key vector $k_t$. We modify this attention calculation to introduce the control y by changing the query vector in the above equation to $q_t = \eta_t(y)$, where $\eta_t$ denotes an affine transformation. Modifying the query vector according to the specific target metric allows for biasing the attention weights towards the target and capturing the target control in the context representation, which aids in targeted decoding and generation.

(3) Layer Normalization: Classically, the layer normalization in transformers is calculated as

$$\mathrm{LayerNorm}(\nu) = \gamma \, \frac{\nu - \mu}{\sigma} + \beta$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the elements in $\nu$, and $\gamma$ and $\beta$ are the scale and bias parameters. The metric control, y, is used to modulate the hidden representations of the generative model via the normalization layers. The scale and bias parameters in the layer normalization are replaced by functions of y, namely $\gamma(y)$ and $\beta(y)$ in the above equation. As discussed in Park et al. (2019), a normalization layer applied to inputs with the same target control would wash away the target information captured in the input to the normalization layer. Adding the target control in the scale and bias parameters ensures that the control is preserved through the normalization layers of the transformer.

Training details. For fine-tuning, we prepend the input sentence with metric identifiers, to keep the input layer unchanged. We then extract the prepended metric token and use it to modify the attention and normalization layers as described earlier. […] of generated text to the target metric class in the form of a metric loss. During inference, the generation is conditioned on the prompt, which is a combination of the topic and keywords.
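The metric-conditioned layer normalization of item (3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the paper says the scale and bias become functions of y but does not give their form here, so the sketch assumes affine maps of a one-hot metric vector; the helper names, the toy weights, and the small `eps` stabilizer are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def affine(weight: Matrix, bias: Vector, y: Sequence[float]) -> Vector:
    """Compute weight @ y + bias using plain lists."""
    return [sum(w * y_j for w, y_j in zip(row, y)) + b for row, b in zip(weight, bias)]


def layer_norm(v: Sequence[float], gamma: Sequence[float], beta: Sequence[float],
               eps: float = 1e-5) -> Vector:
    """Classical layer normalization: gamma * (v - mu) / sigma + beta."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    sigma = math.sqrt(var + eps)  # eps is an assumed numerical stabilizer
    return [g * (x - mu) / sigma + b for x, g, b in zip(v, gamma, beta)]


def metric_layer_norm(v: Sequence[float], y: Sequence[float],
                      w_gamma: Matrix, b_gamma: Vector,
                      w_beta: Matrix, b_beta: Vector) -> Vector:
    """Layer normalization whose scale and bias depend on the metric vector y."""
    gamma = affine(w_gamma, b_gamma, y)  # gamma(y), assumed affine in y
    beta = affine(w_beta, b_beta, y)     # beta(y), assumed affine in y
    return layer_norm(v, gamma, beta)


# Toy example (assumption): 4 hidden units, 2 metric classes encoded one-hot.
hidden = [1.0, 2.0, 3.0, 4.0]
y_low, y_high = [1.0, 0.0], [0.0, 1.0]
w_gamma = [[0.0, 0.5] for _ in range(4)]
b_gamma = [1.0] * 4
w_beta = [[-0.1, 0.1] for _ in range(4)]
b_beta = [0.0] * 4

out_low = metric_layer_norm(hidden, y_low, w_gamma, b_gamma, w_beta, b_beta)
out_high = metric_layer_norm(hidden, y_high, w_gamma, b_gamma, w_beta, b_beta)
print(out_low)
print(out_high)
```

The same hidden vector yields different normalized outputs for the two metric classes, which is the sense in which the control signal survives the normalization step.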
During training, the keywords and topic for the article are prepended to the input along with a {start of text} token. Thus, the input is {metric token}+{topic}+{start of keyword token}+{keywords}+{start of text token}+{article text}. The keywords and topics are available for the NYT dataset for each article, and are extracted from input text using topic modeling (Blei et al., 2003) as described in the next section. The computing infrastructure and hyper-parameter details are included in Appendix E.

Webhose Covid-19 Dataset: We use the Webhose dataset (https://webhose.io/freedatasets/news-articles-that-mention-corona-virus/), which has 410,120 data points in total. We choose the subset of this dataset limited to English. To remove outliers, we heuristically keep articles with more than 30 but fewer than 5000 words. The data contains engagement on various news articles in the form of participation count, replies count, and various other social media like and share metrics. The social media metrics include Pinterest, LinkedIn, and Google+ shares, and likes, shares, and comments on Facebook. Most of these are very sparse in the dataset; for instance, fewer than ∼12k data points have non-zero Facebook comments. Thus, we choose participation count and replies count as good indicators of the engagement on an article and use these as our target metrics. We consider only the articles with participation count > 1, leaving us with 39,192 data points in total. The metric values for participation count and replies count range over 1-297 and 0-5751 respectively, with a mean and standard deviation of 14.37 and 27.90 for participation count and 129.91 and 446.71 for replies count. To control for these metrics in our models, we convert them to categorical variables, with thresholds of 2 and 21 for participation count.
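As a rough illustration of the bucketing just described and of the training input layout given earlier, the following sketch maps a raw participation count to a class and assembles one input string. The source does not say which side of each threshold is inclusive, so the boundary handling here is an assumption, and the bracketed token strings and function names are hypothetical placeholders rather than the tokens actually added to the tokenizer.

```python
from typing import List, Tuple

# Thresholds quoted in the text for participation count.
PCOUNT_THRESHOLDS: Tuple[int, int] = (2, 21)


def bucketize(value: int, thresholds: Tuple[int, int] = PCOUNT_THRESHOLDS) -> str:
    """Map a raw metric value to 'low', 'medium' or 'high'.

    Assumption: values up to and including a threshold fall in the lower bucket.
    """
    low_max, medium_max = thresholds
    if value > medium_max:
        return "high"
    if value > low_max:
        return "medium"
    return "low"


def build_input(metric_class: str, topic_id: int, keywords: List[str], article_text: str) -> str:
    """Assemble metric token + topic + keyword marker + keywords + text marker + article."""
    parts = [
        f"[METRIC_{metric_class.upper()}]",  # hypothetical metric token
        f"[TOPIC_{topic_id}]",               # hypothetical topic identifier token
        "[KEYWORDS]",                        # hypothetical start-of-keywords token
        " ".join(keywords),
        "[TEXT]",                            # hypothetical start-of-text token
        article_text,
    ]
    return " ".join(parts)


example = build_input(bucketize(40), 7, ["virus", "vaccine"], "Some article.")
print(example)
```

At inference time the same layout would be truncated after the start-of-text marker, so that the metric class, topic, and keywords act as the prompt.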
The low bucket is the largest bucket, with the least standard deviation in the value of the metric; the medium and high categories have almost the same number of data points, as shown in Table 1 in the paper. Similarly, for replies count the thresholds are 2 and 32, with medium and high categories of equal size.

As mentioned earlier, the context for the generative models includes the keywords and topic of the article, which act as the "prompt" during the inference stage. For the Webhose data, keywords are not directly available in the dataset, whereas the NYT-comments dataset provides them. We extract the keywords as the top n (n = 10) words from the articles using TF-IDF vectors. The topics are extracted by topic modeling using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). We choose 20 topics with a seed of 23 and then represent the topic of each input article as the corresponding topic identifier, ranging from 1 to 20. For the transformer-based model, the keyword and topic tokens are added to the pre-trained tokenizer.

The various textual features considered for causal effect are listed in Table 4. The average treatment effect on the NYT data metrics, Comment count and Upvote count, is as shown. Here, the significance level is empirically chosen as 0.01. Thus, features with |ATE| > 0.01 on comment count or upvote count y are included in the corresponding causal generative model. For the Webhose data, we choose a significance level of 0.1 and consider features with an ATE greater than 0.1 in magnitude as 'causally significant' features.

The causal feature identification models are trained on a train-test split of 90-10, using a random seed of 23 with stratified sampling over the outcome values, for over 10 epochs in batches of size 5. For transformers, we use the HuggingFace implementation of GPT-2 and make the model and training changes as described in the paper. The hyper-parameters are kept the same as in the original implementation for uniformity.
For the loss term mentioned in equation 11 of the paper, we set $\lambda_G$, $\lambda_{metric}$, and $\lambda_{causal}$ to 1. We train these models with a batch size of 2 for over 3 epochs. The training time over 4 GPUs was about 14 hours for the Webhose data and about 5 hours for the NYT dataset. For the CVAE model, we use the Adam optimizer. We start training with a learning rate of 0.001 and a learning rate decay of 0.6. We train the models over 30 epochs with an early stopping criterion of 0.996 threshold. All the training experiments were run on a 4-GPU machine with a 64-bit 16-core Tesla V100 processor and 100 GB RAM.

References.
An introduction to propensity score methods for reducing the effects of confounding in observational studies
Latent dirichlet allocation
Generating sentences from a continuous space
BERT: Pre-training of deep bidirectional transformers for language understanding
Doubly Robust Estimation of Causal Effects
Toward controlled generation of text
Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models
CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv e-prints
Adam: A method for stochastic optimization
ROUGE: A package for automatic evaluation of summaries
Conditional Variational Autoencoder for Neural Machine Translation. arXiv e-prints
Semantic image synthesis with spatially-adaptive normalization
Feature selection as causal inference: Experiments with text classification
Language models are unsupervised multitask learners
BLEURT: Learning robust metrics for text generation
Neural machine translation of rare words with subword units
Incorporating stylistic lexical preferences in generative language models
Learning structured output representation using deep conditional generative models
Policy gradient methods for reinforcement learning with function approximation
User response driven content understanding with causal inference