GILE: A Generalized Input-Label Embedding for Text Classification

Nikolaos Pappas, James Henderson
Idiap Research Institute, Martigny 1920, Switzerland
{nikolaos.pappas,james.henderson@idiap.ch}

Abstract

Neural text classification models typically treat output labels as categorical variables which lack description and semantics. This forces their parametrization to be dependent on the label set size, and, hence, they are unable to scale to large label sets and generalize to unseen ones. Existing joint input-label text models overcome these issues by exploiting label descriptions, but they are unable to capture complex label relationships, have rigid parametrization, and their gains on unseen labels often come at the expense of weak performance on the labels seen during training. In this paper, we propose a new input-label model which generalizes over previous such models, addresses their limitations, and does not compromise performance on seen labels. The model consists of a joint non-linear input-label embedding with controllable capacity and a joint-space-dependent classification unit which is trained with cross-entropy loss to optimize classification performance. We evaluate models on full-resource and low- or zero-resource text classification of multilingual news and biomedical text with a large label set. Our model outperforms monolingual and multilingual models which do not leverage label semantics, as well as previous joint input-label space models, in both scenarios.

1 Introduction

Text classification is a fundamental NLP task with numerous real-world applications such as topic recognition (Tang et al., 2015; Yang et al., 2016), sentiment analysis (Pang and Lee, 2005; Yang et al., 2016), and question answering (Chen et al., 2015; Kumar et al., 2015). Classification also appears as a subtask for sequence prediction tasks such as neural machine translation (Cho et al., 2014; Luong et al., 2015) and summarization (Rush et al., 2015). Despite the numerous studies, existing models are trained on a fixed label set using k-hot vectors and, therefore, treat target labels as mere atomic symbols without any particular structure to the space of labels, ignoring potential linguistic knowledge about the words used to describe the output labels. Given that semantic representations of words have been shown to be useful for representing the input, it is reasonable to expect that they are going to be useful for representing the labels as well.

Previous work has leveraged knowledge from the label texts through a joint input-label space, initially for image classification (Weston et al., 2011; Mensink et al., 2012; Frome et al., 2013; Socher et al., 2013). Such models generalize to labels both seen and unseen during training, and scale well to very large label sets. However, as we explain in Section 2, existing input-label models for text (Yazdani and Henderson, 2015; Nam et al., 2016) have the following limitations: (i) their embedding does not capture complex label relationships due to its bilinear form, (ii) their output layer parametrization is rigid because it depends on the dimensionality of the encoded text and labels, and (iii) they are outperformed on seen labels by classification baselines trained with cross-entropy loss (Frome et al., 2013; Socher et al., 2013).

In this paper, we propose a new joint input-label model which generalizes over previous such models, addresses their limitations, and does not compromise performance on seen labels (see Figure 1).
The proposed model is comprised of a joint non-linear input-label embedding with controllable capacity and a joint-space-dependent classification unit which is trained with cross-entropy loss to optimize classification performance.¹ The need for capturing complex label relationships is addressed by two non-linear transformations which have the same target joint space dimensionality. The parametrization of the output layer is not constrained by the dimensionality of the input or label encoding, but is instead flexible, with a capacity which can be easily controlled by choosing the dimensionality of the joint space. Training is performed with cross-entropy loss, which is a suitable surrogate loss for classification problems, as opposed to a ranking loss such as the WARP loss (Weston et al., 2010), which is more suitable for ranking problems.

1 Our code is available at: github.com/idiap/gile

Evaluation is performed on full-resource and low- or zero-resource scenarios of two text classification tasks, namely biomedical semantic indexing (Nam et al., 2016) and multilingual news classification (Pappas and Popescu-Belis, 2017), against several competitive baselines. In both scenarios, we provide a comprehensive ablation analysis which highlights the importance of each model component and the difference with previous embedding formulations when using the same type of architecture and loss function. Our main contributions are the following:

(i) We identify key theoretical and practical limitations of existing joint input-label models.

(ii) We propose a novel joint input-label embedding with flexible parametrization which generalizes over the previous such models and addresses their limitations.

(iii) We provide empirical evidence of the superiority of our model over monolingual and multilingual models which ignore label semantics, and over previous joint input-label models on both seen and unseen labels.

The remainder of this paper is organized as follows. Section 2 provides background knowledge and explains limitations of existing models. Section 3 describes the model components, training and relation to previous formulations. Section 4 describes our evaluation results and analysis, while Section 5 provides an overview of previous work and Section 6 concludes the paper and provides future research directions.

2 Background: Neural Text Classification

We are given a collection $D = \{(x_i, y_i),\ i = 1, \dots, N\}$ made of $N$ documents, where each document $x_i$ is associated with labels $y_i = \{y_{ij} \in \{0,1\} \mid j = 1, \dots, k\}$, and $k$ is the total number of labels. Each document $x_i = \{w_{11}, w_{12}, \dots, w_{K_i T_{K_i}}\}$ is a sequence of words grouped into sentences, with $K_i$ being the number of sentences in document $i$ and $T_j$ being the number of words in sentence $j$. Each label $j$ has a textual description comprised of multiple words, $c_j = \{c_{j1}, c_{j2}, \dots, c_{jL_j} \mid j = 1, \dots, k\}$, with $L_j$ being the number of words in each description.
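To make this notation concrete, here is a minimal Python sketch of one way the data could be laid out; the toy documents, labels, and descriptions are purely illustrative assumptions, not taken from the BioASQ or DW datasets or from the released code.

```python
# One document x_i: a list of sentences, each a list of words.
document = [
    ["the", "parliament", "approved", "the", "budget"],      # sentence 1
    ["opposition", "parties", "criticised", "the", "vote"],  # sentence 2
]

# Its label indicators y_i over k labels (multi-label: several 1s allowed).
labels = [0, 1, 0, 1]  # k = 4 in this toy example

# Textual descriptions c_j for every label, seen or unseen during training.
label_descriptions = [
    ["sports"],
    ["domestic", "politics"],
    ["business", "and", "finance"],
    ["european", "union", "budget"],
]
```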
Given the input texts and their associated labels seen during the training portion of $D$, our goal is to learn a text classifier which is able to predict labels both in the seen, $\mathcal{Y}_s$, or unseen, $\mathcal{Y}_u$, label sets, defined as the sets of unique labels which have or have not been seen during training respectively, and, hence, $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$ and $\mathcal{Y} = \mathcal{Y}_s \cup \mathcal{Y}_u$.²

2 Note that depending on the number of labels per document the problem can be a multi-label or multi-class problem.

2.1 Input Text Representation

To encode the input text, we focus on hierarchical attention networks (HANs), which are competitive for monolingual (Yang et al., 2016) and multilingual text classification (Pappas and Popescu-Belis, 2017). The model takes as input a document $x$ and outputs a document vector $h$. The input words and label words are represented by vectors in $\mathbb{R}^d$ from the same³ embeddings $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, where $\mathcal{V}$ is the vocabulary and $d$ is the embedding dimension; $E$ can be pre-trained or learned jointly with the rest of the model. The model has two levels of abstraction, word and sentence. The word level is made of an encoder network $g_w$ and an attention network $a_w$, while the sentence level similarly includes an encoder and an attention network.

3 This statement holds true for multilingual classification problems too if the embeddings are aligned across languages.

Encoders. The function $g_w$ encodes the sequence of input words $\{w_{it} \mid t = 1, \dots, T_i\}$ for each sentence $i$ of the document, noted as:

$h_w^{(it)} = g_w(w_{it}), \quad t \in [1, T_i],$  (1)

and at the sentence level, after combining the intermediate word vectors $\{h_w^{(it)} \mid t = 1, \dots, T_i\}$ into a sentence vector $s_i \in \mathbb{R}^{d_w}$ (see below), where $d_w$ is the dimension of the word encoder, the function $g_s$ encodes the sequence of sentence vectors $\{s_i \mid i = 1, \dots, K\}$, noted as $h_s^{(i)}$. The $g_w$ and $g_s$ functions can be any feed-forward (DENSE) or recurrent networks, e.g. GRU (Cho et al., 2014).

Attention. The $\alpha_w$ and $\alpha_s$ attention mechanisms, which estimate the importance of each hidden state vector, are used to obtain the sentence representation $s_i$ and the document representation $h$ respectively. The sentence vector is thus calculated as follows:

$s_i = \sum_{t=1}^{T_i} \alpha_w^{(it)} h_w^{(it)} = \sum_{t=1}^{T_i} \frac{\exp(v_{it}^\top u_w)}{\sum_j \exp(v_{ij}^\top u_w)}\, h_w^{(it)},$  (2)

where $v_{it} = f_w(h_w^{(it)})$ is a fully-connected network with $W_w$ parameters. The document vector $h \in \mathbb{R}^{d_h}$, where $d_h$ is the dimension of the sentence encoder, is calculated similarly, by replacing $v_{it}$ with $v_i = f_s(h_s^{(i)})$, which is a fully-connected network with $W_s$ parameters, and $u_w$ with $u_s$; $u_w$ and $u_s$ are parameters of the attention functions.

2.2 Label Text Representation

To encode the label text we use an encoder function which takes as input a label description $c_j$ and outputs a label vector $e_j \in \mathbb{R}^{d_c}$ for all $j = 1, \dots, k$. For efficiency reasons, we use a simple, parameter-free function to compute $e_j$, namely the average of the word vectors which describe label $j$, that is $e_j = \frac{1}{L_j} \sum_{t=1}^{L_j} c_{jt}$, and hence $d_c = d$ in this case. By stacking all these label vectors into a matrix, we obtain the label embedding $\mathbf{E} \in \mathbb{R}^{|\mathcal{Y}| \times d}$. In principle, we could also use the same encoder functions as the ones for the input text, but this would increase the computation significantly; hence, we keep this direction as future work.
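As a minimal illustration of the attention pooling in Eq. 2 and the parameter-free label encoder of Section 2.2, the following PyTorch sketch assumes the hidden states have already been produced by the encoders $g_w$/$g_s$; the function names, dimensions, and the Tanh-based scoring network are our own illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

def attention_pool(H, f, u):
    """Attention pooling as in Eq. 2.
    H: (T, d) hidden states of one sentence (or sentence vectors of a document),
    f: small fully-connected network (f_w or f_s),
    u: (d_a,) attention context vector (u_w or u_s)."""
    v = f(H)                                    # (T, d_a)
    alpha = torch.softmax(v @ u, dim=0)         # attention weights over positions
    return (alpha.unsqueeze(1) * H).sum(dim=0)  # weighted sum, (d,)

def encode_label(description_ids, word_emb):
    """Parameter-free label encoder of Section 2.2: the average of the word
    vectors of the label description, so d_c = d."""
    return word_emb(description_ids).mean(dim=0)

# Example wiring (dimensions are illustrative):
d, d_a = 100, 100
f_w = nn.Sequential(nn.Linear(d, d_a), nn.Tanh())
u_w = torch.randn(d_a)
word_emb = nn.Embedding(5000, d)
sentence_vec = attention_pool(torch.randn(12, d), f_w, u_w)  # one 12-word sentence
label_vec = encode_label(torch.tensor([3, 17]), word_emb)    # a 2-word description
```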
2.3 Output Layer Parametrizations

2.3.1 Typical Linear Unit

The most typical output layer consists of a linear unit with a weight matrix $W \in \mathbb{R}^{d_h \times |\mathcal{Y}|}$ and a bias vector $b \in \mathbb{R}^{|\mathcal{Y}|}$, followed by a softmax or sigmoid activation function. Given the encoder's hidden representation $h$ with dimension size $d_h$, the probability distribution of output $y$ given input $x$ is proportional to the following quantity:

$p(y|x) \propto \exp(W^\top h + b).$  (3)

The parameters in $W$ can be learned separately or be tied with the parameters of the embedding $E$ by setting $W = E^\top$, if the input dimension of $W$ is restricted to be the same as that of the embedding $E$ ($d = d_h$) and each label is represented by a single-word description, i.e. when $\mathcal{Y}$ corresponds to $\mathcal{V}$ and $\mathbf{E} = E$. In the latter case, Eq. 3 becomes:

$p(y|x) \propto \exp(E h + b).$  (4)

Either way, the parameters of such models are typically learned with cross-entropy loss, which is suitable for classification problems. However, in both cases they cannot be applied to labels which are not seen during training, because each label has learned parameters which are specific to that label, so the parameters for unseen labels cannot be learned. We now turn our focus to a class of models which can handle unseen labels.

2.3.2 Bilinear Input-Label Unit

Joint input-output embedding models can generalize from seen to unseen labels because the parameters of the label encoder are shared. The previously proposed joint input-output embedding models by Yazdani and Henderson (2015) and Nam et al. (2016) are based on the following bilinear ranking function $f(\cdot)$:

$f(x, y) = \mathbf{E} W h,$  (5)

where $\mathbf{E} \in \mathbb{R}^{|\mathcal{Y}| \times d}$ is the label embedding and $W \in \mathbb{R}^{d \times d_h}$ is the bilinear embedding. This function allows one to define the rank of a given label $y$ with respect to $x$, and is trained using hinge loss to rank positive labels higher than negative ones. But note that the use of this ranking loss means that these models do not model the conditional probability, as do the traditional models above.

Limitations. Firstly, the above formula can only capture linear relationships between the encoded text ($h$) and the label embedding ($\mathbf{E}$) through $W$. We argue that the relationships between different labels are non-linear, due to the complex interactions of the semantic relations across labels but also between labels and different encoded inputs. A more appropriate form for this purpose would include a non-linear transformation $\sigma(\cdot)$, e.g. with either:

(a) $\underbrace{\sigma(\mathbf{E} W)}_{\text{Label structure}} h$  or  (b) $\mathbf{E}\, \underbrace{\sigma(W h)}_{\text{Input structure}}.$  (6)

Secondly, it is hard to control their output layer capacity due to their bilinear form, which uses a matrix of parameters ($W$) whose size is bounded by the dimensionalities of the label embedding and the text encoding. Thirdly, their loss function optimizes ranking instead of classification performance and thus treats the ground truth as a ranked list, when in reality it consists of one or more independent labels.

Summary. We hypothesize that these are the reasons why these models do not yet perform well on seen labels compared to models which make use of the typical linear unit, and do not take full advantage of the structure of the problem when tested on unseen labels. Ideally, we would like a model which addresses these issues and combines the benefits of both the typical linear unit and the joint input-label models.
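For reference, the bilinear scoring function of Eq. 5 can be sketched as follows; this is our own rendering of the prior formulation, not the released code of Yazdani and Henderson (2015) or Nam et al. (2016). Its only output-layer parameter is the single $d \times d_h$ matrix $W$, which is exactly what makes its capacity hard to control.

```python
import torch
import torch.nn as nn

class BilinearInputLabel(nn.Module):
    """Bilinear joint space of Eq. 5: f(x, y) = E W h, with no non-linearity."""
    def __init__(self, d_label, d_hidden):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_label, d_hidden) * 0.01)

    def forward(self, E, h):
        # E: (k, d_label) label embeddings, h: (d_hidden,) encoded document.
        # Returns one ranking score per label; in the prior work these scores
        # are trained with a ranking loss such as WARP, not with cross-entropy.
        return E @ self.W @ h   # (k,)
```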
3 The Proposed Output Layer Parametrization for Text Classification

We propose a new output layer parametrization for neural text classification which is comprised of a generalized input-label embedding, which captures the structure of the labels, the structure of the encoded texts and the interactions between the two, followed by a classification unit which is independent of the label set size. The resulting model has the following properties: (i) it is able to capture complex output structure, (ii) it has a flexible parametrization which allows its capacity to be controlled, and (iii) it is trained with a classification surrogate loss such as cross-entropy. The model is depicted in Figure 1. In this section, we describe the model in detail, showing how it can be trained efficiently for arbitrarily large label sets and how it is related to previous models.

[Figure 1: Each encoded text and label are projected to a joint input-label multiplicative space, the output of which is processed by a classification unit with label-set-size independent parametrization.]

3.1 A Generalized Input-Label Embedding

Let $g_{in}(h)$ and $g_{out}(e_j)$ be two non-linear projections of the encoded input, i.e. the document $h$, and any encoded label $e_j$, where $e_j$ is the $j$th row vector of the label embedding matrix $\mathbf{E}$, which have the following form:

$e'_j = g_{out}(e_j) = \sigma(e_j U + b_u)$  (7)
$h' = g_{in}(h) = \sigma(V h + b_v),$  (8)

where $\sigma(\cdot)$ is a non-linear activation function such as ReLU or Tanh, the matrix $U \in \mathbb{R}^{d \times d_j}$ and bias $b_u \in \mathbb{R}^{d_j}$ are the linear projection of the labels, and the matrix $V \in \mathbb{R}^{d_j \times d_h}$ and bias $b_v \in \mathbb{R}^{d_j}$ are the linear projection of the encoded input. Note that the projections for $h'$ and $e'_j$ could be high-rank or low-rank depending on their initial dimensions and the target joint space dimension. Also let $\mathbf{E}' \in \mathbb{R}^{|\mathcal{Y}| \times d_j}$ be the matrix resulting from projecting all the outputs $e_j$ to the joint space, i.e. $g_{out}(\mathbf{E})$. The conditional output probability distribution can now be re-written as:

$p(y|x) \propto \exp(\mathbf{E}' h') \propto \exp\big(g_{out}(\mathbf{E})\, g_{in}(h)\big) \propto \exp\big(\underbrace{\sigma(\mathbf{E} U + b_u)}_{\text{Label structure}}\ \underbrace{\sigma(V h + b_v)}_{\text{Input structure}}\big).$  (9)

Crucially, this function has no label-set-size dependent parameters, unlike $W$ and $b$ in Eq. 3. In principle, this parametrization can be used for both multi-class and multi-label problems by defining the exponential in terms of a softmax or a sigmoid function respectively. However, in this paper we will focus on the latter.

3.2 Classification Unit

We require that our classification unit parameters depend only on the joint input-label space above. To represent the compatibility between any encoded input text $h_i$ and any encoded label $e_j$ for this task, we define their joint representation based on multiplicative interactions in the joint space:

$g_{joint}^{(ij)} = g_{in}(h_i) \odot g_{out}(e_j),$  (10)

where $\odot$ is component-wise multiplication. The probability for $h_i$ to belong to one of the $k$ known labels is modeled by a linear unit which maps any point in the joint space into a score which indicates the validity of the combination:

$p_{val}^{(ij)} = g_{joint}^{(ij)} w + b,$  (11)

where $w \in \mathbb{R}^{d_j}$ is a weight vector and $b$ is a scalar bias. We compute the output of this linear unit for each known label which we would like to predict for a given document $i$, namely:

$P_{val}^{(i)} = \begin{bmatrix} p_{val}^{(i1)} \\ p_{val}^{(i2)} \\ \vdots \\ p_{val}^{(ik)} \end{bmatrix} = \begin{bmatrix} g_{joint}^{(i1)} w + b \\ g_{joint}^{(i2)} w + b \\ \vdots \\ g_{joint}^{(ik)} w + b \end{bmatrix}.$  (12)

For each row, the higher the value, the more likely the label is to be assigned to the document. To obtain valid probability estimates and be able to train with binary cross-entropy loss for multi-label classification, we apply a sigmoid function as follows:

$\hat{y}_i = \hat{p}(y_i|x_i) = \frac{1}{1 + e^{-P_{val}^{(i)}}}.$  (13)
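The following PyTorch sketch puts Eqs. 7-13 together in one module; it is a minimal re-implementation for illustration rather than the released code (github.com/idiap/gile), and the choice of ReLU for $\sigma$ as well as the class and variable names are our own assumptions.

```python
import torch
import torch.nn as nn

class GILEOutputLayer(nn.Module):
    """Generalized input-label output layer: two non-linear projections into a
    joint space of size d_j, multiplicative joint representations, and a
    label-set-size independent classification unit (Eqs. 7-13)."""
    def __init__(self, d_label, d_hidden, d_joint):
        super().__init__()
        self.label_proj = nn.Linear(d_label, d_joint)   # U and b_u (Eq. 7)
        self.input_proj = nn.Linear(d_hidden, d_joint)  # V and b_v (Eq. 8)
        self.unit = nn.Linear(d_joint, 1)               # w and b (Eq. 11)

    def forward(self, E, h):
        # E: (k, d_label) encoded label descriptions, h: (batch, d_hidden) documents.
        E_joint = torch.relu(self.label_proj(E))         # Eq. 7  -> (k, d_j)
        h_joint = torch.relu(self.input_proj(h))         # Eq. 8  -> (batch, d_j)
        g = h_joint.unsqueeze(1) * E_joint.unsqueeze(0)  # Eq. 10 -> (batch, k, d_j)
        p_val = self.unit(g).squeeze(-1)                 # Eqs. 11-12 -> (batch, k)
        return torch.sigmoid(p_val)                      # Eq. 13: multi-label probs

# Example: score 1,000 candidate labels for a batch of 4 documents.
layer = GILEOutputLayer(d_label=100, d_hidden=100, d_joint=500)
probs = layer(torch.randn(1000, 100), torch.randn(4, 100))   # (4, 1000)
```

None of the parameters ($U$, $b_u$, $V$, $b_v$, $w$, $b$) depends on the number of labels $k$, so the same module can score any label for which a description embedding is available, including labels unseen during training.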
Summary. By adding the above changes to the general form of Eq. 9, the conditional probability $p(y_i|x_i)$ is now proportional to the following quantity:

$\exp\big(\sigma(\mathbf{E} U + b_u)\, (\sigma(V h + b_v) \odot w) + b\big).$  (14)

Note that the number of parameters in this equation is independent of the size of the label set, given that $U$, $V$, $w$ and $b$ depend only on $d_j$, and $k$ can vary arbitrarily. This allows the model to scale up to large label sets and generalize to unseen labels. Lastly, the proposed output layer addresses all the limitations of the previous models, as follows: (i) it is able to capture complex structure in the joint input-output space, (ii) it provides a means to easily control its capacity $d_j$, and (iii) it is trainable with cross-entropy loss.

3.3 Training Objectives

The training objective for the multi-label classification task is based on binary cross-entropy loss. Assuming $\theta$ contains all the parameters of the model, the training loss is computed as follows:

$\mathcal{L}(\theta) = -\frac{1}{Nk} \sum_{i=1}^{N} \sum_{j=1}^{k} H(y_{ij}, \hat{y}_{ij}),$  (15)

where $H$ is the binary cross-entropy between the gold label $y_{ij}$ and the predicted label $\hat{y}_{ij}$ for a document $i$ and a candidate label $j$.

We handle multiple languages according to Firat et al. (2016) and Pappas and Popescu-Belis (2017). Assuming that $\Theta = \{\theta_1, \theta_2, \dots, \theta_M\}$ are all the parameters required for each of the $M$ languages, we use a joint multilingual objective based on the sum of cross-entropy losses:

$\mathcal{L}(\Theta) = -\frac{1}{Z} \sum_{i}^{N_e} \sum_{l}^{M} \sum_{j=1}^{k} H(y_{ij}^{(l)}, \hat{y}_{ij}^{(l)}),$  (16)

where $Z = N_e M k$, with $N_e$ being the number of examples per epoch. At each iteration, a document-label pair for each language is sampled. In addition, multilingual models share a certain subset of the encoder parameters during training while the output layer parameters are kept language-specific, as described by Pappas and Popescu-Belis (2017). In this paper, we share most of the output layer parameters, namely the ones from the input-label space ($U$, $V$, $b_v$, $b_u$), and we keep only the classification unit parameters ($w$, $b$) language-specific.

3.4 Scaling Up to Large Label Sets

For a very large number $d_j$ of joint-space dimensions in our parametrization, the computational complexity increases prohibitively because our projection requires a large matrix multiplication between $U$ and $\mathbf{E}$, which depends on $|\mathcal{Y}|$. In such cases, we resort to sampling-based training, by adopting the commonly used negative sampling method proposed by Mikolov et al. (2013). Let $x_i \in \mathbb{R}^d$ and $y_{ik} \in \{0,1\}$ be an input-label pair and $\hat{y}_{ik}$ the output probabilities from our model (Eq. 14). By introducing the sets $k_i^p$ and $k_i^n$, which contain the indices of the positive and negative labels respectively for the $i$-th input, the loss $\mathcal{L}(\theta)$ in Eq. 15 can be re-written as follows:

$\mathcal{L}(\theta) = -\frac{1}{Z} \sum_{i=1}^{N} \sum_{j=1}^{k} \big[ y_{ij} \log \hat{y}_{ij} + \bar{y}_{ij} \log(1 - \hat{y}_{ij}) \big] = -\frac{1}{Z} \sum_{i=1}^{N} \Big[ \sum_{j \in k_i^p} \log \hat{y}_{ij} + \sum_{j \in k_i^n} \log(1 - \hat{y}_{ij}) \Big],$  (17)

where $Z = Nk$ and $\bar{y}_{ij}$ is $(1 - y_{ij})$. To reduce the computational cost needed to evaluate $\hat{y}_{ij}$ for all of the negative label set $k_i^n$, we sample $k^*$ labels from the negative label set, each with probability $p = \frac{1}{|k_i^n|}$, to create a reduced set which replaces $k_i^n$ in Eq. 17. This enables training on arbitrarily big label sets without increasing the computation required. By controlling the number of samples we can drastically speed up the training time, as we demonstrate empirically in Section 4.2.2. Exploring more informative sampling methods, e.g. importance sampling, would be an interesting direction of future work.
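As an illustration of this sampling scheme, the sketch below (our own, with hypothetical names such as gile_layer and E_labels) selects the label indices on which the output layer is evaluated: all positive labels plus a fixed number of uniformly sampled negatives.

```python
import torch

def sample_label_indices(y_true, num_neg):
    """Keep all positive labels and sample num_neg negative labels uniformly,
    so that the output layer is evaluated only on this subset (Eq. 17)."""
    pos = (y_true == 1).nonzero(as_tuple=True)[0]
    neg = (y_true == 0).nonzero(as_tuple=True)[0]
    neg = neg[torch.randperm(neg.numel())[:num_neg]]
    idx = torch.cat([pos, neg])
    targets = torch.cat([torch.ones(pos.numel()), torch.zeros(neg.numel())])
    return idx, targets

# Usage with the GILEOutputLayer sketched in Section 3.2, for one document h:
# idx, targets = sample_label_indices(y_true, num_neg=256)
# probs = gile_layer(E_labels[idx], h.unsqueeze(0)).squeeze(0)   # (|idx|,)
# loss = torch.nn.functional.binary_cross_entropy(probs, targets)
```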
3.5 Relation to Previous Parametrizations

The proposed embedding form can be seen as a generalization over the input-label embeddings with a bilinear form, because its degenerate form is equivalent to the bilinear form of Eq. 5. In particular, this can be derived simply if we set one of the two non-linear projection functions in the second line of Eq. 9 to be the identity function, e.g. $g_{out}(\cdot) = I$, set all biases to zero, and make the $\sigma(\cdot)$ activation function linear, as follows:

$\sigma(\mathbf{E} U + b_u)\, \sigma(V h + b_v) = (\mathbf{E} I)(V h) = \mathbf{E} V h,$  (18)

where $V$ by consequence has the same number of dimensions as $W \in \mathbb{R}^{d \times d_h}$ from the bilinear input-label embedding model of Eq. 5.

4 Experiments

The evaluation is performed on large-scale biomedical semantic indexing using the BioASQ dataset, obtained by Nam et al. (2016), and on multilingual news classification using the DW corpus, which consists of eight language datasets obtained by Pappas and Popescu-Belis (2017). The statistics of these datasets are listed in Table 1.

  Dataset    #count (docs)  #words    w̄d    #count (labels)  w̄l
  BioASQ     11,705,534     528,156   214    26,104          35.0
  DW            598,304     884,272   436     5,637           2.3
  – en          112,816     110,971   516     1,385           2.1
  – de          132,709     261,280   424     1,176           1.8
  – es           75,827     130,661   412       843           4.7
  – pt           39,474      58,849   571       396           1.8
  – uk           35,423     105,240   342       288           1.7
  – ru          108,076     123,493   330       916           1.8
  – ar           57,697      58,922   357       435           2.4
  – fa           36,282      34,856   538       198           2.5

Table 1: Dataset statistics: #count is the number of documents, #words is the number of unique words in the vocabulary V, and w̄d and w̄l are the average number of words per document and label respectively.

4.1 Biomedical Text Classification

We evaluate on biomedical text classification to demonstrate that our generalized input-label model scales to very large label sets and performs better than previous joint input-label models in both the seen and unseen label prediction scenarios.

4.1.1 Settings

We follow the exact evaluation protocol, data and settings of Nam et al. (2016), as described below. We use the BioASQ Task 3a dataset, which is a collection of scientific publications in biomedical research. The dataset contains about 12M documents, each labeled with around 11 labels out of 27,455, which are defined according to the Medical Subject Headings (MeSH) hierarchy. The data was minimally pre-processed with tokenization, number replacements (NUM) and rare word replacements (UNK), and split with the provided script by year, so that the training set includes all documents until 2004 and the ones from 2005 to 2015 were kept for the test set; this corresponded to 6,692,815 documents for training and 4,912,719 for testing. For validation, a set of 100,000 documents was randomly sampled from the training set. We report the same ranking-based evaluation metrics as Nam et al. (2016), namely rank loss (RL), average precision (AvgPr) and one-error loss (OneErr).

Our hyper-parameters were selected on validation data based on average precision as follows: 100-dimensional word embeddings, encoder and attention (same dimensions as the baselines), a joint input-label embedding of dimension 500, batch size of 64, a maximum of 300 words per document and 50 words per label, ReLU activation, 0.3% negative label sampling, and optimization with ADAM until convergence. The word embeddings were learned end-to-end on the task.⁴

4 Here, the word embeddings are included in the parameter statistics because they are variables of the network.

The baselines are the joint input-label models from Nam et al. (2016), noted as [N16], namely:
• WSABIE+: This model is an extension of the original WSABIE model by Weston et al. (2011) which, instead of learning a ranking model with fixed document features, jointly learns features for documents and words, and is trained with the WARP ranking loss.

• AiTextML: This model is the one proposed by Nam et al. (2016) with the purpose of jointly learning representations of documents, labels and words, along with a joint input-label space which is trained with the WARP ranking loss.

The scores of the WSABIE+ and AiTextML baselines in Table 2 are the ones reported by Nam et al. (2016). In addition, we report scores of a word-level attention neural network (WAN) with DENSE encoder and attention followed by a sigmoid output layer, trained with binary cross-entropy loss.⁵ Our model replaces WAN's output layer with a generalized input-label embedding layer and its variations, noted GILE-WAN. For comparison, we also compare to bilinear input-label embedding versions of WAN for the model by Yazdani and Henderson (2015), noted as BIL-WAN [YH15], and the one by Nam et al. (2016), noted as BIL-WAN [N16]. Note that the AiTextML parameter space is huge (linear with respect to the number of labels and documents), which makes learning difficult. In contrast, we make sure that our models have far fewer parameters than the baselines (Table 2).

5 In our preliminary experiments, we also trained the neural model with a hinge loss as in WSABIE+ and AiTextML, but it performed similarly to them and much worse than WAN, so we did not further experiment with it.

              Model                  Layer form (output)  Dim   Seen: RL  AvgPr  OneErr   Unseen: RL  AvgPr  OneErr   Params
  [N16]       WSABIE+                E W h_t              100   5.21      36.64  41.72    48.81       0.37   99.94    722.10M
              AiTextML avg           E W h_t              100   3.54      32.78  25.99    52.89       0.39   99.94    724.47M
              AiTextML inf           E W h_t              100   3.54      32.78  25.99    21.62       2.66   98.61    724.47M
  Baselines   WAN                    W^T h_t              –     1.53      42.37  11.23    –           –      –         55.60M
              BIL-WAN [YH15]         σ(EW) W h_t          100   1.21      40.68  17.52    18.72       9.50   93.89     52.85M
              BIL-WAN [N16]          E W h_t              100   1.12      41.91  16.94    16.26      10.55   93.23     52.84M
  Ours        GILE-WAN               σ(EU) σ(V h_t)       500   0.78      44.39  11.60     9.06      12.95   91.90     52.93M
              – constrained d_j      σ(EW) σ(W h_t)       100   1.01      37.71  16.16    10.34      11.21   93.38     52.85M
              – only label (Eq. 6a)  σ(EW) h_t            100   1.06      40.81  13.77     9.77      14.71   90.56     52.84M
              – only input (Eq. 6b)  E σ(W h_t)           100   1.07      39.78  15.67    19.28       7.18   95.91     52.84M

Table 2: Biomedical semantic indexing results computed over labels seen and unseen during training, i.e. the full-resource versus zero-resource settings. Best scores among the competing models are marked in bold.

4.1.2 Results

The results on biomedical semantic indexing on seen and unseen labels are shown in Table 2. We observe that the neural baseline, WAN, outperforms WSABIE+ and AiTextML on the seen labels, namely by +5.73 and +9.59 points in terms of AvgPr respectively. The differences are even more pronounced when considering the ranking loss and one-error metrics. This result is compatible with previous findings that existing joint input-label models are not able to outperform strong supervised baselines on seen labels. However, WAN is not able to generalize at all to unseen labels, hence WSABIE+ and AiTextML have a clear advantage in the zero-resource setting.

In contrast, our generalized input-label model, GILE-WAN, outperforms WAN even on seen labels, where our model has higher average precision by +2.02 points, better ranking loss by +43% and comparable OneErr (−3%).
This gain does not come at the expense of performance on unseen labels. GILE-WAN outperforms the WSABIE+ and AiTextML variants⁶ by a large margin in both cases, e.g. by +7.75 and +11.61 points on seen labels and by +12.58 and +10.29 points in terms of average precision on unseen labels, respectively. Interestingly, our GILE-WAN model also outperforms the two previous bilinear input-label embedding formulations of Yazdani and Henderson (2015) and Nam et al. (2016), namely BIL-WAN [YH15] and BIL-WAN [N16], by +3.71 and +2.48 points on seen labels and +3.45 and +2.39 points on unseen labels, respectively, even when they are trained with the same encoders and loss as ours. These models are not able to outperform the WAN baseline when evaluated on the seen labels, namely they have −1.68 and −0.46 points lower average precision than WAN, but they outperform WSABIE+ and AiTextML on both seen and unseen labels. Overall, the results show a clear advantage of our generalized input-label embedding model against previous models on both seen and unseen labels.

6 Namely, avg when using the average of word vectors and inf when using inferred label vectors to make predictions.

4.1.3 Ablation Analysis

To evaluate the effectiveness of individual components of our model, we performed an ablation study (last three rows in Table 2). Note that when we use only the label or only the input embedding in our generalized input-label formulation, the dimensionality of the joint space is constrained to be the dimensionality of the encoded labels or inputs respectively, that is $d_j = 100$ in our experiments. All three variants of our model outperform the previous embedding formulations of Nam et al. (2016) and Yazdani and Henderson (2015) in all metrics except for AvgPr on seen labels, where they score slightly lower. The decrease in AvgPr for our model variants with $d_j = 100$ compared to the neural baselines could be attributed to the difficulty of learning the parameters of a highly non-linear space with only a few hidden dimensions. Indeed, when we increase the number of dimensions ($d_j = 500$), our full model outperforms them by a large margin. Recall that this increase in capacity is only possible with our full model definition in Eq. 9; none of the other variants allow us to do this without interfering with the original dimensionality of the encoded labels ($\mathbf{E}$) and input ($h_t$). In addition, our model variants with $d_j = 100$ exhibit consistently higher scores than the baselines in terms of most metrics on both seen and unseen labels, which suggests that they are able to capture more complex relationships across labels and between encoded inputs and labels.

Overall, the best performance among our model variants is achieved when using only the label embedding, and, hence, it is the most significant component of our model. Surprisingly, our model with only the label embedding achieves higher performance than our full model on unseen labels, but it is far behind our full model when we consider performance on both seen and unseen labels. When we constrain our full model to have the same dimensionality as the other variants, i.e. $d_j = 100$, it outperforms the one that uses only the input embedding in most metrics and is outperformed by the one that uses only the label embedding.
4.2 Multilingual News Text Classification

We evaluate on multilingual news text classification to demonstrate that our output layer based on the generalized input-label embedding outperforms previous models with a typical output layer in a wide variety of settings, even for labels which have been seen during training.

4.2.1 Settings

We follow the exact evaluation protocol, data and settings of Pappas and Popescu-Belis (2017), as described below. The dataset is split per language into 80% for training, 10% for validation and 10% for testing. We evaluate on both types of labels (general, Yg, and specific, Ys) in a full-resource scenario, and we evaluate only on the general labels (Yg) in a low-resource scenario. Accuracy is measured with micro-averaged F1 percentage scores. The word embeddings for this task are the aligned pre-trained 40-dimensional multi-CCA multilingual word embeddings by Ammar et al. (2016) and are kept fixed during training.⁷ The sentences are already truncated at a length of 30 words and the documents at a length of 30 sentences. The hyper-parameters were selected on validation data as follows: 100-dimensional encoder and attention, ReLU activation, batch size of 16, epoch size of 25k, no negative sampling (all labels are used) and optimization with ADAM until convergence.

7 The word embeddings are not included in the parameter statistics because they are not variables of the network.

To ensure equal capacity to the baselines, we use approximately the same total number of parameters $n_{tot}$ as the baseline classification layers, by setting:

$d_j \simeq \frac{d_h \cdot |k^{(i)}|}{d_h + d}, \quad i = 1, \dots, M,$  (19)

in the monolingual case, and similarly, $d_j \simeq \big(d_h \cdot \sum_{i=1}^{M} |k^{(i)}|\big) / (d_h + d)$ in the multilingual case, where $k^{(i)}$ is the number of labels in language $i$.

The hierarchical models have DENSE encoders in all scenarios (Tables 3, 6, and 7), except for the varying-encoder experiment (Table 4). For the low-resource scenario, the levels of data availability are: tiny from 0.1% to 0.5%, small from 1% to 5%, and medium from 10% to 50% of the original training set. For each level, the average F1 across discrete increments of 0.1, 1 and 10 respectively is reported. The decision thresholds, which were tuned on validation data by Pappas and Popescu-Belis (2017), are set as follows: for the full-resource scenario the threshold is 0.4 for |Ys| < 400 and 0.2 for |Ys| ≥ 400, and for the low-resource scenario it is 0.3 for all sets.
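As a quick worked example of the capacity-matching rule in Eq. 19 (our own illustration; $d_h = 100$ and $d = 40$ follow the settings above, while the label count is taken from the English row of Table 1 purely to show the arithmetic):

```python
# Worked example of Eq. 19 (illustrative only).
d_h = 100    # encoder dimension (settings above)
d = 40       # multilingual word embedding dimension (settings above)
k = 1385     # e.g. the English label count from Table 1
d_j = round(d_h * k / (d_h + d))
print(d_j)   # -> 989: a joint-space size whose parameter count (d_j * (d + d_h))
             #    roughly matches a standard d_h x k classification layer
```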
The baselines are all the monolingual and multilingual neural networks from Pappas and Popescu-Belis (2017)⁸, noted as [PB17], namely:

8 For reference, in Table 4 we also compare to a logistic regression trained with unigrams over the full vocabulary and over the top-10% most frequent words by Mrini et al. (2017), noted as [M17], which use the same settings and data.

• NN: A neural network which feeds the average vector of the input words directly to a classification layer, as the one used by Klementiev et al. (2012).

• HNN: A hierarchical network with encoders and average pooling at every level, followed by a classification layer, as the one used by Tang et al. (2015).

• HAN: A hierarchical network with encoders and attention, followed by a classification layer, as the one used by Yang et al. (2016).

• MHAN: Three multilingual hierarchical networks with shared encoders, noted MHAN-Enc, shared attention, noted MHAN-Att, and shared attention and encoders, noted MHAN-Both, as the ones used by Pappas and Popescu-Belis (2017).

  Yg                Languages (en + aux → en)                 Languages (en + aux → aux)               Stat.
  Model             de   es   pt   uk   ru   ar   fa    |    de   es   pt   uk   ru   ar   fa    |   avg
  [PB17, mono]
  NN (Avg)          50.7  .    .    .    .    .    .    |    53.1 70.0 57.2 80.9 59.3 64.4 66.6  |   57.6
  HNN (Avg)         70.0  .    .    .    .    .    .    |    67.9 82.5 70.5 86.8 77.4 79.0 76.6  |   73.6
  HAN (Att)         71.2  .    .    .    .    .    .    |    71.8 82.8 71.3 85.3 79.8 80.5 76.6  |   74.7
  [PB17, multi]
  MHAN-Enc          71.0 69.9 69.2 70.8 71.5 70.0 71.3  |    69.7 82.9 69.7 86.8 80.3 79.0 76.0  |   74.1
  MHAN-Att          74.0 74.2 74.1 72.9 73.9 73.8 73.3  |    72.5 82.5 70.8 87.7 80.5 82.1 76.3  |   76.3
  MHAN-Both         72.8 71.2 70.5 65.6 71.1 68.9 69.2  |    70.4 82.8 71.6 87.5 80.8 79.1 77.1  |   74.2
  [Ours, mono]
  GILE-NN (Avg)     60.1  .    .    .    .    .    .    |    60.3 76.6 62.1 82.0 65.7 77.4 68.6  |   65.2
  GILE-HNN (Avg)    74.8  .    .    .    .    .    .    |    71.3 83.3 72.6 88.3 81.5 81.9 77.1  |   77.1
  GILE-HAN (Att)    76.5  .    .    .    .    .    .    |    74.2 83.4 71.9 86.1 82.7 81.0 77.2  |   78.0
  [Ours, multi]
  GILE-MHAN-Enc     75.1 74.0 72.7 70.7 74.4 73.5 73.2  |    72.7 83.4 73.0 88.7 82.8 83.3 77.4  |   76.7
  GILE-MHAN-Att     76.5 76.5 76.3 75.3 76.1 75.6 75.2  |    74.5 83.5 72.7 88.0 83.4 82.1 76.7  |   78.0
  GILE-MHAN-Both    75.3 73.7 72.1 67.2 72.5 73.8 69.7  |    72.6 84.0 73.5 89.0 81.9 82.0 77.7  |   76.0

  Ys                Languages (en + aux → en)                 Languages (en + aux → aux)               Stat.
  Model             de   es   pt   uk   ru   ar   fa    |    de   es   pt   uk   ru   ar   fa    |   avg
  [PB17, mono]
  NN (Avg)          24.4  .    .    .    .    .    .    |    21.8 22.1 24.3 33.0 26.0 24.1 32.1  |   25.3
  HNN (Avg)         39.3  .    .    .    .    .    .    |    39.6 37.9 33.6 42.2 39.3 34.6 43.1  |   38.9
  HAN (Att)         43.4  .    .    .    .    .    .    |    44.8 46.3 41.9 46.4 45.8 41.2 49.4  |   44.2
  [PB17, multi]
  MHAN-Enc          45.4 45.9 44.3 41.1 42.1 44.9 41.0  |    43.9 46.2 39.3 47.4 45.0 37.9 48.6  |   43.8
  MHAN-Att          46.3 46.0 45.9 45.6 46.4 46.4 46.1  |    46.5 46.7 43.3 47.9 45.8 41.3 48.0  |   45.8
  MHAN-Both         45.7 45.6 41.5 41.2 45.6 44.6 43.0  |    45.9 46.4 40.3 46.3 46.1 40.7 50.3  |   44.5
  [Ours, mono]
  GILE-NN (Avg)     27.5  .    .    .    .    .    .    |    27.5 28.4 29.2 36.8 31.6 32.1 35.6  |   29.5
  GILE-HNN (Avg)    43.1  .    .    .    .    .    .    |    43.4 42.0 37.7 43.0 42.9 36.6 44.1  |   42.2
  GILE-HAN (Att)    45.9  .    .    .    .    .    .    |    47.3 47.4 42.6 46.6 46.9 41.9 48.6  |   45.9
  [Ours, multi]
  GILE-MHAN-Enc     46.0 46.6 41.2 42.5 46.4 43.4 41.8  |    47.2 47.7 41.5 49.5 46.6 41.4 50.7  |   45.1
  GILE-MHAN-Att     47.3 47.0 45.8 45.5 46.2 46.5 45.5  |    47.6 47.9 43.5 49.1 46.5 42.2 50.3  |   46.5
  GILE-MHAN-Both    47.0 46.7 42.8 42.0 45.6 42.8 39.3  |    48.0 47.6 43.1 48.5 46.0 42.1 49.0  |   45.0

Table 3: Full-resource classification results on general (upper half, Yg) and specific (lower half, Ys) labels using monolingual and bilingual models with DENSE encoders on English as target (left) and the auxiliary language as target (right). The average bilingual F1-score (%) is noted avg and the top ones per block are underlined. The monolingual scores on the left come from a single model, hence a single score is repeated multiple times; the repetition is marked with consecutive dots.
To ensure a controlled comparison to the above baselines, for each model we evaluate a version where the output layer is replaced by our generalized input-label embedding output layer using the same number of parameters; these have the abbreviation "GILE" prepended to their name (e.g. GILE-HAN). The scores of the HAN and MHAN models in Tables 3, 6 and 7 are the ones reported by Pappas and Popescu-Belis (2017), while for Table 4 we train them ourselves using their code. Lastly, the best score for each pairwise comparison between a joint input-label model and its counterpart is marked in bold.

4.2.2 Results

Table 3 displays the results of full-resource document classification using DENSE encoders for both general and specific labels. On the left, we display the performance of models on the English sub-corpus when English and an auxiliary language are used for training, and on the right, the performance on the auxiliary language sub-corpus when that language and English are used for training. The results show that in 98% of comparisons on general labels (top half of Table 3) the joint input-label models improve consistently over the corresponding models using a typical sigmoid classification layer. This finding validates our main hypothesis that the joint input-label models successfully exploit the semantics of the labels, which provide useful cues for classification, as opposed to models which are agnostic to label semantics. The results for specific labels (bottom half of Table 3) demonstrate the same trend, with the joint input-label models performing better in 87% of comparisons.

          Model             en    de    es    pt    uk    ru    ar    fa    nl     fl
  [M17]   LogReg-BOW        75.8  72.9  81.4  74.3  91.0  79.2  82.0  77.0  26M    79.19
          LogReg-BOW-10%    74.7  70.1  80.6  71.1  89.5  76.5  80.8  75.5  5M     77.35
  [PB17]  HAN-BIGRU         76.3  74.1  84.5  72.9  87.7  82.9  81.7  75.3  377K   79.42
          HAN-GRU           77.1  72.5  84.0  70.8  86.6  83.0  82.9  76.0  138K   79.11
          HAN-DENSE         71.2  71.8  82.8  71.3  85.3  79.8  80.5  76.6  50K    77.41
  Ours    GILE-HAN-BIGRU    78.1  73.6  84.9  72.5  89.0  82.4  82.5  75.8  377K   79.85
          GILE-HAN-GRU      77.1  72.6  84.7  72.4  88.6  83.6  83.4  76.0  138K   79.80
          GILE-HAN-DENSE    76.5  74.2  83.4  71.9  86.1  82.7  82.6  77.2  50K    79.12

Table 4: Full-resource classification results on general (Yg) topic labels with DENSE and GRU encoders. Reported are also the average number of parameters per language (nl) and the average F1 per language (fl).

In Table 5, we also directly compare our embedding to previous bilinear input-label embedding formulations when using the best monolingual configuration (HAN) from Table 3, exactly as done in Section 4.1. The results on the general labels show that GILE outperforms the previous bilinear input-label models, BIL [YH15] and BIL [N16], by +1.62 and +3.3 percentage points on average respectively. This difference is much more pronounced on the specific labels, where the label set is much larger, namely +6.5 and +13.5 percentage points respectively. Similarly, our model with constrained dimensionality is also as good as or better on average than the bilinear input-label models, by +0.9 and +2.2 on general labels and by -0.5 and +6.1 on specific labels respectively, which highlights the importance of learning non-linear relationships across encoded labels and documents. Among our ablated model variants, as in the previous section, the best is the one with only the label projection, but it is still worse than our full model by 5.2 percentage points. The improvements of GILE over each baseline are significant and consistent on both datasets. Hence, in the following experiments we will only consider the best of these alternatives.
The best bilingual performance on average is that of the GILE-MHAN-Att model, for both general and specific labels. This improvement can be attributed to the effective sharing of label semantics across languages through the joint multilingual input-label output layer. Effectively, this model has the same multilingual sharing scheme as the best model reported by Pappas and Popescu-Belis (2017), MHAN-Att, namely sharing attention at each level of the hierarchy, which agrees well with their main finding. Interestingly, the improvement holds when using different types of hierarchical encoders, namely DENSE, GRU and biGRU, as shown in Table 4, which demonstrates the generality of the approach. In addition, our best models outperform logistic regression trained either on the top-10% most frequent words or on the full vocabulary, even though our models utilize many fewer parameters, namely 377K/138K vs. 26M/5M. Increasing the capacity of our models should lead to even further improvements.

  Yg: HAN output layer   en    de    es    pt    uk    ru    ar    fa
  Linear [PB17]          71.2  71.8  82.8  71.3  85.3  79.8  80.5  76.6
  BIL [YH15]             71.7  70.5  82.0  71.1  86.6  80.6  80.4  76.0
  BIL [N16]              69.8  69.1  80.9  67.4  87.5  79.9  78.4  75.1
  GILE (Ours)            76.5  74.2  83.4  71.9  86.1  82.7  82.6  77.2
  – constrained d_j      73.6  73.1  83.3  71.0  87.1  81.6  80.4  76.4
  – only label           71.4  69.6  82.1  70.3  86.2  80.6  81.1  76.2
  – only input           55.1  54.2  80.6  66.5  85.6  60.8  78.9  74.0

  Ys: HAN output layer   en    de    es    pt    uk    ru    ar    fa
  Linear [PB17]          43.4  44.8  46.3  41.9  46.4  45.8  41.2  49.4
  BIL [YH15]             40.7  37.8  38.1  33.5  44.6  38.1  39.1  42.6
  BIL [N16]              34.4  30.2  34.4  33.6  31.4  22.8  35.6  38.9
  GILE (Ours)            45.9  47.3  47.4  42.6  46.6  46.9  41.9  48.6
  – constrained d_j      38.5  38.0  36.8  35.1  42.1  36.1  36.7  48.7
  – only label           38.4  41.5  42.9  38.3  44.0  39.3  37.2  43.4
  – only input           12.1  10.8   8.8  20.5  11.8   7.8  12.0  24.6

Table 5: Direct comparison with previous bilinear input-label models, namely BIL [YH15] and BIL [N16], and with our ablated model variants, using the best monolingual configuration (HAN) from Table 3 on both general (upper half) and specific (lower half) labels. Best scores among the competing models are marked in bold.

Multilingual learning. So far, we have shown that the proposed joint input-label models outperform typical neural models when training with one and two languages. Does the improvement remain when increasing the number of languages even more? To answer this question we report in Table 6 the average F1-score per language for the best baselines from the previous experiment (HAN and MHAN-Att) and the proposed joint input-label versions of them (GILE-HAN and GILE-MHAN-Att) when increasing the number of languages (1, 2 and 8) used for training. Overall, we observe that the joint input-label models outperform all the baselines independently of the number of languages involved in the training, while having the same number of parameters. We also replicate the previous result that a second language helps, but beyond that there is no improvement.

          Model        #lang.   General labels: nl   fl      Specific labels: nl   fl
  [PB17]  HAN          1        50K                  77.41   90K                   44.90
          MHAN         2        40K                  78.30   80K                   45.72
          MHAN         8        32K                  77.91   72K                   45.82
  Ours    GILE-HAN     1        50K                  79.12   90K                   45.90
          GILE-MHAN    2        40K                  79.68   80K                   46.49
          GILE-MHAN    8        32K                  79.48   72K                   46.32

Table 6: Multilingual learning results. The columns are the average number of parameters per language (nl) and the average F1 per language (fl).

Low-resource transfer. We investigate here whether joint input-label models are useful for low-resource languages.
Table 7 shows the low-resource classification results from English to seven other languages when varying the amount of their training data. Our model with both shared encoders and attention, GILE-MHAN, outperforms the previous models on average, namely HAN (Yang et al., 2016) and MHAN (Pappas and Popescu-Belis, 2017), for low-resource classification in the majority of the cases.

  Transfer   Level       HAN [PB17]   MHAN [PB17]   GILE-MHAN (Ours)
  en → de    0.1-0.5%    29.9         39.4          42.9
             1-5%        51.3         52.6          51.6
             10-50%      63.5         63.8          65.9
  en → es    0.1-0.5%    39.5         41.5          39.0
             1-5%        45.6         50.1          50.9
             10-50%      74.2         75.2          76.4
  en → pt    0.1-0.5%    30.9         33.8          39.6
             1-5%        44.6         47.3          48.9
             10-50%      60.9         62.1          62.3
  en → uk    0.1-0.5%    60.4         60.9          61.1
             1-5%        68.2         69.0          69.4
             10-50%      76.4         76.7          76.5
  en → ru    0.1-0.5%    27.6         29.1          27.9
             1-5%        39.3         40.2          40.2
             10-50%      69.2         69.4          70.4
  en → ar    0.1-0.5%    35.4         36.6          46.1
             1-5%        45.6         46.6          49.5
             10-50%      48.9         47.8          61.8
  en → fa    0.1-0.5%    36.0         41.3          42.5
             1-5%        55.0         55.5          55.4
             10-50%      69.2         70.0          69.7

Table 7: Low-resource classification results with various sizes of training data using the general labels.

The shared input-label space appears to be helpful especially when transferring from English to German, Portuguese and Arabic. GILE-MHAN is significantly behind MHAN when transferring knowledge from English to Spanish and to Russian in the 0.1-0.5% resource setting, but in the rest of the cases they have very similar scores.

Label sampling. To speed up computation, it is possible to train our model by sampling labels instead of training over the whole label set. How much speed-up can we achieve with this label sampling approach while still retaining good levels of performance? In Figure 2, we attempt to answer this question by reporting the performance of our GILE-HNN model when varying the percentage of labels that it uses for training over the English general and specific labels of the DW dataset. In both cases, the performance of GILE-HNN tends to increase as the percentage of labels sampled increases, but it levels off for the higher percentages. For general labels, top performance is reached with a 40% to 50% sampling rate, which translates to a 22% to 18% speedup, while for the specific labels it is reached with a 60% to 70% sampling rate, which translates to a 40% to 36% speedup.

[Figure 2: Varying sampling percentage for general and specific English labels. (Top) GILE-HNN is compared against HNN in terms of F1 (%). (Bottom) The runtime speedup over GILE-HNN trained on the full label set.]

The speedup is correlated with the size of the label set, since there are many fewer general labels than specific labels, namely 327 vs. 1,058 here. Hence, we expect even higher speedups for bigger label sets. Interestingly, GILE-HNN with label sampling reaches the performance of the baseline with a 25% and 60% sample for general and specific labels respectively. This translates to a speedup of 30% and 50% respectively compared to a GILE-HNN trained over all labels. Overall, these results show that our model is effective and that it can also scale to large label sets. The label sampling should also be useful in tasks where the computation resources are limited or budgeted.

5 Related Work

5.1 Neural Text Classification

Research in neural text classification was initially based on feed-forward networks, which required unsupervised pre-training (Collobert et al., 2011; Mikolov et al., 2013; Le and Mikolov, 2014), and later focused on networks with hierarchical structure.
Kim (2014) proposed a convolutional neural network (CNN) for sentence classification. Johnson and Zhang (2015) proposed a CNN for high-dimensional data classification, while Zhang et al. (2015) adopted a character-level CNN for text classification. Lai et al. (2015) proposed a recurrent CNN to capture sequential information, which outperformed simpler CNNs. Lin et al. (2015) and Tang et al. (2015) proposed hierarchical recurrent neural networks and showed that they were superior to CNN-based models. Yang et al. (2016) demonstrated that a hierarchical attention network with bi-directional gated encoders outperforms previous alternatives. Pappas and Popescu-Belis (2017) adapted such networks to learn hierarchical document structures with shared components across different languages.

The issue of scaling to large label sets has been addressed previously with output layer approximations (Morin and Bengio, 2005) and with the use of sub-word units or character-level modeling (Sennrich et al., 2016; Lee et al., 2017), which is mainly applicable to structured prediction problems. Despite the numerous studies, most of the existing neural text classification models ignore label descriptions and semantics. Moreover, they are based on typical output layer parametrizations which are dependent on the label set size, and thus are not able to scale well to large label sets nor to generalize to unseen labels. Our output layer parametrization addresses these limitations and could potentially improve such models.

5.2 Output Representation Learning

There exist studies which aim to learn output representations directly from data without any semantic grounding to word embeddings (Srikumar and Manning, 2014; Yeh et al., 2018; Augenstein et al., 2018). Such methods have a label-set-size dependent parametrization, which makes them data hungry, less scalable on large label sets and incapable of generalizing to unseen classes. Wang et al. (2018) addressed the lack of semantic grounding to word embeddings by proposing an efficient method based on label-attentive text representations which are helpful for text classification. However, in contrast to our study, their parametrization is still label-set-size dependent and thus their model is not able to scale well to large label sets nor to generalize to unseen labels.

5.3 Zero-shot Text Classification

Several studies have focused on learning joint input-label representations grounded to word semantics for unseen label prediction for images (Weston et al., 2011; Socher et al., 2013; Norouzi et al., 2014; Zhang et al., 2016; Fu et al., 2018), called zero-shot classification. However, there are fewer such studies for text classification. Dauphin et al. (2014) predicted semantic utterances of text by mapping them in the same semantic space with the class labels using an unsupervised learning objective. Yazdani and Henderson (2015) proposed a zero-shot spoken language understanding model based on a bilinear input-label model able to generalize to previously unseen labels. Nam et al. (2016) proposed a bilinear joint document-label embedding which learns shared word representations between documents and labels. More recently, Shu et al. (2017) proposed an approach for open-world classification which aims to identify novel documents during testing, but it is not able to generalize to unseen classes.
Perhaps the most similar model to ours is the one from the recent study by Pappas et al. (2018) on neural machine translation, with the difference that they have single-word label descriptions and they use a label-set-dependent bias in a softmax linear prediction unit, which is designed for structured prediction. Hence, their model can neither handle unseen labels nor multi-label classification, as we do here.

Compared to previous joint input-label models, the proposed model has a more general and flexible parametrization which allows the output layer capacity to be controlled. Moreover, it is not restricted to linear mappings, which have limited expressivity, but uses non-linear mappings, similar to energy-based learning networks (LeCun et al., 2006; Belanger and McCallum, 2016). The link to the latter can be made if we regard $p_{val}^{(ij)}$ in Eq. 11 as an energy function for the $i$-th document and the $j$-th label, the calculation of which uses a simple multiplicative transformation (Eq. 10). Lastly, the proposed model performs well on both seen and unseen label sets by leveraging the binary cross-entropy loss, which is the standard loss for classification problems, instead of a ranking loss.

6 Conclusion

We proposed a novel joint input-label embedding model for neural text classification which generalizes over existing input-label models and addresses their limitations while preserving high performance on both seen and unseen labels. Compared to baseline neural models with a typical output layer, our model is more scalable and has better performance on the seen labels. Compared to previous joint input-label models, it performs significantly better on unseen labels without compromising performance on the seen labels. These improvements can be attributed to the ability of our model to capture complex input-label relationships, to its controllable capacity, and to its training objective which is based on cross-entropy loss.

As future work, the label representation could be learned by a more sophisticated encoder, and the label sampling could benefit from importance sampling to avoid revisiting uninformative labels. Another interesting direction would be to find a more scalable way of increasing the output layer capacity, for instance using a deep rather than wide classification network. Moreover, adapting the proposed model to structured prediction, for instance by using a softmax classification unit instead of a sigmoid one, would benefit tasks such as neural machine translation, language modeling and summarization, in isolation but also when trained jointly with multi-task learning.

Acknowledgments

We are grateful for the support from the European Union through its Horizon 2020 program in the SUMMA project n. 688139, see http://www.summa-project.eu. We would also like to thank our action editor, Eneko Agirre, and the anonymous reviewers for their invaluable suggestions and feedback.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.v2.

Isabelle Augenstein, Sebastian Ruder, and Anders Søgaard. 2018. Multi-task learning of pairwise sequence classification tasks over disparate label spaces. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1896–1906, New Orleans, Louisiana.
David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 983–992, New York, New York, USA. PMLR.

Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. 2015. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Advances in Neural Information Processing Systems 28, pages 1765–1773, Montreal, Canada.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Yann N. Dauphin, Gökhan Tür, Dilek Hakkani-Tür, and Larry P. Heck. 2014. Zero-shot learning and clustering for semantic utterance classification. In International Conference on Learning Representations, Banff, Canada.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, USA.

Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc.

Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. 2018. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112–125.

Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112, Denver, Colorado.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474, Mumbai, India.
Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, pages 334–343, New York City, USA.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2267–2273, Austin, USA.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages 1188–1196, Beijing, China.

Yann LeCun, Sumit Chopra, Raia Hadsell, Fu Jie Huang, et al. 2006. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 899–907, Lisbon, Portugal.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.
Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. 2012. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In Computer Vision – ECCV 2012, pages 488–501, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252.

Khalil Mrini, Nikolaos Pappas, and Andrei Popescu-Belis. 2017. Cross-lingual transfer for news article labeling: Benchmarking statistical and neural models. Idiap Research Report, Idiap-RR-26-2017.

Jinseok Nam, Eneldo Loza Mencía, and Johannes Fürnkranz. 2016. All-in text: Learning document, label, and word representations jointly. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI'16, pages 1948–1954, Phoenix, Arizona.

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. 2014. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, Banff, Canada.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124, Ann Arbor, Michigan.

Nikolaos Pappas, Lesly Miculicich, and James Henderson. 2018. Beyond weight tying: Learning joint input-output embeddings for neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 73–83, Brussels, Belgium. Association for Computational Linguistics.

Nikolaos Pappas and Andrei Popescu-Belis. 2017. Multilingual hierarchical attention networks for document classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1015–1025.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Lei Shu, Hu Xu, and Bing Liu. 2017. DOC: Deep open classification of text documents. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2911–2916, Copenhagen, Denmark. Association for Computational Linguistics.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. 2013. Zero-shot learning through cross-modal transfer. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pages 935–943, Lake Tahoe, Nevada.
Vivek Srikumar and Christopher D. Manning. 2014. Learning distributed representations for structured output prediction. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3266–3274, Cambridge, MA, USA. MIT Press.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal. Association for Computational Linguistics.

Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331. Association for Computational Linguistics.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (Volume 3), pages 2764–2770, Barcelona, Spain.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California.
Majid Yazdani and James Henderson. 2015. A model of zero-shot learning of spoken language understanding. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 244–249, Lisbon, Portugal.

Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. 2018. Learning deep latent spaces for multi-label classification. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657, Montreal, Canada.

Yang Zhang, Boqing Gong, and Mubarak Shah. 2016. Fast zero-shot image tagging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA.