GILE: A Generalized Input-Label Embedding for Text Classification

Nikolaos Pappas, James Henderson
Idiap Research Institute, Martigny 1920, Switzerland
{nikolaos.pappas,james.henderson@idiap.ch}

Abstract

Neural text classification models typically treat output labels as categorical variables which lack description and semantics. This forces their parametrization to be dependent on the label set size, and, hence, they are unable to scale to large label sets and generalize to unseen ones. Existing joint input-label text models overcome these issues by exploiting label descriptions, but they are unable to capture complex label relationships, have rigid parametrization, and their gains on unseen labels often come at the expense of weak performance on the labels seen during training. In this paper, we propose a new input-label model which generalizes over previous such models, addresses their limitations, and does not compromise performance on seen labels. The model consists of a joint non-linear input-label embedding with controllable capacity and a joint-space-dependent classification unit which is trained with cross-entropy loss to optimize classification performance. We evaluate models on full-resource and low- or zero-resource text classification of multilingual news and biomedical text with a large label set. Our model outperforms monolingual and multilingual models which do not leverage label semantics, as well as previous joint input-label space models, in both scenarios.

1 Introduction

Text classification is a fundamental NLP task with numerous real-world applications such as topic recognition (Tang et al., 2015; Yang et al., 2016), sentiment analysis (Pang and Lee, 2005; Yang et al., 2016), and question answering (Chen et al., 2015; Kumar et al., 2015). Classification also appears as a subtask for sequence prediction tasks such as neural machine translation (Cho et al., 2014; Luong et al., 2015) and summarization (Rush et al., 2015). Despite the numerous studies, existing models are trained on a fixed label set using k-hot vectors and, therefore, treat target labels as mere atomic symbols without any particular structure to the space of labels, ignoring potential linguistic knowledge about the words used to describe the output labels. Given that semantic representations of words have been shown to be useful for representing the input, it is reasonable to expect that they are going to be useful for representing the labels as well.

Previous work has leveraged knowledge from the label texts through a joint input-label space, initially for image classification (Weston et al., 2011; Mensink et al., 2012; Frome et al., 2013; Socher et al., 2013). Such models generalize to labels both seen and unseen during training, and scale well to very large label sets. However, as we explain in Section 2, existing input-label models for text (Yazdani and Henderson, 2015; Nam et al., 2016) have the following limitations: (i) their embedding does not capture complex label relationships due to its bilinear form, (ii) their output layer parametrization is rigid because it depends on the dimensionality of the encoded text and labels, and (iii) they are outperformed on seen labels by classification baselines trained with cross-entropy loss (Frome et al., 2013; Socher et al., 2013).

In this paper, we propose a new joint input-label model which generalizes over previous such models, addresses their limitations, and does not compromise performance on seen labels (see Figure 1).
The proposed model is comprised of a joint non-linear input-label embedding with controllable capacity and a joint-space-dependent classification unit which is trained with cross-entropy loss to optimize classification performance.¹ The need for capturing complex label relationships is addressed by two non-linear transformations which have the same target joint space dimensionality. The parametrization of the output layer is not constrained by the dimensionality of the input or label encoding, but is instead flexible, with a capacity which can be easily controlled by choosing the dimensionality of the joint space. Training is performed with cross-entropy loss, which is a suitable surrogate loss for classification problems, as opposed to a ranking loss such as the WARP loss (Weston et al., 2010), which is more suitable for ranking problems.

1 Our code is available at: github.com/idiap/gile

Evaluation is performed on full-resource and low- or zero-resource scenarios of two text classification tasks, namely biomedical semantic indexing (Nam et al., 2016) and multilingual news classification (Pappas and Popescu-Belis, 2017), against several competitive baselines. In both scenarios, we provide a comprehensive ablation analysis which highlights the importance of each model component and the difference with previous embedding formulations when using the same type of architecture and loss function. Our main contributions are the following:

(i) We identify key theoretical and practical limitations of existing joint input-label models.

(ii) We propose a novel joint input-label embedding with flexible parametrization which generalizes over the previous such models and addresses their limitations.

(iii) We provide empirical evidence of the superiority of our model over monolingual and multilingual models which ignore label semantics, and over previous joint input-label models on both seen and unseen labels.

The remainder of this paper is organized as follows. Section 2 provides background knowledge and explains limitations of existing models. Section 3 describes the model components, training and relation to previous formulations. Section 4 describes our evaluation results and analysis, while Section 5 provides an overview of previous work and Section 6 concludes the paper and provides future research directions.

2 Background: Neural Text Classification

We are given a collection $D = \{(x_i, y_i),\ i = 1, \dots, N\}$ made of $N$ documents, where each document $x_i$ is associated with labels $y_i = \{y_{ij} \in \{0,1\} \mid j = 1, \dots, k\}$, and $k$ is the total number of labels. Each document $x_i = \{w_{11}, w_{12}, \dots, w_{K_i T_{K_i}}\}$ is a sequence of words grouped into sentences, with $K_i$ being the number of sentences in document $i$ and $T_j$ being the number of words in sentence $j$. Each label $j$ has a textual description comprised of multiple words, $c_j = \{c_{j1}, c_{j2}, \dots, c_{jL_j} \mid j = 1, \dots, k\}$, with $L_j$ being the number of words in each description.
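To make this notation concrete, here is a minimal Python sketch of one way the data could be laid out; the toy documents, labels, and descriptions are purely illustrative assumptions, not taken from the BioASQ or DW datasets or from the released code.

```python
# One document x_i: a list of sentences, each a list of words.
document = [
    ["the", "parliament", "approved", "the", "budget"],      # sentence 1
    ["opposition", "parties", "criticised", "the", "vote"],  # sentence 2
]

# Its label indicators y_i over k labels (multi-label: several 1s allowed).
labels = [0, 1, 0, 1]  # k = 4 in this toy example

# Textual descriptions c_j for every label, seen or unseen during training.
label_descriptions = [
    ["sports"],
    ["domestic", "politics"],
    ["business", "and", "finance"],
    ["european", "union", "budget"],
]
```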
Given the input texts and their associated labels seen during the training portion of $D$, our goal is to learn a text classifier which is able to predict labels both in the seen, $\mathcal{Y}_s$, or unseen, $\mathcal{Y}_u$, label sets, defined as the sets of unique labels which have or have not been seen during training respectively, and, hence, $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$ and $\mathcal{Y} = \mathcal{Y}_s \cup \mathcal{Y}_u$.²

2 Note that depending on the number of labels per document the problem can be a multi-label or multi-class problem.

2.1 Input Text Representation

To encode the input text, we focus on hierarchical attention networks (HANs), which are competitive for monolingual (Yang et al., 2016) and multilingual text classification (Pappas and Popescu-Belis, 2017). The model takes as input a document $x$ and outputs a document vector $h$. The input words and label words are represented by vectors in $\mathbb{R}^d$ from the same³ embeddings $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, where $\mathcal{V}$ is the vocabulary and $d$ is the embedding dimension; $E$ can be pre-trained or learned jointly with the rest of the model. The model has two levels of abstraction, word and sentence. The word level is made of an encoder network $g_w$ and an attention network $a_w$, while the sentence level similarly includes an encoder and an attention network.

3 This statement holds true for multilingual classification problems too if the embeddings are aligned across languages.

Encoders. The function $g_w$ encodes the sequence of input words $\{w_{it} \mid t = 1, \dots, T_i\}$ for each sentence $i$ of the document, noted as:

$h_w^{(it)} = g_w(w_{it}), \quad t \in [1, T_i],$  (1)

and at the sentence level, after combining the intermediate word vectors $\{h_w^{(it)} \mid t = 1, \dots, T_i\}$ into a sentence vector $s_i \in \mathbb{R}^{d_w}$ (see below), where $d_w$ is the dimension of the word encoder, the function $g_s$ encodes the sequence of sentence vectors $\{s_i \mid i = 1, \dots, K\}$, noted as $h_s^{(i)}$. The $g_w$ and $g_s$ functions can be any feed-forward (DENSE) or recurrent networks, e.g. GRU (Cho et al., 2014).

Attention. The $\alpha_w$ and $\alpha_s$ attention mechanisms, which estimate the importance of each hidden state vector, are used to obtain the sentence representation $s_i$ and the document representation $h$ respectively. The sentence vector is thus calculated as follows:

$s_i = \sum_{t=1}^{T_i} \alpha_w^{(it)} h_w^{(it)} = \sum_{t=1}^{T_i} \frac{\exp(v_{it}^\top u_w)}{\sum_j \exp(v_{ij}^\top u_w)}\, h_w^{(it)},$  (2)

where $v_{it} = f_w(h_w^{(it)})$ is a fully-connected network with $W_w$ parameters. The document vector $h \in \mathbb{R}^{d_h}$, where $d_h$ is the dimension of the sentence encoder, is calculated similarly, by replacing $v_{it}$ with $v_i = f_s(h_s^{(i)})$, which is a fully-connected network with $W_s$ parameters, and $u_w$ with $u_s$; $u_w$ and $u_s$ are parameters of the attention functions.

2.2 Label Text Representation

To encode the label text we use an encoder function which takes as input a label description $c_j$ and outputs a label vector $e_j \in \mathbb{R}^{d_c}$ for all $j = 1, \dots, k$. For efficiency reasons, we use a simple, parameter-free function to compute $e_j$, namely the average of the word vectors which describe label $j$, that is $e_j = \frac{1}{L_j} \sum_{t=1}^{L_j} c_{jt}$, and hence $d_c = d$ in this case. By stacking all these label vectors into a matrix, we obtain the label embedding $\mathbf{E} \in \mathbb{R}^{|\mathcal{Y}| \times d}$. In principle, we could also use the same encoder functions as the ones for the input text, but this would increase the computation significantly; hence, we keep this direction as future work.
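As a minimal illustration of the attention pooling in Eq. 2 and the parameter-free label encoder of Section 2.2, the following PyTorch sketch assumes the hidden states have already been produced by the encoders $g_w$/$g_s$; the function names, dimensions, and the Tanh-based scoring network are our own illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

def attention_pool(H, f, u):
    """Attention pooling as in Eq. 2.
    H: (T, d) hidden states of one sentence (or sentence vectors of a document),
    f: small fully-connected network (f_w or f_s),
    u: (d_a,) attention context vector (u_w or u_s)."""
    v = f(H)                                    # (T, d_a)
    alpha = torch.softmax(v @ u, dim=0)         # attention weights over positions
    return (alpha.unsqueeze(1) * H).sum(dim=0)  # weighted sum, (d,)

def encode_label(description_ids, word_emb):
    """Parameter-free label encoder of Section 2.2: the average of the word
    vectors of the label description, so d_c = d."""
    return word_emb(description_ids).mean(dim=0)

# Example wiring (dimensions are illustrative):
d, d_a = 100, 100
f_w = nn.Sequential(nn.Linear(d, d_a), nn.Tanh())
u_w = torch.randn(d_a)
word_emb = nn.Embedding(5000, d)
sentence_vec = attention_pool(torch.randn(12, d), f_w, u_w)  # one 12-word sentence
label_vec = encode_label(torch.tensor([3, 17]), word_emb)    # a 2-word description
```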
2.3 Output Layer Parametrizations

2.3.1 Typical Linear Unit

The most typical output layer consists of a linear unit with a weight matrix $W \in \mathbb{R}^{d_h \times |\mathcal{Y}|}$ and a bias vector $b \in \mathbb{R}^{|\mathcal{Y}|}$, followed by a softmax or sigmoid activation function. Given the encoder's hidden representation $h$ with dimension size $d_h$, the probability distribution of output $y$ given input $x$ is proportional to the following quantity:

$p(y|x) \propto \exp(W^\top h + b).$  (3)

The parameters in $W$ can be learned separately or be tied with the parameters of the embedding $E$ by setting $W = E^\top$, if the input dimension of $W$ is restricted to be the same as that of the embedding $E$ ($d = d_h$) and each label is represented by a single-word description, i.e. when $\mathcal{Y}$ corresponds to $\mathcal{V}$ and $\mathbf{E} = E$. In the latter case, Eq. 3 becomes:

$p(y|x) \propto \exp(E h + b).$  (4)

Either way, the parameters of such models are typically learned with cross-entropy loss, which is suitable for classification problems. However, in both cases they cannot be applied to labels which are not seen during training, because each label has learned parameters which are specific to that label, so the parameters for unseen labels cannot be learned. We now turn our focus to a class of models which can handle unseen labels.

2.3.2 Bilinear Input-Label Unit

Joint input-output embedding models can generalize from seen to unseen labels because the parameters of the label encoder are shared. The previously proposed joint input-output embedding models by Yazdani and Henderson (2015) and Nam et al. (2016) are based on the following bilinear ranking function $f(\cdot)$:

$f(x, y) = \mathbf{E} W h,$  (5)

where $\mathbf{E} \in \mathbb{R}^{|\mathcal{Y}| \times d}$ is the label embedding and $W \in \mathbb{R}^{d \times d_h}$ is the bilinear embedding. This function allows one to define the rank of a given label $y$ with respect to $x$, and is trained using hinge loss to rank positive labels higher than negative ones. But note that the use of this ranking loss means that these models do not model the conditional probability, as do the traditional models above.

Limitations. Firstly, the above formula can only capture linear relationships between the encoded text ($h$) and the label embedding ($\mathbf{E}$) through $W$. We argue that the relationships between different labels are non-linear, due to the complex interactions of the semantic relations across labels but also between labels and different encoded inputs. A more appropriate form for this purpose would include a non-linear transformation $\sigma(\cdot)$, e.g. with either:

(a) $\underbrace{\sigma(\mathbf{E} W)}_{\text{Label structure}} h$  or  (b) $\mathbf{E}\, \underbrace{\sigma(W h)}_{\text{Input structure}}.$  (6)

Secondly, it is hard to control their output layer capacity due to their bilinear form, which uses a matrix of parameters ($W$) whose size is bounded by the dimensionalities of the label embedding and the text encoding. Thirdly, their loss function optimizes ranking instead of classification performance and thus treats the ground truth as a ranked list, when in reality it consists of one or more independent labels.

Summary. We hypothesize that these are the reasons why these models do not yet perform well on seen labels compared to models which make use of the typical linear unit, and do not take full advantage of the structure of the problem when tested on unseen labels. Ideally, we would like a model which addresses these issues and combines the benefits of both the typical linear unit and the joint input-label models.
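For reference, the bilinear scoring function of Eq. 5 can be sketched as follows; this is our own rendering of the prior formulation, not the released code of Yazdani and Henderson (2015) or Nam et al. (2016). Its only output-layer parameter is the single $d \times d_h$ matrix $W$, which is exactly what makes its capacity hard to control.

```python
import torch
import torch.nn as nn

class BilinearInputLabel(nn.Module):
    """Bilinear joint space of Eq. 5: f(x, y) = E W h, with no non-linearity."""
    def __init__(self, d_label, d_hidden):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_label, d_hidden) * 0.01)

    def forward(self, E, h):
        # E: (k, d_label) label embeddings, h: (d_hidden,) encoded document.
        # Returns one ranking score per label; in the prior work these scores
        # are trained with a ranking loss such as WARP, not with cross-entropy.
        return E @ self.W @ h   # (k,)
```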
3 The Proposed Output Layer Parametrization for Text Classification

We propose a new output layer parametrization for neural text classification which is comprised of a generalized input-label embedding, which captures the structure of the labels, the structure of the encoded texts and the interactions between the two, followed by a classification unit which is independent of the label set size. The resulting model has the following properties: (i) it is able to capture complex output structure, (ii) it has a flexible parametrization which allows its capacity to be controlled, and (iii) it is trained with a classification surrogate loss such as cross-entropy. The model is depicted in Figure 1. In this section, we describe the model in detail, showing how it can be trained efficiently for arbitrarily large label sets and how it is related to previous models.

[Figure 1: Each encoded text and label are projected to a joint input-label multiplicative space, the output of which is processed by a classification unit with label-set-size independent parametrization.]

3.1 A Generalized Input-Label Embedding

Let $g_{in}(h)$ and $g_{out}(e_j)$ be two non-linear projections of the encoded input, i.e. the document $h$, and any encoded label $e_j$, where $e_j$ is the $j$th row vector of the label embedding matrix $\mathbf{E}$, which have the following form:

$e'_j = g_{out}(e_j) = \sigma(e_j U + b_u)$  (7)
$h' = g_{in}(h) = \sigma(V h + b_v),$  (8)

where $\sigma(\cdot)$ is a non-linear activation function such as ReLU or Tanh, the matrix $U \in \mathbb{R}^{d \times d_j}$ and bias $b_u \in \mathbb{R}^{d_j}$ are the linear projection of the labels, and the matrix $V \in \mathbb{R}^{d_j \times d_h}$ and bias $b_v \in \mathbb{R}^{d_j}$ are the linear projection of the encoded input. Note that the projections for $h'$ and $e'_j$ could be high-rank or low-rank depending on their initial dimensions and the target joint space dimension. Also let $\mathbf{E}' \in \mathbb{R}^{|\mathcal{Y}| \times d_j}$ be the matrix resulting from projecting all the outputs $e_j$ to the joint space, i.e. $g_{out}(\mathbf{E})$. The conditional output probability distribution can now be re-written as:

$p(y|x) \propto \exp(\mathbf{E}' h') \propto \exp\big(g_{out}(\mathbf{E})\, g_{in}(h)\big) \propto \exp\big(\underbrace{\sigma(\mathbf{E} U + b_u)}_{\text{Label structure}}\ \underbrace{\sigma(V h + b_v)}_{\text{Input structure}}\big).$  (9)

Crucially, this function has no label-set-size dependent parameters, unlike $W$ and $b$ in Eq. 3. In principle, this parametrization can be used for both multi-class and multi-label problems by defining the exponential in terms of a softmax or a sigmoid function respectively. However, in this paper we will focus on the latter.

3.2 Classification Unit

We require that our classification unit parameters depend only on the joint input-label space above. To represent the compatibility between any encoded input text $h_i$ and any encoded label $e_j$ for this task, we define their joint representation based on multiplicative interactions in the joint space:

$g_{joint}^{(ij)} = g_{in}(h_i) \odot g_{out}(e_j),$  (10)

where $\odot$ is component-wise multiplication. The probability for $h_i$ to belong to one of the $k$ known labels is modeled by a linear unit which maps any point in the joint space into a score which indicates the validity of the combination:

$p_{val}^{(ij)} = g_{joint}^{(ij)} w + b,$  (11)

where $w \in \mathbb{R}^{d_j}$ is a weight vector and $b$ is a scalar bias. We compute the output of this linear unit for each known label which we would like to predict for a given document $i$, namely:

$P_{val}^{(i)} = \begin{bmatrix} p_{val}^{(i1)} \\ p_{val}^{(i2)} \\ \vdots \\ p_{val}^{(ik)} \end{bmatrix} = \begin{bmatrix} g_{joint}^{(i1)} w + b \\ g_{joint}^{(i2)} w + b \\ \vdots \\ g_{joint}^{(ik)} w + b \end{bmatrix}.$  (12)

For each row, the higher the value, the more likely the label is to be assigned to the document. To obtain valid probability estimates and be able to train with binary cross-entropy loss for multi-label classification, we apply a sigmoid function as follows:

$\hat{y}_i = \hat{p}(y_i|x_i) = \frac{1}{1 + e^{-P_{val}^{(i)}}}.$  (13)
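The following PyTorch sketch puts Eqs. 7-13 together in one module; it is a minimal re-implementation for illustration rather than the released code (github.com/idiap/gile), and the choice of ReLU for $\sigma$ as well as the class and variable names are our own assumptions.

```python
import torch
import torch.nn as nn

class GILEOutputLayer(nn.Module):
    """Generalized input-label output layer: two non-linear projections into a
    joint space of size d_j, multiplicative joint representations, and a
    label-set-size independent classification unit (Eqs. 7-13)."""
    def __init__(self, d_label, d_hidden, d_joint):
        super().__init__()
        self.label_proj = nn.Linear(d_label, d_joint)   # U and b_u (Eq. 7)
        self.input_proj = nn.Linear(d_hidden, d_joint)  # V and b_v (Eq. 8)
        self.unit = nn.Linear(d_joint, 1)               # w and b (Eq. 11)

    def forward(self, E, h):
        # E: (k, d_label) encoded label descriptions, h: (batch, d_hidden) documents.
        E_joint = torch.relu(self.label_proj(E))         # Eq. 7  -> (k, d_j)
        h_joint = torch.relu(self.input_proj(h))         # Eq. 8  -> (batch, d_j)
        g = h_joint.unsqueeze(1) * E_joint.unsqueeze(0)  # Eq. 10 -> (batch, k, d_j)
        p_val = self.unit(g).squeeze(-1)                 # Eqs. 11-12 -> (batch, k)
        return torch.sigmoid(p_val)                      # Eq. 13: multi-label probs

# Example: score 1,000 candidate labels for a batch of 4 documents.
layer = GILEOutputLayer(d_label=100, d_hidden=100, d_joint=500)
probs = layer(torch.randn(1000, 100), torch.randn(4, 100))   # (4, 1000)
```

None of the parameters ($U$, $b_u$, $V$, $b_v$, $w$, $b$) depends on the number of labels $k$, so the same module can score any label for which a description embedding is available, including labels unseen during training.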
Summary. By adding the above changes to the general form of Eq. 9, the conditional probability $p(y_i|x_i)$ is now proportional to the following quantity:

$\exp\big(\sigma(\mathbf{E} U + b_u)\, (\sigma(V h + b_v) \odot w) + b\big).$  (14)

Note that the number of parameters in this equation is independent of the size of the label set, given that $U$, $V$, $w$ and $b$ depend only on $d_j$, and $k$ can vary arbitrarily. This allows the model to scale up to large label sets and generalize to unseen labels. Lastly, the proposed output layer addresses all the limitations of the previous models, as follows: (i) it is able to capture complex structure in the joint input-output space, (ii) it provides a means to easily control its capacity $d_j$, and (iii) it is trainable with cross-entropy loss.

3.3 Training Objectives

The training objective for the multi-label classification task is based on binary cross-entropy loss. Assuming $\theta$ contains all the parameters of the model, the training loss is computed as follows:

$\mathcal{L}(\theta) = -\frac{1}{Nk} \sum_{i=1}^{N} \sum_{j=1}^{k} H(y_{ij}, \hat{y}_{ij}),$  (15)

where $H$ is the binary cross-entropy between the gold label $y_{ij}$ and the predicted label $\hat{y}_{ij}$ for a document $i$ and a candidate label $j$.

We handle multiple languages according to Firat et al. (2016) and Pappas and Popescu-Belis (2017). Assuming that $\Theta = \{\theta_1, \theta_2, \dots, \theta_M\}$ are all the parameters required for each of the $M$ languages, we use a joint multilingual objective based on the sum of cross-entropy losses:

$\mathcal{L}(\Theta) = -\frac{1}{Z} \sum_{i}^{N_e} \sum_{l}^{M} \sum_{j=1}^{k} H(y_{ij}^{(l)}, \hat{y}_{ij}^{(l)}),$  (16)

where $Z = N_e M k$, with $N_e$ being the number of examples per epoch. At each iteration, a document-label pair for each language is sampled. In addition, multilingual models share a certain subset of the encoder parameters during training while the output layer parameters are kept language-specific, as described by Pappas and Popescu-Belis (2017). In this paper, we share most of the output layer parameters, namely the ones from the input-label space ($U$, $V$, $b_v$, $b_u$), and we keep only the classification unit parameters ($w$, $b$) language-specific.

3.4 Scaling Up to Large Label Sets

For a very large number $d_j$ of joint-space dimensions in our parametrization, the computational complexity increases prohibitively because our projection requires a large matrix multiplication between $U$ and $\mathbf{E}$, which depends on $|\mathcal{Y}|$. In such cases, we resort to sampling-based training, by adopting the commonly used negative sampling method proposed by Mikolov et al. (2013). Let $x_i \in \mathbb{R}^d$ and $y_{ik} \in \{0,1\}$ be an input-label pair and $\hat{y}_{ik}$ the output probabilities from our model (Eq. 14). By introducing the sets $k_i^p$ and $k_i^n$, which contain the indices of the positive and negative labels respectively for the $i$-th input, the loss $\mathcal{L}(\theta)$ in Eq. 15 can be re-written as follows:

$\mathcal{L}(\theta) = -\frac{1}{Z} \sum_{i=1}^{N} \sum_{j=1}^{k} \big[ y_{ij} \log \hat{y}_{ij} + \bar{y}_{ij} \log(1 - \hat{y}_{ij}) \big] = -\frac{1}{Z} \sum_{i=1}^{N} \Big[ \sum_{j \in k_i^p} \log \hat{y}_{ij} + \sum_{j \in k_i^n} \log(1 - \hat{y}_{ij}) \Big],$  (17)

where $Z = Nk$ and $\bar{y}_{ij}$ is $(1 - y_{ij})$. To reduce the computational cost needed to evaluate $\hat{y}_{ij}$ for all of the negative label set $k_i^n$, we sample $k^*$ labels from the negative label set, each with probability $p = \frac{1}{|k_i^n|}$, to create a reduced set which replaces $k_i^n$ in Eq. 17. This enables training on arbitrarily big label sets without increasing the computation required. By controlling the number of samples we can drastically speed up the training time, as we demonstrate empirically in Section 4.2.2. Exploring more informative sampling methods, e.g. importance sampling, would be an interesting direction of future work.
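As an illustration of this sampling scheme, the sketch below (our own, with hypothetical names such as gile_layer and E_labels) selects the label indices on which the output layer is evaluated: all positive labels plus a fixed number of uniformly sampled negatives.

```python
import torch

def sample_label_indices(y_true, num_neg):
    """Keep all positive labels and sample num_neg negative labels uniformly,
    so that the output layer is evaluated only on this subset (Eq. 17)."""
    pos = (y_true == 1).nonzero(as_tuple=True)[0]
    neg = (y_true == 0).nonzero(as_tuple=True)[0]
    neg = neg[torch.randperm(neg.numel())[:num_neg]]
    idx = torch.cat([pos, neg])
    targets = torch.cat([torch.ones(pos.numel()), torch.zeros(neg.numel())])
    return idx, targets

# Usage with the GILEOutputLayer sketched in Section 3.2, for one document h:
# idx, targets = sample_label_indices(y_true, num_neg=256)
# probs = gile_layer(E_labels[idx], h.unsqueeze(0)).squeeze(0)   # (|idx|,)
# loss = torch.nn.functional.binary_cross_entropy(probs, targets)
```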
3.5 Relation to Previous Parametrizations

The proposed embedding form can be seen as a generalization over the input-label embeddings with a bilinear form, because its degenerate form is equivalent to the bilinear form of Eq. 5. In particular, this can be derived simply if we set one of the two non-linear projection functions in the second line of Eq. 9 to be the identity function, e.g. $g_{out}(\cdot) = I$, set all biases to zero, and make the $\sigma(\cdot)$ activation function linear, as follows:

$\sigma(\mathbf{E} U + b_u)\, \sigma(V h + b_v) = (\mathbf{E} I)(V h) = \mathbf{E} V h,$  (18)

where $V$ by consequence has the same number of dimensions as $W \in \mathbb{R}^{d \times d_h}$ from the bilinear input-label embedding model of Eq. 5.

4 Experiments

The evaluation is performed on large-scale biomedical semantic indexing using the BioASQ dataset, obtained by Nam et al. (2016), and on multilingual news classification using the DW corpus, which consists of eight language datasets obtained by Pappas and Popescu-Belis (2017). The statistics of these datasets are listed in Table 1.

  Dataset    #count (docs)  #words    w̄d    #count (labels)  w̄l
  BioASQ     11,705,534     528,156   214    26,104          35.0
  DW            598,304     884,272   436     5,637           2.3
  – en          112,816     110,971   516     1,385           2.1
  – de          132,709     261,280   424     1,176           1.8
  – es           75,827     130,661   412       843           4.7
  – pt           39,474      58,849   571       396           1.8
  – uk           35,423     105,240   342       288           1.7
  – ru          108,076     123,493   330       916           1.8
  – ar           57,697      58,922   357       435           2.4
  – fa           36,282      34,856   538       198           2.5

Table 1: Dataset statistics: #count is the number of documents, #words is the number of unique words in the vocabulary V, and w̄d and w̄l are the average number of words per document and label respectively.

4.1 Biomedical Text Classification

We evaluate on biomedical text classification to demonstrate that our generalized input-label model scales to very large label sets and performs better than previous joint input-label models in both the seen and unseen label prediction scenarios.

4.1.1 Settings

We follow the exact evaluation protocol, data and settings of Nam et al. (2016), as described below. We use the BioASQ Task 3a dataset, which is a collection of scientific publications in biomedical research. The dataset contains about 12M documents, each labeled with around 11 labels out of 27,455, which are defined according to the Medical Subject Headings (MeSH) hierarchy. The data was minimally pre-processed with tokenization, number replacements (NUM) and rare word replacements (UNK), and split with the provided script by year, so that the training set includes all documents until 2004 and the ones from 2005 to 2015 were kept for the test set; this corresponded to 6,692,815 documents for training and 4,912,719 for testing. For validation, a set of 100,000 documents was randomly sampled from the training set. We report the same ranking-based evaluation metrics as Nam et al. (2016), namely rank loss (RL), average precision (AvgPr) and one-error loss (OneErr).

Our hyper-parameters were selected on validation data based on average precision as follows: 100-dimensional word embeddings, encoder and attention (same dimensions as the baselines), a joint input-label embedding of dimension 500, batch size of 64, a maximum of 300 words per document and 50 words per label, ReLU activation, 0.3% negative label sampling, and optimization with ADAM until convergence. The word embeddings were learned end-to-end on the task.⁴

4 Here, the word embeddings are included in the parameter statistics because they are variables of the network.

The baselines are the joint input-label models from Nam et al. (2016), noted as [N16], namely:
• WSABIE+: This model is an extension of the original WSABIE model by Weston et al. (2011) which, instead of learning a ranking model with fixed document features, jointly learns features for documents and words, and is trained with the WARP ranking loss.

• AiTextML: This model is the one proposed by Nam et al. (2016) with the purpose of jointly learning representations of documents, labels and words, along with a joint input-label space which is trained with the WARP ranking loss.

The scores of the WSABIE+ and AiTextML baselines in Table 2 are the ones reported by Nam et al. (2016). In addition, we report scores of a word-level attention neural network (WAN) with DENSE encoder and attention followed by a sigmoid output layer, trained with binary cross-entropy loss.⁵ Our model replaces WAN's output layer with a generalized input-label embedding layer and its variations, noted GILE-WAN. For comparison, we also compare to bilinear input-label embedding versions of WAN for the model by Yazdani and Henderson (2015), noted as BIL-WAN [YH15], and the one by Nam et al. (2016), noted as BIL-WAN [N16]. Note that the AiTextML parameter space is huge (linear with respect to the number of labels and documents), which makes learning difficult. In contrast, we make sure that our models have far fewer parameters than the baselines (Table 2).

5 In our preliminary experiments, we also trained the neural model with a hinge loss as in WSABIE+ and AiTextML, but it performed similarly to them and much worse than WAN, so we did not further experiment with it.

              Model                  Layer form (output)  Dim   Seen: RL  AvgPr  OneErr   Unseen: RL  AvgPr  OneErr   Params
  [N16]       WSABIE+                E W h_t              100   5.21      36.64  41.72    48.81       0.37   99.94    722.10M
              AiTextML avg           E W h_t              100   3.54      32.78  25.99    52.89       0.39   99.94    724.47M
              AiTextML inf           E W h_t              100   3.54      32.78  25.99    21.62       2.66   98.61    724.47M
  Baselines   WAN                    W^T h_t              –     1.53      42.37  11.23    –           –      –         55.60M
              BIL-WAN [YH15]         σ(EW) W h_t          100   1.21      40.68  17.52    18.72       9.50   93.89     52.85M
              BIL-WAN [N16]          E W h_t              100   1.12      41.91  16.94    16.26      10.55   93.23     52.84M
  Ours        GILE-WAN               σ(EU) σ(V h_t)       500   0.78      44.39  11.60     9.06      12.95   91.90     52.93M
              – constrained d_j      σ(EW) σ(W h_t)       100   1.01      37.71  16.16    10.34      11.21   93.38     52.85M
              – only label (Eq. 6a)  σ(EW) h_t            100   1.06      40.81  13.77     9.77      14.71   90.56     52.84M
              – only input (Eq. 6b)  E σ(W h_t)           100   1.07      39.78  15.67    19.28       7.18   95.91     52.84M

Table 2: Biomedical semantic indexing results computed over labels seen and unseen during training, i.e. the full-resource versus zero-resource settings. Best scores among the competing models are marked in bold.

4.1.2 Results

The results on biomedical semantic indexing on seen and unseen labels are shown in Table 2. We observe that the neural baseline, WAN, outperforms WSABIE+ and AiTextML on the seen labels, namely by +5.73 and +9.59 points in terms of AvgPr respectively. The differences are even more pronounced when considering the ranking loss and one-error metrics. This result is compatible with previous findings that existing joint input-label models are not able to outperform strong supervised baselines on seen labels. However, WAN is not able to generalize at all to unseen labels, hence WSABIE+ and AiTextML have a clear advantage in the zero-resource setting.

In contrast, our generalized input-label model, GILE-WAN, outperforms WAN even on seen labels, where our model has higher average precision by +2.02 points, better ranking loss by +43% and comparable OneErr (−3%).
This gain does not come at the expense of performance on unseen labels. GILE-WAN outperforms the WSABIE+ and AiTextML variants⁶ by a large margin in both cases, e.g. by +7.75 and +11.61 points on seen labels and by +12.58 and +10.29 points in terms of average precision on unseen labels, respectively. Interestingly, our GILE-WAN model also outperforms the two previous bilinear input-label embedding formulations of Yazdani and Henderson (2015) and Nam et al. (2016), namely BIL-WAN [YH15] and BIL-WAN [N16], by +3.71 and +2.48 points on seen labels and +3.45 and +2.39 points on unseen labels, respectively, even when they are trained with the same encoders and loss as ours. These models are not able to outperform the WAN baseline when evaluated on the seen labels, namely they have −1.68 and −0.46 points lower average precision than WAN, but they outperform WSABIE+ and AiTextML on both seen and unseen labels. Overall, the results show a clear advantage of our generalized input-label embedding model against previous models on both seen and unseen labels.

6 Namely, avg when using the average of word vectors and inf when using inferred label vectors to make predictions.

4.1.3 Ablation Analysis

To evaluate the effectiveness of individual components of our model, we performed an ablation study (last three rows in Table 2). Note that when we use only the label or only the input embedding in our generalized input-label formulation, the dimensionality of the joint space is constrained to be the dimensionality of the encoded labels or inputs respectively, that is $d_j = 100$ in our experiments. All three variants of our model outperform the previous embedding formulations of Nam et al. (2016) and Yazdani and Henderson (2015) in all metrics except for AvgPr on seen labels, where they score slightly lower. The decrease in AvgPr for our model variants with $d_j = 100$ compared to the neural baselines could be attributed to the difficulty of learning the parameters of a highly non-linear space with only a few hidden dimensions. Indeed, when we increase the number of dimensions ($d_j = 500$), our full model outperforms them by a large margin. Recall that this increase in capacity is only possible with our full model definition in Eq. 9; none of the other variants allow us to do this without interfering with the original dimensionality of the encoded labels ($\mathbf{E}$) and input ($h_t$). In addition, our model variants with $d_j = 100$ exhibit consistently higher scores than the baselines in terms of most metrics on both seen and unseen labels, which suggests that they are able to capture more complex relationships across labels and between encoded inputs and labels.

Overall, the best performance among our model variants is achieved when using only the label embedding, and, hence, it is the most significant component of our model. Surprisingly, our model with only the label embedding achieves higher performance than our full model on unseen labels, but it is far behind our full model when we consider performance on both seen and unseen labels. When we constrain our full model to have the same dimensionality as the other variants, i.e. $d_j = 100$, it outperforms the one that uses only the input embedding in most metrics and is outperformed by the one that uses only the label embedding.
4.2 Multilingual News Text Classification

We evaluate on multilingual news text classification to demonstrate that our output layer based on the generalized input-label embedding outperforms previous models with a typical output layer in a wide variety of settings, even for labels which have been seen during training.

4.2.1 Settings

We follow the exact evaluation protocol, data and settings of Pappas and Popescu-Belis (2017), as described below. The dataset is split per language into 80% for training, 10% for validation and 10% for testing. We evaluate on both types of labels (general, Yg, and specific, Ys) in a full-resource scenario, and we evaluate only on the general labels (Yg) in a low-resource scenario. Accuracy is measured with micro-averaged F1 percentage scores. The word embeddings for this task are the aligned pre-trained 40-dimensional multi-CCA multilingual word embeddings by Ammar et al. (2016) and are kept fixed during training.⁷ The sentences are already truncated at a length of 30 words and the documents at a length of 30 sentences. The hyper-parameters were selected on validation data as follows: 100-dimensional encoder and attention, ReLU activation, batch size of 16, epoch size of 25k, no negative sampling (all labels are used) and optimization with ADAM until convergence.

7 The word embeddings are not included in the parameter statistics because they are not variables of the network.

To ensure equal capacity to the baselines, we use approximately the same total number of parameters $n_{tot}$ as the baseline classification layers, by setting:

$d_j \simeq \frac{d_h \cdot |k^{(i)}|}{d_h + d}, \quad i = 1, \dots, M,$  (19)

in the monolingual case, and similarly, $d_j \simeq \big(d_h \cdot \sum_{i=1}^{M} |k^{(i)}|\big) / (d_h + d)$ in the multilingual case, where $k^{(i)}$ is the number of labels in language $i$.

The hierarchical models have DENSE encoders in all scenarios (Tables 3, 6, and 7), except for the varying-encoder experiment (Table 4). For the low-resource scenario, the levels of data availability are: tiny from 0.1% to 0.5%, small from 1% to 5%, and medium from 10% to 50% of the original training set. For each level, the average F1 across discrete increments of 0.1, 1 and 10 respectively is reported. The decision thresholds, which were tuned on validation data by Pappas and Popescu-Belis (2017), are set as follows: for the full-resource scenario the threshold is 0.4 for |Ys| < 400 and 0.2 for |Ys| ≥ 400, and for the low-resource scenario it is 0.3 for all sets.
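As a quick worked example of the capacity-matching rule in Eq. 19 (our own illustration; $d_h = 100$ and $d = 40$ follow the settings above, while the label count is taken from the English row of Table 1 purely to show the arithmetic):

```python
# Worked example of Eq. 19 (illustrative only).
d_h = 100    # encoder dimension (settings above)
d = 40       # multilingual word embedding dimension (settings above)
k = 1385     # e.g. the English label count from Table 1
d_j = round(d_h * k / (d_h + d))
print(d_j)   # -> 989: a joint-space size whose parameter count (d_j * (d + d_h))
             #    roughly matches a standard d_h x k classification layer
```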
The baselines are all the monolingual and multilingual neural networks from Pappas and Popescu-Belis (2017)⁸, noted as [PB17], namely:

8 For reference, in Table 4 we also compare to a logistic regression trained with unigrams over the full vocabulary and over the top-10% most frequent words by Mrini et al. (2017), noted as [M17], which use the same settings and data.

• NN: A neural network which feeds the average vector of the input words directly to a classification layer, as the one used by Klementiev et al. (2012).

• HNN: A hierarchical network with encoders and average pooling at every level, followed by a classification layer, as the one used by Tang et al. (2015).

• HAN: A hierarchical network with encoders and attention, followed by a classification layer, as the one used by Yang et al. (2016).

• MHAN: Three multilingual hierarchical networks with shared encoders, noted MHAN-Enc, shared attention, noted MHAN-Att, and shared attention and encoders, noted MHAN-Both, as the ones used by Pappas and Popescu-Belis (2017).

  Yg                Languages (en + aux → en)                 Languages (en + aux → aux)               Stat.
  Model             de   es   pt   uk   ru   ar   fa    |    de   es   pt   uk   ru   ar   fa    |   avg
  [PB17, mono]
  NN (Avg)          50.7  .    .    .    .    .    .    |    53.1 70.0 57.2 80.9 59.3 64.4 66.6  |   57.6
  HNN (Avg)         70.0  .    .    .    .    .    .    |    67.9 82.5 70.5 86.8 77.4 79.0 76.6  |   73.6
  HAN (Att)         71.2  .    .    .    .    .    .    |    71.8 82.8 71.3 85.3 79.8 80.5 76.6  |   74.7
  [PB17, multi]
  MHAN-Enc          71.0 69.9 69.2 70.8 71.5 70.0 71.3  |    69.7 82.9 69.7 86.8 80.3 79.0 76.0  |   74.1
  MHAN-Att          74.0 74.2 74.1 72.9 73.9 73.8 73.3  |    72.5 82.5 70.8 87.7 80.5 82.1 76.3  |   76.3
  MHAN-Both         72.8 71.2 70.5 65.6 71.1 68.9 69.2  |    70.4 82.8 71.6 87.5 80.8 79.1 77.1  |   74.2
  [Ours, mono]
  GILE-NN (Avg)     60.1  .    .    .    .    .    .    |    60.3 76.6 62.1 82.0 65.7 77.4 68.6  |   65.2
  GILE-HNN (Avg)    74.8  .    .    .    .    .    .    |    71.3 83.3 72.6 88.3 81.5 81.9 77.1  |   77.1
  GILE-HAN (Att)    76.5  .    .    .    .    .    .    |    74.2 83.4 71.9 86.1 82.7 81.0 77.2  |   78.0
  [Ours, multi]
  GILE-MHAN-Enc     75.1 74.0 72.7 70.7 74.4 73.5 73.2  |    72.7 83.4 73.0 88.7 82.8 83.3 77.4  |   76.7
  GILE-MHAN-Att     76.5 76.5 76.3 75.3 76.1 75.6 75.2  |    74.5 83.5 72.7 88.0 83.4 82.1 76.7  |   78.0
  GILE-MHAN-Both    75.3 73.7 72.1 67.2 72.5 73.8 69.7  |    72.6 84.0 73.5 89.0 81.9 82.0 77.7  |   76.0

  Ys                Languages (en + aux → en)                 Languages (en + aux → aux)               Stat.
  Model             de   es   pt   uk   ru   ar   fa    |    de   es   pt   uk   ru   ar   fa    |   avg
  [PB17, mono]
  NN (Avg)          24.4  .    .    .    .    .    .    |    21.8 22.1 24.3 33.0 26.0 24.1 32.1  |   25.3
  HNN (Avg)         39.3  .    .    .    .    .    .    |    39.6 37.9 33.6 42.2 39.3 34.6 43.1  |   38.9
  HAN (Att)         43.4  .    .    .    .    .    .    |    44.8 46.3 41.9 46.4 45.8 41.2 49.4  |   44.2
  [PB17, multi]
  MHAN-Enc          45.4 45.9 44.3 41.1 42.1 44.9 41.0  |    43.9 46.2 39.3 47.4 45.0 37.9 48.6  |   43.8
  MHAN-Att          46.3 46.0 45.9 45.6 46.4 46.4 46.1  |    46.5 46.7 43.3 47.9 45.8 41.3 48.0  |   45.8
  MHAN-Both         45.7 45.6 41.5 41.2 45.6 44.6 43.0  |    45.9 46.4 40.3 46.3 46.1 40.7 50.3  |   44.5
  [Ours, mono]
  GILE-NN (Avg)     27.5  .    .    .    .    .    .    |    27.5 28.4 29.2 36.8 31.6 32.1 35.6  |   29.5
  GILE-HNN (Avg)    43.1  .    .    .    .    .    .    |    43.4 42.0 37.7 43.0 42.9 36.6 44.1  |   42.2
  GILE-HAN (Att)    45.9  .    .    .    .    .    .    |    47.3 47.4 42.6 46.6 46.9 41.9 48.6  |   45.9
  [Ours, multi]
  GILE-MHAN-Enc     46.0 46.6 41.2 42.5 46.4 43.4 41.8  |    47.2 47.7 41.5 49.5 46.6 41.4 50.7  |   45.1
  GILE-MHAN-Att     47.3 47.0 45.8 45.5 46.2 46.5 45.5  |    47.6 47.9 43.5 49.1 46.5 42.2 50.3  |   46.5
  GILE-MHAN-Both    47.0 46.7 42.8 42.0 45.6 42.8 39.3  |    48.0 47.6 43.1 48.5 46.0 42.1 49.0  |   45.0

Table 3: Full-resource classification results on general (upper half, Yg) and specific (lower half, Ys) labels using monolingual and bilingual models with DENSE encoders on English as target (left) and the auxiliary language as target (right). The average bilingual F1-score (%) is noted avg and the top ones per block are underlined. The monolingual scores on the left come from a single model, hence a single score is repeated multiple times; the repetition is marked with consecutive dots.
To ensure a controlled comparison to the above baselines, for each model we evaluate a version where the output layer is replaced by our generalized input-label embedding output layer using the same number of parameters; these have the abbreviation "GILE" prepended to their name (e.g. GILE-HAN). The scores of the HAN and MHAN models in Tables 3, 6 and 7 are the ones reported by Pappas and Popescu-Belis (2017), while for Table 4 we train them ourselves using their code. Lastly, the best score for each pairwise comparison between a joint input-label model and its counterpart is marked in bold.

4.2.2 Results

Table 3 displays the results of full-resource document classification using DENSE encoders for both general and specific labels. On the left, we display the performance of models on the English sub-corpus when English and an auxiliary language are used for training, and on the right, the performance on the auxiliary language sub-corpus when that language and English are used for training. The results show that in 98% of comparisons on general labels (top half of Table 3) the joint input-label models improve consistently over the corresponding models using a typical sigmoid classification layer. This finding validates our main hypothesis that the joint input-label models successfully exploit the semantics of the labels, which provide useful cues for classification, as opposed to models which are agnostic to label semantics. The results for specific labels (bottom half of Table 3) demonstrate the same trend, with the joint input-label models performing better in 87% of comparisons.

          Model             en    de    es    pt    uk    ru    ar    fa    nl     fl
  [M17]   LogReg-BOW        75.8  72.9  81.4  74.3  91.0  79.2  82.0  77.0  26M    79.19
          LogReg-BOW-10%    74.7  70.1  80.6  71.1  89.5  76.5  80.8  75.5  5M     77.35
  [PB17]  HAN-BIGRU         76.3  74.1  84.5  72.9  87.7  82.9  81.7  75.3  377K   79.42
          HAN-GRU           77.1  72.5  84.0  70.8  86.6  83.0  82.9  76.0  138K   79.11
          HAN-DENSE         71.2  71.8  82.8  71.3  85.3  79.8  80.5  76.6  50K    77.41
  Ours    GILE-HAN-BIGRU    78.1  73.6  84.9  72.5  89.0  82.4  82.5  75.8  377K   79.85
          GILE-HAN-GRU      77.1  72.6  84.7  72.4  88.6  83.6  83.4  76.0  138K   79.80
          GILE-HAN-DENSE    76.5  74.2  83.4  71.9  86.1  82.7  82.6  77.2  50K    79.12

Table 4: Full-resource classification results on general (Yg) topic labels with DENSE and GRU encoders. Reported are also the average number of parameters per language (nl) and the average F1 per language (fl).

In Table 5, we also directly compare our embedding to previous bilinear input-label embedding formulations when using the best monolingual configuration (HAN) from Table 3, exactly as done in Section 4.1. The results on the general labels show that GILE outperforms the previous bilinear input-label models, BIL [YH15] and BIL [N16], by +1.62 and +3.3 percentage points on average respectively. This difference is much more pronounced on the specific labels, where the label set is much larger, namely +6.5 and +13.5 percentage points respectively. Similarly, our model with constrained dimensionality is also as good as or better on average than the bilinear input-label models, by +0.9 and +2.2 on general labels and by -0.5 and +6.1 on specific labels respectively, which highlights the importance of learning non-linear relationships across encoded labels and documents. Among our ablated model variants, as in the previous section, the best is the one with only the label projection, but it is still worse than our full model by 5.2 percentage points. The improvements of GILE over each baseline are significant and consistent on both datasets. Hence, in the following experiments we will only consider the best of these alternatives.
The best bilingual performance on average is that of the GILE-MHAN-Att model, for both general and specific labels. This improvement can be attributed to the effective sharing of label semantics across languages through the joint multilingual input-label output layer. Effectively, this model has the same multilingual sharing scheme as the best model reported by Pappas and Popescu-Belis (2017), MHAN-Att, namely sharing attention at each level of the hierarchy, which agrees well with their main finding. Interestingly, the improvement holds when using different types of hierarchical encoders, namely DENSE, GRU and biGRU, as shown in Table 4, which demonstrates the generality of the approach. In addition, our best models outperform logistic regression trained either on the top-10% most frequent words or on the full vocabulary, even though our models utilize many fewer parameters, namely 377K/138K vs. 26M/5M. Increasing the capacity of our models should lead to even further improvements.

  Yg: HAN output layer   en    de    es    pt    uk    ru    ar    fa
  Linear [PB17]          71.2  71.8  82.8  71.3  85.3  79.8  80.5  76.6
  BIL [YH15]             71.7  70.5  82.0  71.1  86.6  80.6  80.4  76.0
  BIL [N16]              69.8  69.1  80.9  67.4  87.5  79.9  78.4  75.1
  GILE (Ours)            76.5  74.2  83.4  71.9  86.1  82.7  82.6  77.2
  – constrained d_j      73.6  73.1  83.3  71.0  87.1  81.6  80.4  76.4
  – only label           71.4  69.6  82.1  70.3  86.2  80.6  81.1  76.2
  – only input           55.1  54.2  80.6  66.5  85.6  60.8  78.9  74.0

  Ys: HAN output layer   en    de    es    pt    uk    ru    ar    fa
  Linear [PB17]          43.4  44.8  46.3  41.9  46.4  45.8  41.2  49.4
  BIL [YH15]             40.7  37.8  38.1  33.5  44.6  38.1  39.1  42.6
  BIL [N16]              34.4  30.2  34.4  33.6  31.4  22.8  35.6  38.9
  GILE (Ours)            45.9  47.3  47.4  42.6  46.6  46.9  41.9  48.6
  – constrained d_j      38.5  38.0  36.8  35.1  42.1  36.1  36.7  48.7
  – only label           38.4  41.5  42.9  38.3  44.0  39.3  37.2  43.4
  – only input           12.1  10.8   8.8  20.5  11.8   7.8  12.0  24.6

Table 5: Direct comparison with previous bilinear input-label models, namely BIL [YH15] and BIL [N16], and with our ablated model variants, using the best monolingual configuration (HAN) from Table 3 on both general (upper half) and specific (lower half) labels. Best scores among the competing models are marked in bold.

Multilingual learning. So far, we have shown that the proposed joint input-label models outperform typical neural models when training with one and two languages. Does the improvement remain when increasing the number of languages even more? To answer this question we report in Table 6 the average F1-score per language for the best baselines from the previous experiment (HAN and MHAN-Att) and the proposed joint input-label versions of them (GILE-HAN and GILE-MHAN-Att) when increasing the number of languages (1, 2 and 8) used for training. Overall, we observe that the joint input-label models outperform all the baselines independently of the number of languages involved in the training, while having the same number of parameters. We also replicate the previous result that a second language helps, but beyond that there is no improvement.

          Model        #lang.   General labels: nl   fl      Specific labels: nl   fl
  [PB17]  HAN          1        50K                  77.41   90K                   44.90
          MHAN         2        40K                  78.30   80K                   45.72
          MHAN         8        32K                  77.91   72K                   45.82
  Ours    GILE-HAN     1        50K                  79.12   90K                   45.90
          GILE-MHAN    2        40K                  79.68   80K                   46.49
          GILE-MHAN    8        32K                  79.48   72K                   46.32

Table 6: Multilingual learning results. The columns are the average number of parameters per language (nl) and the average F1 per language (fl).

Low-resource transfer. We investigate here whether joint input-label models are useful for low-resource languages.
Table 7 shows the low-resource classification results from English to seven other languages when varying the amount of their training data. Our model with both shared encoders and attention, GILE-MHAN, outperforms the previous models on average, namely HAN (Yang et al., 2016) and MHAN (Pappas and Popescu-Belis, 2017), for low-resource classification in the majority of the cases.

  Transfer   Level       HAN [PB17]   MHAN [PB17]   GILE-MHAN (Ours)
  en → de    0.1-0.5%    29.9         39.4          42.9
             1-5%        51.3         52.6          51.6
             10-50%      63.5         63.8          65.9
  en → es    0.1-0.5%    39.5         41.5          39.0
             1-5%        45.6         50.1          50.9
             10-50%      74.2         75.2          76.4
  en → pt    0.1-0.5%    30.9         33.8          39.6
             1-5%        44.6         47.3          48.9
             10-50%      60.9         62.1          62.3
  en → uk    0.1-0.5%    60.4         60.9          61.1
             1-5%        68.2         69.0          69.4
             10-50%      76.4         76.7          76.5
  en → ru    0.1-0.5%    27.6         29.1          27.9
             1-5%        39.3         40.2          40.2
             10-50%      69.2         69.4          70.4
  en → ar    0.1-0.5%    35.4         36.6          46.1
             1-5%        45.6         46.6          49.5
             10-50%      48.9         47.8          61.8
  en → fa    0.1-0.5%    36.0         41.3          42.5
             1-5%        55.0         55.5          55.4
             10-50%      69.2         70.0          69.7

Table 7: Low-resource classification results with various sizes of training data using the general labels.

The shared input-label space appears to be helpful especially when transferring from English to German, Portuguese and Arabic. GILE-MHAN is significantly behind MHAN when transferring knowledge from English to Spanish and to Russian in the 0.1-0.5% resource setting, but in the rest of the cases they have very similar scores.

Label sampling. To speed up computation, it is possible to train our model by sampling labels instead of training over the whole label set. How much speed-up can we achieve with this label sampling approach while still retaining good levels of performance? In Figure 2, we attempt to answer this question by reporting the performance of our GILE-HNN model when varying the percentage of labels that it uses for training over the English general and specific labels of the DW dataset. In both cases, the performance of GILE-HNN tends to increase as the percentage of labels sampled increases, but it levels off for the higher percentages. For general labels, top performance is reached with a 40% to 50% sampling rate, which translates to a 22% to 18% speedup, while for the specific labels it is reached with a 60% to 70% sampling rate, which translates to a 40% to 36% speedup.

[Figure 2: Varying sampling percentage for general and specific English labels. (Top) GILE-HNN is compared against HNN in terms of F1 (%). (Bottom) The runtime speedup over GILE-HNN trained on the full label set.]

The speedup is correlated with the size of the label set, since there are many fewer general labels than specific labels, namely 327 vs. 1,058 here. Hence, we expect even higher speedups for bigger label sets. Interestingly, GILE-HNN with label sampling reaches the performance of the baseline with a 25% and 60% sample for general and specific labels respectively. This translates to a speedup of 30% and 50% respectively compared to a GILE-HNN trained over all labels. Overall, these results show that our model is effective and that it can also scale to large label sets. The label sampling should also be useful in tasks where the computation resources are limited or budgeted.

5 Related Work

5.1 Neural Text Classification

Research in neural text classification was initially based on feed-forward networks, which required unsupervised pre-training (Collobert et al., 2011; Mikolov et al., 2013; Le and Mikolov, 2014), and later focused on networks with hierarchical structure.
Kim (2014) proposed a convolutional neural network (CNN) for sentence classification. Johnson and Zhang (2015) proposed a CNN for high-dimensional data classification, while Zhang et al. (2015) adopted a character-level CNN for text classification. Lai et al. (2015) proposed a recurrent CNN to capture sequential information, which outperformed simpler CNNs. Lin et al. (2015) and Tang et al. (2015) proposed hierarchical recurrent neural networks and showed that they were superior to CNN-based models. Yang et al. (2016) demonstrated that a hierarchical attention network with bi-directional gated encoders outperforms previous alternatives. Pappas and Popescu-Belis (2017) adapted such networks to learn hierarchical document structures with shared components across different languages.

The issue of scaling to large label sets has been addressed previously with output layer approximations (Morin and Bengio, 2005) and with the use of sub-word units or character-level modeling (Sennrich et al., 2016; Lee et al., 2017), which is mainly applicable to structured prediction problems. Despite the numerous studies, most of the existing neural text classification models ignore label descriptions and semantics. Moreover, they are based on typical output layer parametrizations which are dependent on the label set size, and thus are not able to scale well to large label sets nor to generalize to unseen labels. Our output layer parametrization addresses these limitations and could potentially improve such models.

5.2 Output Representation Learning

There exist studies which aim to learn output representations directly from data without any semantic grounding to word embeddings (Srikumar and Manning, 2014; Yeh et al., 2018; Augenstein et al., 2018). Such methods have a label-set-size dependent parametrization, which makes them data hungry, less scalable on large label sets and incapable of generalizing to unseen classes. Wang et al. (2018) addressed the lack of semantic grounding to word embeddings by proposing an efficient method based on label-attentive text representations which are helpful for text classification. However, in contrast to our study, their parametrization is still label-set-size dependent and thus their model is not able to scale well to large label sets nor to generalize to unseen labels.

5.3 Zero-shot Text Classification

Several studies have focused on learning joint input-label representations grounded to word semantics for unseen label prediction for images (Weston et al., 2011; Socher et al., 2013; Norouzi et al., 2014; Zhang et al., 2016; Fu et al., 2018), called zero-shot classification. However, there are fewer such studies for text classification. Dauphin et al. (2014) predicted semantic utterances of text by mapping them in the same semantic space with the class labels using an unsupervised learning objective. Yazdani and Henderson (2015) proposed a zero-shot spoken language understanding model based on a bilinear input-label model able to generalize to previously unseen labels. Nam et al. (2016) proposed a bilinear joint document-label embedding which learns shared word representations between documents and labels. More recently, Shu et al. (2017) proposed an approach for open-world classification which aims to identify novel documents during testing, but it is not able to generalize to unseen classes.
Perhaps the most similar model to ours is the one from the recent study by Pappas et al. (2018) on neural machine translation, with the difference that they have single-word label descriptions and they use a label-set-dependent bias in a softmax linear prediction unit, which is designed for structured prediction. Hence, their model can neither handle unseen labels nor multi-label classification, as we do here.

Compared to previous joint input-label models, the proposed model has a more general and flexible parametrization which allows the output layer capacity to be controlled. Moreover, it is not restricted to linear mappings, which have limited expressivity, but uses non-linear mappings, similar to energy-based learning networks (LeCun et al., 2006; Belanger and McCallum, 2016). The link to the latter can be made if we regard $p_{val}^{(ij)}$ in Eq. 11 as an energy function for the $i$-th document and the $j$-th label, the calculation of which uses a simple multiplicative transformation (Eq. 10). Lastly, the proposed model performs well on both seen and unseen label sets by leveraging the binary cross-entropy loss, which is the standard loss for classification problems, instead of a ranking loss.

6 Conclusion

We proposed a novel joint input-label embedding model for neural text classification which generalizes over existing input-label models and addresses their limitations while preserving high performance on both seen and unseen labels. Compared to baseline neural models with a typical output layer, our model is more scalable and has better performance on the seen labels. Compared to previous joint input-label models, it performs significantly better on unseen labels without compromising performance on the seen labels. These improvements can be attributed to the ability of our model to capture complex input-label relationships, to its controllable capacity, and to its training objective which is based on cross-entropy loss.

As future work, the label representation could be learned by a more sophisticated encoder, and the label sampling could benefit from importance sampling to avoid revisiting uninformative labels. Another interesting direction would be to find a more scalable way of increasing the output layer capacity, for instance using a deep rather than wide classification network. Moreover, adapting the proposed model to structured prediction, for instance by using a softmax classification unit instead of a sigmoid one, would benefit tasks such as neural machine translation, language modeling and summarization, in isolation but also when trained jointly with multi-task learning.

Acknowledgments

We are grateful for the support from the European Union through its Horizon 2020 program in the SUMMA project n. 688139, see http://www.summa-project.eu. We would also like to thank our action editor, Eneko Agirre, and the anonymous reviewers for their invaluable suggestions and feedback.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.v2.

Isabelle Augenstein, Sebastian Ruder, and Anders Søgaard. 2018. Multi-task learning of pairwise sequence classification tasks over disparate label spaces. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1896–1906, New Orleans, Louisiana.
David Belanger and Andrew McCallum. 2016. Structured prediction energy networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 983–992, New York, New York, USA. PMLR.

Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. 2015. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Advances in Neural Information Processing Systems 28, pages 1765–1773, Montreal, Canada.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Yann N. Dauphin, Gökhan Tür, Dilek Hakkani-Tür, and Larry P. Heck. 2014. Zero-shot learning and clustering for semantic utterance classification. In International Conference on Learning Representations, Banff, Canada.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, USA.

Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc.

Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. 2018. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112–125.

Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112, Denver, Colorado.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474, Mumbai, India.
Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, pages 334–343, New York City, USA.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2267–2273, Austin, USA.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages 1188–1196, Beijing, China.

Yann LeCun, Sumit Chopra, Raia Hadsell, Fu Jie Huang, et al. 2006. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 899–907, Lisbon, Portugal.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.
Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. 2012. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In Computer Vision – ECCV 2012, pages 488–501, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252.

Khalil Mrini, Nikolaos Pappas, and Andrei Popescu-Belis. 2017. Cross-lingual transfer for news article labeling: Benchmarking statistical and neural models. Idiap Research Report, Idiap-RR-26-2017.

Jinseok Nam, Eneldo Loza Mencía, and Johannes Fürnkranz. 2016. All-in text: Learning document, label, and word representations jointly. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI'16, pages 1948–1954, Phoenix, Arizona.

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. 2014. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, Banff, Canada.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124, Ann Arbor, Michigan.

Nikolaos Pappas, Lesly Miculicich, and James Henderson. 2018. Beyond weight tying: Learning joint input-output embeddings for neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 73–83, Brussels, Belgium. Association for Computational Linguistics.

Nikolaos Pappas and Andrei Popescu-Belis. 2017. Multilingual hierarchical attention networks for document classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1015–1025.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Lei Shu, Hu Xu, and Bing Liu. 2017. DOC: Deep open classification of text documents. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2911–2916, Copenhagen, Denmark. Association for Computational Linguistics.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. 2013. Zero-shot learning through cross-modal transfer. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pages 935–943, Lake Tahoe, Nevada.
Vivek Srikumar and Christopher D. Manning. 2014. Learning distributed representations for structured output prediction. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3266–3274, Cambridge, MA, USA. MIT Press.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal. Association for Computational Linguistics.

Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331. Association for Computational Linguistics.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (Volume 3), pages 2764–2770, Barcelona, Spain.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California.
Majid Yazdani and James Henderson. 2015. A model of zero-shot learning of spoken language understanding. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 244–249, Lisbon, Portugal.

Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. 2018. Learning deep latent spaces for multi-label classification. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657, Montreal, Canada.

Yang Zhang, Boqing Gong, and Mubarak Shah. 2016. Fast zero-shot image tagging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA.