Title: Disentangling Semantic-to-visual Confusion for Zero-shot Learning. Authors: Zihan Ye, Fuyuan Hu, Fan Lyu, Linyan Li, Kaizhu Huang. Date: 2021-06-16. DOI: 10.1109/tmm.2021.3089017.

Using generative models to synthesize visual features from the semantic distribution is one of the most popular solutions to ZSL image classification in recent years. The triplet loss (TL) is widely used to generate realistic visual distributions from semantics by automatically searching discriminative representations. However, the traditional TL cannot search reliable unseen disentangled representations due to the unavailability of unseen classes in ZSL. To alleviate this drawback, we propose in this work a multi-modal triplet loss (MMTL) that utilizes multi-modal information to search a disentangled representation space. As such, all classes can interplay, which benefits learning disentangled class representations in the searched space. Furthermore, we develop a novel model called Disentangling Class Representation Generative Adversarial Network (DCR-GAN), which focuses on exploiting the disentangled representations in the training, feature synthesis, and final recognition stages. Benefiting from the disentangled representations, DCR-GAN can fit a more realistic distribution over both seen and unseen features. Extensive experiments show that our proposed model achieves superior performance over the state of the art on four benchmark datasets. Our code is available at https://github.com/FouriYe/DCRGAN-TMM.

Classical pattern recognition classifies images into categories seen only in the training stage [1], [2], [3]. In contrast, zero-shot learning (ZSL), one of the most active research topics in multimedia, aims at exploring unseen categories, and has recently drawn much attention [4], [5], [6], [7], [8], [9], [10], [11], [12]. Furthermore, Chao et al. propose generalized zero-shot learning (GZSL) [13] for a more practical scenario. Different from ZSL, GZSL intends to recognize both seen and unseen classes at test time. Since ZSL/GZSL does not require a vast amount of new data, ZSL models could be utilized as an imitative solution in crucial and lifesaving situations, e.g. the current COVID-19 literature search [14] and autonomous driving planning [15], [16]. To conduct zero-shot classification, researchers usually engage intermediate semantic features to bridge the gap from seen to unseen classes. Intermediate semantic features have many alternatives, including attribute annotations [7], text representations from online text corpora [9], and even gaze embeddings [17]. Based on these semantic features, researchers have explored two dominant types of ZSL methods, i.e. embedding methods and generative methods. Embedding methods learn a projection from single-modal features to another modal space for similarity measurement [18], [4], [10]. In contrast, generative methods focus on learning realistic unseen visual distributions from semantic features. They take advantage of the expressive power of generative adversarial networks (GANs) [19], [20] to generate plausible visual features for unseen classes [9], [21], [22], [23], [24]. In this way, ZSL can be converted into a conventional classification problem. Modeling the connection between semantic and visual relationships is the key to most ZSL/GZSL methods.
Recently, researchers have focused on how to manually define constraints on this connection. For example, LsrGAN [25] claims that synthesized visual features of different classes should exhibit a relationship similar to that of their semantic features, and thus proposes to utilize the semantic relationship to guide visual feature synthesis. However, semantic features themselves might be too ambiguous to be classified. Our previous work, SRGAN [22], investigates the visual relationships among different classes and argues that they can be used to rectify the over-smoothed semantics of some classes, and then synthesizes visual features from the rectified semantic features. Other researchers also construct a re-representation space to align visual and semantic features simultaneously [12]. These methods all focus on using single-modal information, either semantics or vision. Obviously, single-modal information can be incomplete for classification. Using one modality to constrain the other is imperfect and would cause semantic-to-visual confusion. Thus, we pay attention to two important questions: (1) how to find more disentangled/separable class representations by utilizing both semantic and visual information, and (2) consequently, how to make the best use of the disentangled representations for visual feature generation and recognition. Our idea is illustrated in Fig. 2. To answer the first question, we notice that recent studies have designed a series of methods for automatic discriminative representation search [22], [11], [25], [12], [10], in which the triplet loss is often used [26], [27], [28], [24], [10]. For example, Latent Discriminative Features Learning (LDF) [10] recognizes unseen samples in the semantic and latent semantic spaces, the latter being searched by the triplet loss (TL). TL is usually applied in the same fashion among these methods, i.e. they train a metric network (MN) and search a representation space from seen visual features by regulating both inter-class and intra-class distances. A well-designed MN would minimize the margin among intra-class samples and maximize the margin among inter-class samples. As a hypothetical consequence, these works assume that unseen classes would also form disentangled representations in the searched space. In this paper, however, we find that TL may lead to a serious problem in ZSL/GZSL due to the inherent nature of ZSL, i.e. the unavailability of unseen visual features. In particular, neither the visual feature extraction model nor the MN can access unseen features in ZSL. Since the feature extraction model is not trained for unseen classes, the extracted visual distributions of different unseen classes may overlap, so that unseen features become entangled. Even if the MN can search discriminative seen class representations from seen visual features well, TL training may be highly fragile to out-of-training-distribution features, similar to other margin-based losses [29], [30]. As a consequence, the MN would produce non-separable and entangled unseen representations due to the under-fitting of unseen visual features, as shown in Fig. 1 (a). We therefore term this the entangled unseen visual features problem. This problem prevents ZSL models from achieving the original purpose of using the triplet loss, i.e. minimizing the distance among samples of the same class and maximizing the distance among samples of different classes.
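For reference, the single-modal triplet objective that these ZSL methods inherit can be written as below (a standard formulation in our own notation: M is the metric network, x_a, x_p, x_n are anchor, positive, and negative seen visual features, and α is the margin):

```latex
\mathcal{L}_{TL} = \max\Big(0,\; \|M(x_a)-M(x_p)\|_2^2 \;-\; \|M(x_a)-M(x_n)\|_2^2 \;+\; \alpha\Big)
```

Since only seen visual features ever appear as x_a, x_p, or x_n during training, nothing in this objective controls how M behaves on unseen-class features, which is exactly the failure mode described above.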
In this work, we propose to address the entangled unseen visual features problem by mitigating the entangled input condition. To this end, we develop a novel multi-modal triplet loss (MMTL), which combines two modal features, visual and semantic, to form more complete class descriptions. Compared to the traditional TL, our MMTL can utilize multi-modal information, which benefits the disentanglement of the feature representations. Concretely, when the visual features of samples from different classes are close, MMTL can utilize the semantic information of the samples to distinguish them, and vice versa. Therefore, as shown in Fig. 1 (b), our MMTL is capable of searching disentangled representations for all the classes, even when some unseen visual features of different classes are close. In the testing stage for unseen samples, we apply sampling methods to obtain unseen representations. Note that, due to the instability of training a GAN together with a triplet-based loss, we train our MN and our generator separately in this work. To answer the second question, we further design the Disentangled Class Representation Generative Adversarial Network (DCR-GAN), trying to make the best use of the searched representations. DCR-GAN integrates the searched representations in all stages of the ZSL pipeline, i.e. feature synthesis, model training, and final classification. First, our generator synthesizes visual features from semantics and searched representations. Next, for model training, we point out that the adversarial loss used by general GAN-based ZSL methods is not appropriate for ZSL, since it attaches a classification loss to make synthesized visual features more discriminative. Although such a classification loss intends to make synthesized features more separable, our investigation indicates that it causes the real features to mix together [28]; such learned synthesized features lead to serious misclassification of real samples located near class boundaries. To tackle this problem, we propose the adversarial loss L_WGAN-SR, which integrates auxiliary information, i.e. semantic features and searched representations, into the critic loss instead of attaching a classification loss. In the final classification stage, we train three softmax classifiers in the visual, semantic, and searched representation spaces, respectively. Our results show that such integration can largely improve the accuracy on both seen and unseen classes. Overall, this work has three main contributions. 1. We argue that the traditional TL has an inherent shortcoming for ZSL, called the entangled unseen visual feature problem, i.e. the traditional TL cannot search appropriately disentangled representations for unseen classes. 2. We propose the MMTL to mitigate the entangled unseen feature problem. MMTL can search more disentangled representations than the traditional TL, which can be utilized to generate a more realistic distribution. 3. We propose a novel GAN-based framework named Disentangled Class Representation Generative Adversarial Network (DCR-GAN) for ZSL. DCR-GAN is capable of searching disentangled representations that are readily integrated into all parts of ZSL. DCR-GAN not only achieves high accuracy for classifying unseen images but also leads to significant improvement for classifying seen images.
Zero-shot Learning (ZSL) [5], [31], [32], [33], [34], [35], [22], [21], [36], [37] is an active research topic in multimedia, which aims at recognizing images from categories that are not included in the training set. Generalized Zero-Shot Learning, proposed in [13], considers a more practical situation in which both seen and unseen instances are mixed in the test data. One main challenge of ZSL/GZSL is that empirical risk minimization becomes unreliable [38], since unseen visual samples are not available in the training stage. This challenge also occurs in related problems, e.g. Few-Shot Learning [39]. To overcome this limitation, researchers utilize semantics as intermediate representations of unseen classes. Such semantics are often manually defined attributes [4], word vectors [40], and text descriptions [9]. Other works also utilize gaze embeddings, which are collected from non-experts, as semantics [17], [41]. Mainstream ZSL methods fall into embedding methods and generative methods. Embedding ZSL methods learn a visual-to-semantic embedding [18], [4], [10], a semantic-to-visual embedding [42], or a unified embedding space [43], [12]. Generative ZSL methods focus on leveraging Generative Adversarial Networks [19] and/or Variational Autoencoders (VAE) [44] to synthesize unseen visual features from semantic features. Obviously, the quality of the semantic features is key to all ZSL methods. Incomplete semantics would cause confusion in visual feature generation. Semantic Rectifying GAN (SRGAN) [22] utilizes manually designed distance functions to rectify over-smoothed semantic features by visual similarities. Some embedding methods [10] and VAE-based methods [27], [28] try to utilize the triplet loss to automatically search more discriminative representations from visual features. Although previous embedding methods and VAE-based methods have introduced TL to augment class representations, GAN-based methods hardly concentrate on utilizing representations searched by TL or other metric learning (partially due to their notorious training instability). In this work, we make an attempt to fill this gap. We focus on automatically searching disentangled representations to enhance the fidelity of synthesized features; this is significantly different from previous works that manually define constraints between the visual and semantic spaces. We also identify a novel problem of the traditional TL and design a framework that utilizes the searched representations in the training, synthesizing, and recognition stages. The traditional TL was introduced by Google in FaceNet to search face representations for recognition [45]. It takes a metric network (MN) to project an anchor feature a, a positive feature p, and a negative feature n into the searched representation space. The anchor and positive features share the same class, while the anchor and negative features belong to different classes. The MN aims to tighten the margin between positive pairs (a, p) and widen the margin between negative pairs (a, n). In previous ZSL works using TL, the MNs are all trained with single-modal visual features. For example, one embedding ZSL method, Latent Discriminative Features Learning (LDF) [10], utilizes TL to mine new latent semantic features from visual features. Among generative methods, Entropy-based Uncertainty calibration VAE (EUC-VAE) [27] and Over-Complete Distribution VAE (OCD-VAE) [28] integrate TL into a VAE to enhance the separability of the encoded representations.
EUC-VAE designs two TLs trained with visual features and semantic features, respectively. OCD-VAE develops an online batch TL to speed up gradient backpropagation, but it still adopts the same TL formulation as LDF. The above-mentioned methods all ignore the entangled unseen visual features problem. The traditional TL is highly fragile to out-of-training-distribution features, similar to other margin-based losses [29]. In ZSL, TL is required to search representations not only for seen classes but also for unseen classes. However, unseen visual features are entangled, and the traditional TL in ZSL lacks the ability to defend against overlaps among unseen distributions. Differently, in this paper, we develop the Multi-Modal Triplet Loss (MMTL), which mitigates the entanglement problem by concatenating multi-modal features to form more complete descriptions of unseen classes. Benefiting from the other modality's information, the unseen distributions no longer overlap. Consequently, our MN trained with MMTL can search disentangled representations that would remain entangled under the traditional TL. As such, MMTL better meets the intention of using margin-based losses, i.e. maximizing inter-class variation and minimizing intra-class variation. The training pipeline of our model follows two steps: (1) Pre-training the Metric Network (MN, or in short M) for searching disentangled representations, and the Semantic Rectifying Network (SRN, or in short R) for sampling searched representations from the semantic space. This step is described in Section III-B and summarized in Algorithm 1. (2) Training a visual feature generator G with a discriminator D to synthesize pseudo visual features. We also utilize two regressors F_1 and F_2 to enhance the multi-modal consistencies among the visual space, semantic space, and searched representation space. This step is introduced in Sections III-C and III-D and summarized in Algorithm 2. Once G is trained, we can train the final ZSL classifier with synthesized unseen features. Previous generative ZSL methods only train a visual ZSL classifier. Differently, in this work, we also train a semantic classifier and a searched representation classifier to make the best use of the auxiliary information. The test step is presented in Section III-E. Given an image I, the proposed model can recognize it as a specific class c even if that class is unseen during training. We take an instance {x, a_s, c_s} as input in the training stage, where x denotes the instance-level visual feature in the visual feature space V, a_s in the seen semantic space A_s is the class-level semantic feature extracted from attributes or other description information, and c_s denotes the corresponding seen class label. C_s is the set of seen class labels. In the testing stage, given an image, ZSL recognizes it as an unseen class c_u in C_u, while GZSL recognizes it as a class c_s+u (either seen or unseen). The unseen semantic space and the whole semantic space are denoted as A_u and A = A_s ∪ A_u, respectively. B. Multi-modal Triplet Loss and Sampling Strategy 1) Triplet loss in ZSL: One primary obstacle of generative methods for ZSL mainly comes from the incomplete class semantic features. Such semantic features confuse the model and yield less reliable visual features. To search more comprehensive class representations, previous works try to engage TL to search discriminative class representations.
Given an anchor visual feature x_a^s with its label, together with a positive sample of the same class and a negative sample of a different class, MMTL applies the triplet objective to the concatenated multi-modal feature e of each sample rather than to the visual feature alone, where e is a concatenated feature from multiple modalities, e.g. vision, semantics, and/or gaze embedding. In this work, we concatenate the visual and semantic modalities, i.e. e_s = [x_s, a_s], where [·, ·] denotes the concatenation operation. Compared to the traditional TL, our MN can utilize multi-modal information to search a latent space that is sharp enough to distinguish different unseen classes. Obviously, when the visual features of samples from different classes are close, the MN can utilize the semantic information of the samples to distinguish them, and vice versa. Additionally, we use weight decay to prevent over-fitting; the total loss for training the MN is the MMTL plus the weight decay term. Then, we can use sampling methods to obtain unseen representations from the MN for generator training. When recognizing unseen samples, we cannot know their labels before recognition. However, for a certain unseen class with N_c samples, we would need to recognize all visual features of that class to produce its class-level representation. This is completely reversed from the training process of the MN, whether using TL or MMTL. Thus, we need to design a representation sampling strategy to obtain unseen representations. The traditional LDF [10] method designs a sampling strategy by training a relationship matrix W that maps all seen semantics A_s to all unseen semantics A_u, i.e. A_u ≈ W · A_s. Then, the unseen representations M(A_u) can be obtained from the matrix and the searched seen representations, i.e. M(A_u) = W · M(A_s). Such a sampling strategy has several drawbacks: (1) It is a transductive method, since it uses unseen semantics for training. (2) It may bring incorrect semantic relationships into the searched representation space; for example, the semantics of some classes are too smooth to be distinguished [22]. (3) It forces the searched representations to have the same dimension as the semantic features. In contrast, instead of learning the whole seen-unseen relationship from semantics, we train another network, the Semantic Rectifying Network (SRN), to directly map semantics, whether seen or unseen, to the searched representation space, as shown in Fig. 4. Our sampling strategy does not need unseen semantics during training and is thus inductive. Moreover, it can flexibly determine the dimension of the searched representations. Algorithm 1 Training algorithm of MN and SRN (excerpt): for t = 1, ..., ||C_s||: sample all seen visual features x_s and the matching semantic features a_s of a certain class in C_s; compute the sampling loss L_sam using Eq. 4; end for; end for; finally, fix θ_M and θ_R. Algorithm 2 Training algorithm of the feature generator. Require: the maximal number of loops N_loop; the batch size m; the number of discriminator iterations per loop N_d; the number of generator iterations N_g; initial generator parameters θ_G; initial discriminator parameters θ_D; the trained Semantic Rectifying Network R; the gradient penalty hyper-parameter λ; the two reconstruction parameters λ_1 and λ_2; Adam hyper-parameters α, β_1, and β_2. For iter = 1, ..., N_loop: for N_d iterations, sample a mini-batch and compute the discriminator loss L_D using Eq. 9; then, for N_g iterations, sample a mini-batch of seen visual features x_s, the corresponding semantic features a_s, and random noise z, compute the reconstruction losses L_F1 and L_F2 using Eq. 7 and Eq. 8, and compute the generator loss L_G using Eq. 9.
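As a concrete illustration of the metric network M and the Semantic Rectifying Network R described above, the following is a minimal PyTorch sketch (our own illustration rather than the authors' released code; the class names, layer widths, feature dimensions, and margin value are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetricNet(nn.Module):
    """Metric network M: projects a concatenated [visual, semantic] feature into
    the searched representation space (Leaky ReLU hidden layer, linear output)."""
    def __init__(self, vis_dim=2048, sem_dim=312, hid_dim=1024, rep_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim + sem_dim, hid_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hid_dim, rep_dim),  # no activation on the output layer
        )

    def forward(self, x, a):
        e = torch.cat([x, a], dim=1)      # multi-modal feature e = [x, a]
        return self.net(e)

def mmtl_loss(M, x_a, a_a, x_p, a_p, x_n, a_n, margin=1.0):
    """Multi-modal triplet loss: anchor and positive share a class, negative differs."""
    z_a, z_p, z_n = M(x_a, a_a), M(x_p, a_p), M(x_n, a_n)
    return F.triplet_margin_loss(z_a, z_p, z_n, margin=margin)

class SRN(nn.Module):
    """Semantic Rectifying Network R: maps a class semantic vector to the searched
    space, so unseen class representations can be sampled as R(a_u) at test time."""
    def __init__(self, sem_dim=312, hid_dim=1024, rep_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim, hid_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hid_dim, rep_dim),
        )

    def forward(self, a):
        return self.net(a)
```

Once M is trained and frozen, R is fitted so that R(a_s) matches the average searched representation of each seen class, which is what later allows an unseen class representation to be obtained directly from its semantics as R(a_u).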
Our experiments also verify that the dimension of the searched representations can affect ZSL classification. Specifically, the MN is fixed after its training is finished. For any seen class, we minimize, together with weight decay, the l_2 loss between the class's average searched representation and its semantic feature rectified by the SRN (the sampling loss L_sam, Eq. 4), where N_c is the number of samples of the chosen seen class. Then, for an unseen class u, we can directly obtain its searched class representation by rectifying its semantic feature, i.e. R(a_u). The SRN and the MN each consist of a multi-layer perceptron (MLP) activated by Leaky ReLU, and the output layer does not apply any activation. The Generative Adversarial Network has demonstrated its usefulness for ZSL [9], [21], [22], [23], [24], [46], due to its promising ability to generate visual features from semantic features. The most popular generative ZSL methods are based on the conditional WGAN architecture with gradient penalty, which consists of a generator G, a discriminator D, and a classifier. The generator G synthesizes visual features from semantic features and noise drawn from a normal distribution z ∼ N(0, 1). The discriminator D distinguishes the synthesized samples x_fake from real samples x. The classifier predicts the probabilities of their labels, log P(y|x_fake) and log P(y|x). The classifier, G, and D are trained at the same time with the minimax objective L_WGAN, where E(·) denotes the expected value, x_fake = G(a, z), λ is the penalty coefficient, and the last term λ(‖∇_x̂ D(x̂)‖_2 − 1)^2 is the gradient penalty enforcing the Lipschitz constraint [47], in which x̂ = µ x + (1 − µ) x_fake with µ ∼ U(0, 1). However, indiscriminately feeding vague semantic features into a generator may undermine the generated visual features. With a pre-trained SRN model, we can easily obtain more distinguishable class representations. Therefore, we design a feature GAN model that translates these rectified semantic features into visual features. L_WGAN has two limitations: (1) It does not consider that the semantic and visual spaces are heterogeneous; some information may be missing completely in the other modal space. (2) Due to the entangled unseen visual features problem, many visual features lie in the boundary area of other classes. In order to reduce the classification risk, G only generates samples that are far from the classification boundaries and thus does not generate hard samples. To address this problem, we propose the WGAN with searched representations loss L_WGAN-SR, where x_fake = G(a, R(a), z). The training process of our DCR-GAN is described in Fig. 5. Our L_WGAN-SR differs from L_WGAN in two aspects: (1) We integrate the searched representations to align the two modalities. (2) We remove the classifier and leverage auxiliary information, i.e. semantic features and searched representations, to train a class-sensitive discriminator D. With the integrated auxiliary information, interlaced class boundaries are pushed apart, so our generator G need not worry about the classification risk of hard samples. With the above process, our model is able to synthesize proper visual features to some extent. However, there remains a significant problem, i.e. the generated visual features may have poor consistency with the input semantics and searched representations. Accordingly, we utilize two regression networks to keep the consistency of the semantic → visual → semantic mapping and of the searched representation → visual → searched representation mapping, respectively.
Specifically, the regression network F_1, working in tandem with the generator G, takes the generated feature x_fake as input and builds a consistency loss between the original semantics and the semantics reconstructed from the visual features. The regression network F_2 works in the same way for the searched representations. The two regression losses for F_1 and F_2 are denoted L_F1 (Eq. 7) and L_F2 (Eq. 8). Obviously, G ∘ F_1 and G ∘ F_2 can be considered as two auto-encoders [44], where A ∘ B denotes the composition of two mappings. The reconstruction G ∘ F_1 enhances the relationship between the synthetic visual features and the corresponding class semantics by minimizing the difference between the reconstructed and original semantic features. The searched representation reconstruction G ∘ F_2 works in the same way. Finally, by integrating the reconstruction losses, we obtain the new objective of our DCR-GAN (Eq. 9), where λ_1 and λ_2 are the two reconstruction parameters for the semantic and searched representation terms, respectively. (Fig. 6. Overview of zero-shot classification: the trained generator synthesizes unseen visual features, softmax classifiers are trained in the visual, semantic, and searched representation spaces, and their results are integrated.) With Eq. 9, we can train a GAN generator G that is able to synthesize virtual visual features for unseen categories. Then, we use the synthesized features to train a classifier, e.g. softmax, to recognize the real unseen instances. In other words, zero-shot learning is converted into a supervised classification problem performed in the visual space. As shown in Fig. 6, we train a softmax classifier on synthesized unseen visual features. We also use F_1 and F_2 to map all the real unseen visual features into the semantic space and the searched representation space, respectively. Analogously, a semantic classifier and a searched representation classifier are trained. Thus, in our model, we take full advantage of the visual space prediction score f_VS, the semantic space prediction score f_SS, and the searched representation space prediction score f_SRS. Finally, we obtain the final classification score by combining the three, where ω_1 and ω_2 are two parameters used to balance the three terms, as shown in Fig. 6. For GZSL, the main steps are the same as for ZSL; the only difference lies in the final classification process. Specifically, the test data of GZSL come from both the seen and unseen categories, so we need to train a classifier on both real seen features and synthesized unseen features. We evaluate our approach on four benchmark datasets for ZSL and GZSL: (1) Caltech-UCSD-Birds 200-2011 (CUB) [48], consisting of 11,788 images of 200 bird classes annotated with 312 binary attributes; (2) Animals with Attributes (AWA) [7], consisting of 30,475 images of 50 animal classes with 85 attributes; (3) Attribute Pascal and Yahoo (APY) [49], containing 15,339 images, 32 classes, and 64 attributes from both the PASCAL VOC 2008 dataset and the Yahoo image search engine; (4) SUN Attribute (SUN) [50], annotating 102 attributes on 14,340 images from 717 scene types. For all four datasets, we use the widely-used ZSL and GZSL splits proposed in [5]. For clarity, the statistics of these datasets are summarized in Table I.
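Before turning to the experimental comparison, here is a minimal sketch of the scoring stage described above (our own illustration; the exact fusion equation is not reproduced in this extract, so the weighted-sum form below, along with all function and variable names, is an assumption):

```python
import torch

def fused_scores(x_real, softmax_vis, softmax_sem, softmax_rep, F1, F2,
                 w1=0.5, w2=0.5):
    """Combine the three classifier scores for a batch of real test visual
    features x_real. softmax_vis/sem/rep are classifiers trained in the visual,
    semantic, and searched-representation spaces; F1 and F2 are the regressors
    that map visual features back into those auxiliary spaces."""
    f_vs = softmax_vis(x_real)         # visual-space prediction score f_VS
    f_ss = softmax_sem(F1(x_real))     # semantic-space prediction score f_SS
    f_srs = softmax_rep(F2(x_real))    # searched-representation score f_SRS
    # Assumed weighted-sum fusion with balance parameters w1 (omega_1) and w2 (omega_2).
    return f_vs + w1 * f_ss + w2 * f_srs

# Usage (hypothetical classifier modules): labels = fused_scores(x, clf_v, clf_s, clf_r, F1, F2).argmax(dim=1)
```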
We adopt the evaluation metrics proposed in [5]. For ZSL, we measure the average per-class top-1 accuracy (T1) of the unseen classes C_u, i.e. the top-1 accuracy is first computed for each unseen class and then averaged over all classes in C_u. For GZSL, we compute the average per-class top-1 accuracy of the seen classes C_s, denoted by S, the average per-class top-1 accuracy of the unseen classes C_u, denoted by U, and their harmonic mean, i.e. H = 2 × (S × U)/(S + U). We compare our model with the recent state of the art published in the last few years. Embedding methods include DEVISE [18] (NeurIPS13), DAP [4] (TPAMI14), SSE [43] (ICCV15), SJE [34] (CVPR15), ESZSL [51] (ICML15), ALE [6] (TPAMI16), LATEM [52] (CVPR16), SYNC [53] (CVPR16), SAE [32] (CVPR17), CRNet [55] (ICML19), and DVBE [56] (CVPR20); generative methods include GAZSL [9] (CVPR18), PSR [54] (CVPR18), f-CLSWGAN [21] (CVPR18), CDL [12] (ECCV18), SRGAN [22] (ICME19), GDAN [24] (CVPR19), DASCN [57] (NeurIPS19), AFC-GAN [58] (ACM MM19), OCD-VAE [28] (CVPR20), EUC-VAE [27] (arXiv21), and LsrGAN [25] (ECCV20). The results of ZSL and GZSL are reported in Table II. We first give an overall comparison with the state of the art in Section IV-B1 for ZSL and GZSL. Then, we compare our approach with others in Section IV-B2 and Section IV-B3 from two viewpoints, GAN-based and triplet-loss-based, respectively. 1) (Generalized) Zero-shot Learning: For ZSL, from the results reported in Table II, we can see that we obtain 2.5% and 0.1% improvements on APY and SUN, respectively, over the previous state of the art. For GZSL, we follow previous work [5] and report the harmonic mean, which avoids the effects of extreme values. For instance, we can see from the results that SAE obtains 1.8% for unseen and 77.1% for seen classes on CUB. Although its accuracy on seen classes is the best, its harmonic mean is only 3.5% due to the extremely low result on unseen categories. In a nutshell, the harmonic mean is high only if the accuracies on both seen and unseen categories are high. From the results, we can observe that our method achieves overall the best harmonic mean on all of the evaluations except SUN. This indicates that our DCR-GAN is a stable method that works well for both seen and unseen instances. In detail, DCR-GAN outperforms the state-of-the-art LsrGAN with 4.6%, 7.8%, and 1.5% improvements on AWA1, CUB, and SUN, respectively. Notably, our method achieves the best unseen-category accuracy on AWA1 and CUB, significantly outperforming AFC-GAN by 4.5% and 2.3%. It is worth mentioning that we do not use any explicit constraint to avoid the "train bias" problem, yet our proposed model can still surpass AFC-GAN, which uses a boundary loss to compel synthesized unseen features to stay far away from seen features. 2) Comparison with GAN-based Approaches: The GAN-based methods f-CLSWGAN, cycle-CLSWGAN, AFC-GAN, GAZSL, SRGAN, and LsrGAN share the same basic loss L_WGAN. Benefiting from the prior semantic features used to generate the missing data, these GAN-based methods obtain better performance than the earlier embedding approaches, though one recent embedding method, DVBE, also demonstrates excellent performance. In addition, GDAN is a new method that unifies generative, embedding, and metric learning in one basic architecture. Benefiting from this change in the basic architecture, GDAN demonstrates an excellent seen-class score S on SUN, which suggests that a better basic loss may be necessary. Our approach DCR-GAN adopts the proposed loss L_WGAN-SR.
Comparing our DCR-GAN with other GAN-based methods, we observe that our method achieves competitive performance in both ZSL and GZSL. More concretely, only SRGAN and our DCR-GAN recognize more than 70% of the unseen samples of AWA1 in ZSL. Although DCR-GAN cannot beat GDAN on SUN in the GZSL setting, it attains 63.7% and performs much better than GDAN and all the other methods on SUN for ZSL. This outstanding performance of our DCR-GAN demonstrates the effectiveness of our proposed basic loss L_WGAN-SR. It is noted that GDAN appears to perform much better than all the other methods, including DCR-GAN, on SUN in terms of H. Exploiting an adversarial loss L_GDAN to train the model, GDAN may overfit seen samples on almost all the datasets. This results in its excellent seen-class performance S (particularly S = 89.9% on SUN), while its accuracy U appears much worse: lower than almost all the other GAN-based methods. Such a drawback may limit its application in ZSL/GZSL, where recognition of unseen samples may be more crucial. In contrast, our proposed adversarial loss L_WGAN-SR appears to achieve an outstanding balance between seen and unseen classes. As a matter of fact, our method outperforms GDAN on CUB and APY in both U and H; it is also better than GDAN in terms of U on SUN. 3) Comparison with Triplet-loss-based Approaches: Besides our DCR-GAN, the methods OCD-VAE and EUC-VAE also introduce TL to search discriminative latent features. Our DCR-GAN exploits the proposed MMTL, while the other methods use the traditional single-modal TL. As observed, for ZSL, DCR-GAN outperforms the others on APY and SUN; for GZSL, the proposed method demonstrates the best performance on AWA1, CUB, and APY. It is noted that, although DCR-GAN does not perform as well as them on SUN for GZSL, our model yields better performance for unseen recognition: our DCR-GAN recognizes 47.1% of the unseen samples in SUN, while OCD-VAE and EUC-VAE only recognize 44.8% and 35.0%, respectively. To further verify the effectiveness of our approach, we conduct an ablation study on the searched class representations. Table III reports three variants in the GZSL setting. We also provide the convergence curves of the three visual classifiers in Fig. 7. Variant A is the traditional WGAN-based ZSL method trained with L_WGAN, where x_fake = G(a, z). To fairly verify the effectiveness of the searched representations, we also train variant B: we only change x_fake to G(a, R(a), z) and add the corresponding reconstruction loss, so that, in a word, variant B is trained with the loss of variant A plus this reconstruction term, where x_fake = G(a, R(a), z). These results validate the effectiveness of our proposed L_WGAN-SR. In order to provide a qualitative evaluation of our proposed DCR-GAN, we first visualize two kinds of unseen class representations, as shown in Fig. 8, where (a) shows augmented semantics learned by a triplet loss only from the visual space, i.e. the semantic augmentation method of LDF [10]. Clearly, these augmented semantics are muddled: various categories are mixed in the representation space. Fig. 8 (b) shows the visualization of our searched class representations, which are learned from both the visual and semantic spaces. As observed, by utilizing the original semantic information, our model decouples the searched class representations, with clear boundaries between the categories. Note that our model does not use any unseen features in training.
This confirms our idea that the original semantic information is also necessary in the augmented semantic searching process. In addition, we visualize synthetic image features along with real image features. The results are illustrated in Fig. 9. Since the numbers of categories of CUB and SUN are too large to visualize, we only show the results for the seen classes of APY and the unseen classes of APY, AWA1, and CUB. For each class we synthesize 100 features, and then we use t-SNE [59] to reduce the dimension to two for visualization. The synthesized features of the i-th class are marked by fi, and the real features by ri. It is evident that our searched class representations successfully help DCR-GAN synthesize more realistic visual features than those of the baseline. Some features synthesized by our DCR-GAN are almost the same as the real features, e.g. the 14-th seen class of APY, the 1-st unseen class of APY, the 5-th unseen class of AWA1, and the 28-th unseen class of CUB. This visualization again illustrates the identified entangled unseen visual feature problem. More importantly, our searched class representations help the generative model fit a more realistic distribution. In this paper, we argue that the entangled unseen visual feature problem exists in current Zero-shot Learning (ZSL) and Generalized ZSL. We propose our generative framework DCR-GAN to address this problem. DCR-GAN contains two novel losses: a multi-modal triplet loss (MMTL) and an adversarial loss L_WGAN-SR. Compared with the traditional TL, MMTL is capable of searching more decoupled unseen class representations, and our designed L_WGAN-SR reduces the risk associated with generating hard samples. Benefiting from MMTL and L_WGAN-SR, our model learns a more realistic distribution and generates more disentangled features. With the searched class representations, our DCR-GAN synthesizes visual features from semantic features and searched class representations. Given the synthesized visual features, we train a softmax classifier in the visual space; additionally, we ensemble the semantic and searched class representation softmax classifiers with the visual one. Experimental results show that the proposed approach achieves state-of-the-art performance on the ZSL task and boosts the performance by a large margin for Generalized ZSL.
[1] Deep residual learning for image recognition
[2] Attend and imagine: Multi-label image classification with visual attention and recurrent neural networks
[3] Picking neural activations for fine-grained recognition
[4] Attribute-based classification for zero-shot visual object categorization
[5] Zero-shot learning - the good, the bad and the ugly
[6] Label-embedding for image classification
[7] Learning to detect unseen object classes by between-class attribute transfer
[8] Learning deep representations of fine-grained visual descriptions
[9] A generative adversarial approach for zero-shot learning from noisy texts
[10] Discriminative learning of latent features for zero-shot recognition
[11] Selective zero-shot classification with augmented attributes
[12] Learning class prototypes via structure alignment for zero-shot recognition
[13] An empirical study and analysis of generalized zero-shot learning for object recognition in the wild
[14] SLEDGE-Z: A zero-shot baseline for COVID-19 literature search
[15] Can autonomous vehicles identify, recover from, and adapt to distribution shifts
[16] ZeroVirus: Zero-shot vehicle route understanding system for intelligent transportation
[17] Gaze embeddings for zero-shot image classification
[18] DeViSE: A deep visual-semantic embedding model
[19] Generative adversarial nets
[20] Unsupervised object transfiguration with attention
[21] Feature generating networks for zero-shot learning
[22] SR-GAN: Semantic rectifying generative adversarial network for zero-shot learning
[23] Multi-modal cycle-consistent generalized zero-shot learning
[24] Generative dual adversarial network for zero-shot learning
[25] Leveraging seen and unseen semantic relationships for generative zero-shot learning
[26] Distance metric learning for large margin nearest neighbor classification
[27] Entropy-based uncertainty calibration for generalized zero-shot learning
[28] Generalized zero-shot learning via over-complete distribution
[29] In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation
[30] Multi-domain multi-task rehearsal for lifelong learning
[31] Generative zero-shot learning via low-rank embedded semantic dictionary
[32] Semantic autoencoder for zero-shot learning
[33] Zero-shot learning via joint latent similarity embedding
[34] Evaluation of output embeddings for fine-grained image classification
[35] Zero-shot hashing via transferring supervised knowledge
[36] Joint intermodal and intramodal label transfers for extremely rare or unseen classes
[37] Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation
[38] Generalizing from a few examples: A survey on few-shot learning
[39] Attribute-guided feature learning for few-shot image recognition
[40] Distributed representations of words and phrases and their compositionality
[41] Goal-oriented gaze estimation for zero-shot learning
[42] Ridge regression, hubness, and zero-shot learning
[43] Zero-shot learning via semantic similarity embedding
[44] Auto-encoding variational Bayes
[45] FaceNet: A unified embedding for face recognition and clustering
[46] From zero-shot learning to conventional supervised classification: Unseen visual data synthesis
[47] Wasserstein generative adversarial networks
[48] The Caltech-UCSD Birds-200-2011 dataset
[49] Describing objects by their attributes
[50] The SUN attribute database: Beyond categories for deeper scene understanding
[51] An embarrassingly simple approach to zero-shot learning
[52] Latent embeddings for zero-shot classification
[53] Synthesized classifiers for zero-shot learning
[54] Preserving semantic relations for zero-shot learning
[55] Co-representation network for generalized zero-shot learning
[56] Domain-aware visual bias eliminating for generalized zero-shot learning
[57] Dual adversarial semantics-consistent network for generalized zero-shot learning
[58] Alleviating feature confusion for generative zero-shot learning
[59] Visualizing data using t-SNE