Semantic Feature Extraction for Generalized Zero-shot Learning
Junhan Kim, Kyuhong Shim, and Byonghyo Shim
2021-12-29

Generalized zero-shot learning (GZSL) is a technique to train a deep learning model to identify unseen classes using attributes. In this paper, we put forth a new GZSL technique that greatly improves GZSL classification performance. The key idea of the proposed approach, henceforth referred to as semantic feature extraction-based GZSL (SE-GZSL), is to use the semantic feature, which contains only attribute-related information, in learning the relationship between the image and the attribute. In doing so, we can remove the interference, if any, caused by attribute-irrelevant information contained in the image feature. To train a network extracting the semantic feature, we present two novel loss functions: 1) a mutual information-based loss to capture all the attribute-related information in the image feature and 2) a similarity-based loss to remove unwanted attribute-irrelevant information. From extensive experiments on various datasets, we show that the proposed SE-GZSL technique outperforms conventional GZSL approaches by a large margin.

Image classification is a long-standing yet important task with a wide range of applications such as autonomous driving, industrial automation, medical diagnosis, and biometric identification (Fujiyoshi, Hirakawa, and Yamashita 2019; Ren, Hung, and Tan 2017; Ronneberger, Fischer, and Brox 2015; Sun et al. 2013). In solving the task, supervised learning (SL) techniques have been widely used owing to their superior performance (Simonyan and Zisserman 2014; He et al. 2016). A well-known drawback of SL is that a large amount of training data is required for each and every class to be identified. Unfortunately, in many practical scenarios, it is difficult to collect training data for certain classes (e.g., endangered species and newly observed species such as variants of COVID-19). When there are unseen classes for which training data is unavailable, SL-based models are biased towards the seen classes, impeding the identification of the unseen classes.

Recently, to overcome this drawback, a technique to train a classifier using manually annotated attributes (e.g., color, size, and shape; see Fig. 1) has been proposed (Lampert, Nickisch, and Harmeling 2009; Chao et al. 2016).

Figure 1: Images and attributes for different bird species sampled from the CUB dataset (Welinder et al. 2010).

The key idea of this technique, dubbed generalized zero-shot learning (GZSL), is to learn the relationship between the image and the attribute from seen classes and then use the trained model in the identification of unseen classes. In (Akata et al. 2015), for example, an approach to identify unseen classes by measuring the compatibility between the image feature and the attribute has been proposed. In (Mishra et al. 2018), a network synthesizing the image feature from the attribute has been employed to generate training data for unseen classes. In extracting the image feature, a network trained on the classification task (e.g., ResNet (He et al. 2016)) has been widely used. A potential drawback of this extraction method is that the image feature might contain attribute-irrelevant information (e.g., the human fingers in Fig. 1), disturbing the process of learning the relationship between the image and the attribute (Tong et al. 2019; Han, Fu, and Yang 2020; Li et al. 2021).
In this paper, we propose a new GZSL technique that removes the interference caused by attribute-irrelevant information. The key idea of the proposed approach is to extract the semantic feature, i.e., the feature containing only attribute-related information, from the image feature and then use it in learning the relationship between the image and the attribute. In extracting the semantic feature, we use a modified autoencoder consisting of two encoders, viz., semantic and residual encoders (see Fig. 2). In a nutshell, the semantic encoder captures all the attribute-related information in the image feature and the residual encoder catches the attribute-irrelevant information.

In the conventional autoencoder, only the reconstruction loss (the difference between the input and the reconstructed input) is used for training. In our approach, to encourage the semantic encoder to capture the attribute-related information only, we use two novel loss functions on top of the reconstruction loss. First, we employ the mutual information (MI)-based loss to maximize (minimize) the MI between the semantic (residual) encoder output and the attribute. Since MI is a metric measuring the level of dependency between two random variables, by exploiting the MI-based loss we can encourage the semantic encoder to capture the attribute-related information and at the same time discourage the residual encoder from capturing any attribute-related information. As a result, all the attribute-related information is captured solely by the semantic encoder. Second, we use the similarity-based loss to prevent the semantic encoder from capturing any attribute-irrelevant information. For example, when a bird image contains human fingers (see Fig. 1), we do not want features related to the fingers to be included in the semantic encoder output. To this end, we maximize the similarity between the semantic encoder outputs of images belonging to the same class (bird images in our example). Since attribute-irrelevant features are contained only in a few image samples (e.g., human fingers appear in only a few bird images), maximizing the similarity between the semantic encoder outputs of the same class removes attribute-irrelevant information from the semantic encoder output.
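To make the two-encoder design concrete, the following PyTorch sketch shows one plausible instantiation of the image feature decomposition network. The layer sizes, class names, and the single-hidden-layer layout are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the two-encoder autoencoder described above (assumptions:
# 2048-dim ResNet features, 512-dim codes, one hidden layer of 4096 units).
import torch
import torch.nn as nn

class FeatureDecomposer(nn.Module):
    def __init__(self, feat_dim=2048, z_dim=512, hidden=4096):
        super().__init__()
        # Semantic encoder: intended to keep attribute-related information.
        self.semantic_enc = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.02), nn.Linear(hidden, z_dim))
        # Residual encoder: intended to absorb attribute-irrelevant information.
        self.residual_enc = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.02), nn.Linear(hidden, z_dim))
        # Decoder reconstructs the image feature from both codes.
        self.decoder = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.LeakyReLU(0.02), nn.Linear(hidden, feat_dim))

    def forward(self, x):
        z_s = self.semantic_enc(x)   # attribute-related code
        z_r = self.residual_enc(x)   # attribute-irrelevant code
        x_hat = self.decoder(torch.cat([z_s, z_r], dim=-1))
        return z_s, z_r, x_hat
```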
From extensive experiments using various benchmark datasets (AwA1, AwA2, CUB, and SUN), we demonstrate that the proposed approach outperforms conventional GZSL techniques by a large margin. For example, for the AwA2 and CUB datasets, our model achieves a 2% improvement in GZSL classification accuracy over the state-of-the-art techniques.

The main task in GZSL is to learn the relationship between the image and the attribute from seen classes and then use it in the identification of unseen classes. Early GZSL works focused on training a network measuring the compatibility score between the image feature and the attribute (Akata et al. 2015; Frome et al. 2013). Once the network is trained properly, images can be classified by identifying the attribute achieving the maximum compatibility score. Recently, generative model-based GZSL approaches have been proposed (Mishra et al. 2018; Xian et al. 2018). The key idea of these approaches is to generate synthetic image features of unseen classes from the attributes by employing a generative model (Mishra et al. 2018; Xian et al. 2018). As a generative model, the conditional variational autoencoder (CVAE) (Kingma and Welling 2013) and the conditional Wasserstein generative adversarial network (CW-GAN) (Arjovsky, Chintala, and Bottou 2017) have been popularly used. By exploiting the generated image features of unseen classes as training data, a classification network identifying unseen classes can be trained in a supervised manner.

Over the years, many efforts have been made to improve the performance of the generative model. In (Xian et al. 2019; Schonfeld et al. 2019; Gao et al. 2020), an approach to combine multiple generative models (e.g., CVAE and CW-GAN) has been proposed. In (Felix et al. 2018; Ni, Zhang, and Xie 2019), an additional network estimating the attribute from the image feature has been used to make sure that the synthetic image features satisfy the attribute of unseen classes. In (Xian et al. 2018; Vyas, Venkateswara, and Panchanathan 2020; Li et al. 2019), an additional image classifier has been used in the generative model training to generate distinct image features for different classes. Our approach is conceptually similar to the generative model-based approaches in the sense that we generate synthetic image features of unseen classes using a generative model. The key distinction of the proposed approach is that we use features containing only attribute-related information in the classification, thereby removing the interference, if any, caused by attribute-irrelevant information.

Mathematically, the MI I(u, v) between two random variables u and v is defined as

I(u, v) = \int \int p(u, v) \log \frac{p(u, v)}{p(u) p(v)} \, du \, dv,   (1)

where p(u, v) is the joint probability density function (PDF) of u and v, and p(u) and p(v) are the marginal PDFs of u and v, respectively. In practice, it is very difficult to compute the exact value of MI since the joint PDF p(u, v) is generally unknown and the integrals in (1) are often intractable. To approximate MI, various MI estimators have been proposed (Oord, Li, and Vinyals 2018; Cheng et al. 2020). Representative estimators include InfoNCE (Oord, Li, and Vinyals 2018) and the contrastive log-ratio upper bound (CLUB) (Cheng et al. 2020), defined as

I_{InfoNCE}(u, v) = E\Big[ \frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{f(u_i, v_i)}}{\frac{1}{N} \sum_{j=1}^{N} e^{f(u_i, v_j)}} \Big],   (2)

I_{CLUB}(u, v) = E_{p(u,v)}[ \log p(v|u) ] - E_{p(u)} E_{p(v)}[ \log p(v|u) ],   (3)

where f is a pre-defined score function measuring the compatibility between u and v, and p(v|u) is the conditional PDF of v given u, which is often approximated by a neural network. The relationship between MI, InfoNCE, and CLUB is given by

I_{InfoNCE}(u, v) \le I(u, v) \le I_{CLUB}(u, v).   (4)

Recently, InfoNCE and CLUB have been used to strengthen or weaken the dependence between different parts of a neural network. For example, when one tries to enforce independence between u and v, that is, to reduce I(u, v), one can minimize the upper bound I_{CLUB}(u, v) of MI (Yuan et al. 2021). Conversely, when one wants to strengthen the dependence between u and v, that is, to increase I(u, v), one can maximize the lower bound I_{InfoNCE}(u, v) of MI (Tschannen et al. 2019).
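The two estimators in (2) and (3) can be computed on a mini-batch as in the sketch below. The bilinear score f(u, v) = u^T W v follows (Oord, Li, and Vinyals 2018), while `q_net` stands in for the variational network approximating p(v|u) in CLUB; its Gaussian output and the batch-based negatives are assumptions of this sketch.

```python
# Hedged mini-batch estimators for InfoNCE (lower bound) and CLUB (upper bound).
import torch
import torch.nn.functional as F

def infonce_lower_bound(u, v, W):
    # u: (N, d_u), v: (N, d_v), W: (d_u, d_v); scores[i, j] = f(u_i, v_j).
    scores = u @ W @ v.t()
    # Positive pairs sit on the diagonal; off-diagonal entries act as negatives.
    return F.log_softmax(scores, dim=1).diagonal().mean()

def club_upper_bound(u, v, q_net):
    # q_net(u) returns (mu, logvar) of a Gaussian approximation to p(v|u).
    mu, logvar = q_net(u)
    pos = (-(v - mu) ** 2 / logvar.exp()).sum(dim=1)            # log q(v_i | u_i)
    neg = (-(v.unsqueeze(0) - mu.unsqueeze(1)) ** 2
           / logvar.exp().unsqueeze(1)).sum(dim=2).mean(dim=1)  # E_j[log q(v_j | u_i)]
    # Normalization constants cancel between the two terms, so they are omitted.
    return (pos - neg).mean() / 2
```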
In this section, we present the proposed GZSL technique called semantic feature extraction-based GZSL (SE-GZSL). We first discuss how to extract the semantic feature from the image feature and then delve into the GZSL classification using the extracted semantic feature. In extracting the semantic feature from the image feature, the proposed SE-GZSL technique uses a modified autoencoder architecture where two encoders, called the semantic and residual encoders, are used to capture the attribute-related information and the attribute-irrelevant information, respectively (see Fig. 2). As mentioned, in the autoencoder training, we use two loss functions: 1) an MI-based loss to encourage the semantic encoder to capture all attribute-related information and 2) a similarity-based loss to discourage the semantic encoder from capturing attribute-irrelevant information. In this subsection, we discuss the overall training loss with emphasis on these two.

MI-based Loss To make sure that all the attribute-related information is contained in the semantic encoder output, we use MI in the autoencoder training. Specifically, we maximize the MI between the semantic encoder output and the attribute, which is given by manual annotation. At the same time, to prevent the residual encoder from capturing attribute-related information, we minimize the MI between the residual encoder output and the attribute. Let z_s and z_r be the semantic and residual encoder outputs corresponding to the image feature x, and let a be the image attribute (see Fig. 2). Then our training objective can be expressed as

\min \; -\lambda_s I(z_s, a) + \lambda_r I(z_r, a),   (5)

where \lambda_s and \lambda_r (\lambda_s, \lambda_r > 0) are weighting coefficients. Since the computation of MI is not tractable, we use InfoNCE and CLUB (see (2) and (3)) as surrogates of MI. In our approach, to minimize the objective function in (5), we express its upper bound using InfoNCE and CLUB and then train the autoencoder in a way to minimize the upper bound. Using the relationship between MI and its estimators in (4), the upper bound L_{MI} of the objective function in (5) is

L_{MI} = -\lambda_s I_{InfoNCE}(z_s, a) + \lambda_r I_{CLUB}(z_r, a).   (6)

Let Y_s be the set of seen classes, a_c be the attribute of a seen class c \in Y_s, \{x_c^{(i)}\}_{i=1}^{N_c} be the image features of class c, and z_{c,s}^{(i)} and z_{c,r}^{(i)} be the semantic and residual encoder outputs corresponding to the input image feature x_c^{(i)}, respectively. Then L_{MI} can be expressed as

L_{MI} = -\frac{\lambda_s}{N} \sum_{c \in Y_s} \sum_{i=1}^{N_c} \log \frac{e^{f(z_{c,s}^{(i)}, a_c)}}{\frac{1}{|Y_s|} \sum_{c' \in Y_s} e^{f(z_{c,s}^{(i)}, a_{c'})}} + \frac{\lambda_r}{N} \sum_{c \in Y_s} \sum_{i=1}^{N_c} \Big( \log p(a_c | z_{c,r}^{(i)}) - \frac{1}{|Y_s|} \sum_{c' \in Y_s} \log p(a_{c'} | z_{c,r}^{(i)}) \Big),   (7)

where N = \sum_{c \in Y_s} N_c is the total number of training image features.

Similarity-based Loss We now discuss the similarity-based loss, which enforces the semantic encoder not to capture any attribute-irrelevant information. Since images belonging to the same class have the same attribute, the attribute-related features of the same class should be more or less similar. This means that if the semantic encoder captures attribute-related information only, then the similarity between semantic encoder outputs of the same class should be large. Inspired by this observation, to remove attribute-irrelevant information from the semantic encoder output, we train the semantic encoder in a way to maximize the similarity between outputs of the same class:

A = \sum_{c \in Y_s} \sum_{i \neq j} \exp( sim(z_{c,s}^{(i)}, z_{c,s}^{(j)}) ),   (8)

where the similarity is measured using the cosine-similarity function defined as

sim(u, v) = \frac{u^T v}{\|u\| \|v\|}.   (9)

Also, we minimize the similarity between semantic encoder outputs of different classes to obtain sufficiently distinct semantic encoder outputs for different classes:

B = \sum_{c \in Y_s} \sum_{c' \neq c} \sum_{i, j} \exp( sim(z_{c,s}^{(i)}, z_{c',s}^{(j)}) ).

Using the fact that one can maximize A and minimize B at the same time by minimizing -\log \frac{1}{1 + B/A} = -\log \frac{A}{A + B}, we obtain the similarity-based loss as

L_{sim} = -\log \frac{A}{A + B}.   (10)

Overall Loss By adding the conventional reconstruction loss L_{recon} of the autoencoder to the MI-based loss L_{MI} and the similarity-based loss L_{sim}, we obtain the overall loss function as

L_{total} = L_{recon} + L_{MI} + \lambda_{sim} L_{sim},   (11)

where \lambda_{sim} is a weighting coefficient and L_{recon} is the reconstruction loss given by

L_{recon} = \frac{1}{N} \sum_{c \in Y_s} \sum_{i=1}^{N_c} \| x_c^{(i)} - \hat{x}_c^{(i)} \|_2^2.   (12)

Here, \hat{x}_c^{(i)} is the image feature reconstructed from the semantic and residual encoder outputs (z_{c,s}^{(i)}, z_{c,r}^{(i)}).
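For concreteness, here is a hedged PyTorch sketch of the similarity-based loss in (8)-(10), evaluated over a mini-batch. The batch-level pair construction is an assumption of this sketch (the paper aggregates over all training samples), and each batch is assumed to contain at least two samples per class so that the same-class term A is nonzero.

```python
# One plausible mini-batch instantiation of L_sim = -log(A / (A + B)).
import torch
import torch.nn.functional as F

def similarity_loss(z_s, labels):
    # z_s: (N, d) semantic codes; labels: (N,) class indices.
    z = F.normalize(z_s, dim=1)                    # unit norm -> dot = cosine sim
    sim = torch.exp(z @ z.t())                     # exp(sim(z_i, z_j)) for all pairs
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=z.device)
    pos = (sim * (same - eye)).sum(dim=1)          # A: same-class pairs, i != j
    neg = (sim * (1.0 - same)).sum(dim=1)          # B: different-class pairs
    return -torch.log(pos / (pos + neg)).mean()    # -log(A / (A + B)), per anchor
```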
So far, we have discussed how to extract the semantic feature from the image feature. We now discuss how to perform the GZSL classification using the semantic feature. In a nutshell, we synthesize semantic feature samples for unseen classes from their attributes. Once the synthetic samples are generated, a semantic classifier identifying unseen classes from the semantic feature is trained in a supervised manner.

Semantic Feature Generation To synthesize semantic feature samples for unseen classes, we first generate image features from the attributes of unseen classes and then extract the semantic features from the synthetic image features using the semantic encoder (see Fig. 3). In synthesizing the image feature, we employ the WGAN, which mitigates the unstable training of GANs by exploiting a Wasserstein distance-based loss function (Arjovsky, Chintala, and Bottou 2017). The main component of the WGAN is a generator G synthesizing the image feature \hat{x}_c from a random noise vector \epsilon \sim N(0, I) and the image attribute a_c (i.e., \hat{x}_c = G(\epsilon, a_c)). Conventionally, the WGAN is trained to minimize the Wasserstein distance between the distributions of the real image feature x_c and the generated image feature \hat{x}_c, given by

L_{G,WGAN} = E[D(\hat{x}_c, a_c)] - E[D(x_c, a_c)] + \lambda_{gp} E[ ( \| \nabla_{\tilde{x}_c} D(\tilde{x}_c, a_c) \|_2 - 1 )^2 ],   (13)

where D is an auxiliary network (called the critic), \tilde{x}_c = \alpha x_c + (1 - \alpha) \hat{x}_c with \alpha \sim U(0, 1), and \lambda_{gp} is the regularization coefficient (a.k.a. the gradient penalty coefficient) (Gulrajani et al. 2017). In our scheme, to make sure that the semantic feature \hat{z}_{c,s} obtained from \hat{x}_c is similar to the real semantic feature z_{c,s}, we additionally use the following losses in the WGAN training:

L_{G,MI} = -I_{InfoNCE}(\hat{z}_{c,s}, a),   (14)

L_{G,sim} = -\log \frac{\hat{A}}{\hat{A} + \hat{B}},   (15)

where \hat{A} and \hat{B} are the counterparts of A and B in (10) computed using the synthetic semantic features. We note that these losses are analogous to the losses with respect to the real semantic feature z_{c,s} in (6) and (10), respectively. By combining (13), (14), and (15), we obtain the overall loss function as

L_G = L_{G,WGAN} + \lambda_{G,MI} L_{G,MI} + \lambda_{G,sim} L_{G,sim},   (16)

where \lambda_{G,MI} and \lambda_{G,sim} are weighting coefficients. After the WGAN training, we use the generator G and the semantic encoder E_s to synthesize semantic feature samples of unseen classes. Specifically, for each unseen class u \in Y_u, we generate the semantic feature z_{u,s} by synthesizing the image feature \hat{x}_u = G(\epsilon, a_u) using the generator and then feeding it to the semantic encoder (see Fig. 3):

z_{u,s} = E_s( G(\epsilon, a_u) ).   (17)

By resampling the noise vector \epsilon \sim N(0, I), a sufficient number of synthetic semantic features can be generated.

Figure 3: Illustration of the synthetic semantic feature generation for unseen classes.

Semantic Feature-based Classification After generating synthetic semantic feature samples for all unseen classes, we train the semantic feature classifier using a supervised learning model (e.g., a softmax classifier, a support vector machine, or a nearest neighbor classifier). Suppose, for example, that the softmax classifier is used as the classification model. Let \{z_{u,s}^{(i)}\}_{i=1}^{N_u} be the set of synthetic semantic feature samples for the unseen class u. Then the semantic feature classifier is trained to minimize the cross-entropy loss

L_{cls} = -\sum_{u \in Y_u} \sum_{i=1}^{N_u} \log p(u | z_{u,s}^{(i)}),   (18)

where

p(y | z) = \frac{\exp(w_y^T z + b_y)}{\sum_{y'} \exp(w_{y'}^T z + b_{y'})},   (19)

and w_y and b_y are the weight and bias parameters of the softmax classifier to be learned.
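The sketch below walks through this pipeline: sample noise, synthesize an image feature with the trained generator, pass it through the semantic encoder as in (17), and fit a softmax classifier as in (18)-(19). The objects `G`, `semantic_enc`, and the attribute tensor `attrs`, as well as `G.noise_dim` and `n_per_class`, are assumptions of this sketch rather than names from the paper.

```python
# Hedged sketch of synthetic semantic feature generation and classifier training.
import torch
import torch.nn as nn

@torch.no_grad()
def synthesize_semantic_features(G, semantic_enc, attrs, n_per_class=300):
    feats, labels = [], []
    for u, a_u in enumerate(attrs):                    # attrs: (n_unseen, attr_dim)
        eps = torch.randn(n_per_class, G.noise_dim)    # eps ~ N(0, I)
        x_hat = G(eps, a_u.expand(n_per_class, -1))    # synthetic image features
        feats.append(semantic_enc(x_hat))              # z_{u,s} = E_s(G(eps, a_u))
        labels.append(torch.full((n_per_class,), u, dtype=torch.long))
    return torch.cat(feats), torch.cat(labels)

def train_softmax_classifier(z, y, n_classes, epochs=20):
    clf = nn.Linear(z.size(1), n_classes)              # w_y, b_y in (19)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(z), y)  # cross-entropy loss in (18)
        loss.backward()
        opt.step()
    return clf
```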
There have been previous efforts to extract the semantic feature from the image feature (Tong et al. 2019; Han, Fu, and Yang 2020; Li et al. 2021; Chen et al. 2021). While our approach might look similar to some of these works, e.g., (Chen et al. 2021), in the sense that an autoencoder-based image feature decomposition is used for the semantic feature extraction, our work is clearly distinct from those works in two respects. First, we use a different training strategy in capturing the attribute-related information. In our approach, to make sure that the semantic encoder output contains all the attribute-related information, we use two complementary loss terms: 1) a loss term encouraging the semantic encoder to capture the attribute-related information and 2) a loss term discouraging the residual encoder from capturing any attribute-related information (see (5)). In contrast, a training loss removing the attribute-related information from the residual encoder output has not been used in those works (e.g., Chen et al. 2021). Second, we employ a new training loss L_{sim} to remove the attribute-irrelevant information from the semantic encoder output (see (10)), for which there is no counterpart in (Chen et al. 2021).

Datasets In our experiments, we evaluate the performance of our model using four benchmark datasets: AwA1, AwA2, CUB, and SUN. The AwA1 and AwA2 datasets contain 50 classes of animal images annotated with 85 attributes (Lampert, Nickisch, and Harmeling 2009). The CUB dataset contains 200 classes of bird images annotated with 312 attributes (Welinder et al. 2010). The SUN dataset contains 717 classes of scene images annotated with 102 attributes (Patterson and Hays 2012). In dividing the total classes into seen and unseen classes, we adopt the conventional dataset split presented in (Xian, Schiele, and Akata 2017).

Implementation Details As in (Xian et al. 2018; Schonfeld et al. 2019), we use ResNet-101 (He et al. 2016) as the pre-trained classification network and fix it during training. We implement all the networks in SE-GZSL (the semantic encoder, residual encoder, and decoder in the image feature decomposition network, and the generator and critic in the WGAN) using a multilayer perceptron (MLP) with one hidden layer, as in (Xian et al. 2018, 2019). We set the number of hidden units to 4096 and use LeakyReLU with a negative slope of 0.02 as the nonlinear activation function. For the output layer of the generator, the ReLU activation is used since the image feature extracted by ResNet is non-negative. As in (Oord, Li, and Vinyals 2018), we define the score function f in (6) as f(z_s, a) = z_s^T W a, where W is a weight matrix to be learned. Also, as in (Cheng et al. 2020), we approximate the conditional PDF p(a | z_r) in (6) using a variational encoder consisting of two hidden layers. The gradient penalty coefficient in the WGAN loss L_{G,WGAN} is set to \lambda_{gp} = 10, as suggested in the WGAN-GP paper (Gulrajani et al. 2017). We set the weighting coefficients in (7), (11), and (16) to \lambda_s = 20, \lambda_r = 50, \lambda_{sim} = 1, \lambda_{G,MI} = 1, and \lambda_{G,sim} = 0.025.
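Under the implementation details above, a plausible PyTorch sketch of the WGAN generator and critic looks as follows. The noise dimension is illustrative and the attribute dimension is dataset-dependent (e.g., 85 for AwA); this is an assumption-based sketch, not the authors' implementation.

```python
# One-hidden-layer MLP generator and critic mirroring the stated settings
# (4096 hidden units, LeakyReLU(0.02), ReLU output on the generator).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=512, attr_dim=85, feat_dim=2048, hidden=4096):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(noise_dim + attr_dim, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, feat_dim), nn.ReLU())  # non-negative, like ResNet features

    def forward(self, eps, a):
        return self.net(torch.cat([eps, a], dim=-1))

class Critic(nn.Module):
    def __init__(self, attr_dim=85, feat_dim=2048, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + attr_dim, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, 1))                    # scalar Wasserstein score

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=-1))
```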
We first investigate whether the image classification performance can be improved by exploiting the semantic feature. To this end, we train two image classifiers: a classifier exploiting the image feature and a classifier utilizing the semantic feature extracted by the semantic encoder. To compare the semantic feature directly with the image feature, we use a simple softmax classifier as the classification model. In Table 1, we summarize the top-1 classification accuracy of each classifier on test image samples of seen classes. We observe that the semantic feature-based classifier outperforms the image feature-based classifier on all datasets. In particular, for the SUN and CUB datasets, the semantic feature-based classifier achieves about a 2% improvement in top-1 classification accuracy over the image feature-based classifier, which demonstrates that the image classification performance can be enhanced by removing the attribute-irrelevant information in the image feature.

In Fig. 4, we visualize semantic feature samples obtained from the CUB dataset using t-distributed stochastic neighbor embedding (t-SNE), a tool to visualize high-dimensional data in a two-dimensional plane (Van der Maaten and Hinton 2008). For comparison, we also visualize image feature samples and residual feature samples extracted by the residual encoder. We observe that semantic feature samples, which contain only attribute-related information, are well-clustered; that is, samples of the same class are grouped and samples of different classes are separated (see Fig. 4(a)). In contrast, image feature samples of different classes are not sufficiently separated (see Fig. 4(b)) and residual feature samples are scattered randomly (see Fig. 4(c)).
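A minimal sketch of this kind of visualization with scikit-learn and matplotlib is shown below, assuming the feature matrices and class labels are available as NumPy arrays; hyperparameters such as the perplexity are illustrative.

```python
# Project high-dimensional features to 2-D with t-SNE and color by class.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab20")
    plt.title(title)
    plt.show()
```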
We next evaluate the GZSL classification performance of the proposed approach using the standard evaluation protocol presented in (Xian, Schiele, and Akata 2017). Specifically, we measure the average top-1 classification accuracies acc_s and acc_u on seen and unseen classes, respectively, and then use their harmonic mean acc_h as the evaluation metric. In Table 2, we compare SE-GZSL with conventional GZSL approaches (e.g., Gao et al. 2020). From the results, we observe that the proposed SE-GZSL outperforms conventional image feature-based approaches by a large margin. For example, for the AwA2 dataset, SE-GZSL achieves about a 5% improvement in the harmonic mean accuracy over image feature-based approaches. We also observe that SE-GZSL outperforms existing semantic feature-based approaches on all datasets. For example, for the AwA1, AwA2, and CUB datasets, our model achieves about a 2% improvement in the harmonic mean accuracy over the state-of-the-art approaches.

Effectiveness of Loss Functions In training the semantic feature extractor, we have used the MI-based loss L_{MI} and the similarity-based loss L_{sim}. To examine the impact of each loss function, we measure the performance of three different versions of SE-GZSL: 1) SE-GZSL trained only with the reconstruction loss L_{recon}, 2) SE-GZSL trained with L_{recon} and L_{MI}, and 3) SE-GZSL trained with L_{recon}, L_{MI}, and L_{sim}. From the results in Table 3, we observe that the performance of SE-GZSL is greatly enhanced by exploiting the MI-based loss L_{MI}. In particular, for the AwA1 and CUB datasets, we achieve more than a 5% improvement in the harmonic mean accuracy by utilizing L_{MI}; for the AwA2 dataset, we achieve about a 4% improvement. One might notice that when L_{MI} is not used, SE-GZSL performs worse than conventional image feature-based methods (see Table 2). This is because the semantic encoder cannot capture all the attribute-related information without L_{MI}, and thus using the semantic encoder output in the classification incurs a loss of attribute-related information. We also observe that the performance of SE-GZSL is improved further by exploiting the similarity-based loss L_{sim}. For example, for the AwA2 dataset, more than a 3% improvement in the harmonic mean accuracy is achieved by utilizing L_{sim}.

For the semantic feature extraction, we have decomposed the image feature into the attribute-related feature and the attribute-irrelevant feature using the semantic and residual encoders. An astute reader might ask why the residual encoder is needed to extract the semantic feature. To answer this question, we measure the performance of SE-GZSL without the residual encoder. From the results in Table 4, we observe that the GZSL performance of SE-GZSL is degraded when the residual encoder is not used. This is because if the residual encoder is removed, then the attribute-irrelevant information, which is required for the reconstruction of the image feature, would be contained in the semantic encoder output and therefore disrupt the process of learning the relationship between the image feature and the attribute.

In this paper, we presented a new GZSL technique called SE-GZSL. The key idea of the proposed SE-GZSL is to exploit the semantic feature in learning the relationship between the image and the attribute, removing the interference caused by attribute-irrelevant information. To extract the semantic feature, we presented an autoencoder-based image feature decomposition network consisting of semantic and residual encoders. In a nutshell, the semantic and residual encoders capture the attribute-related information and the attribute-irrelevant information, respectively. In training the image feature decomposition network, we used the MI-based loss to encourage the semantic encoder to capture all the attribute-related information and the similarity-based loss to discourage the semantic encoder from capturing any attribute-irrelevant information. Our experiments on various datasets demonstrated that the proposed SE-GZSL outperforms conventional GZSL approaches by a large margin.

References
Akata et al. 2015. Label-Embedding for Image Classification.
Arjovsky, Chintala, and Bottou 2017. Wasserstein Generative Adversarial Networks.
Chao et al. 2016. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild.
Chen et al. 2021. Semantics Disentangling for Generalized Zero-Shot Learning.
Cheng et al. 2020. CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information.
Felix et al. 2018. Multi-Modal Cycle-Consistent Generalized Zero-Shot Learning.
Frome et al. 2013. DeViSE: A Deep Visual-Semantic Embedding Model.
Fujiyoshi, Hirakawa, and Yamashita 2019. Deep Learning-based Image Recognition for Autonomous Driving.
Gao et al. 2020. Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning.
Gulrajani et al. 2017. Improved Training of Wasserstein GANs.
Han, Fu, and Yang 2020. Learning the Redundancy-Free Features for Generalized Zero-Shot Object Recognition.
He et al. 2016. Deep Residual Learning for Image Recognition.
Kingma and Welling 2013. Auto-Encoding Variational Bayes.
Lampert, Nickisch, and Harmeling 2009. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer.
Li et al. 2019. Leveraging the Invariant Side of Generative Zero-Shot Learning.
Li et al. 2021. Generalized Zero-Shot Learning via Disentangled Representation.
Mishra et al. 2018. A Generative Model for Zero Shot Learning Using Conditional Variational Autoencoders.
Ni, Zhang, and Xie 2019. Dual Adversarial Semantics-Consistent Network for Generalized Zero-Shot Learning.
Oord, Li, and Vinyals 2018. Representation Learning With Contrastive Predictive Coding.
Patterson and Hays 2012. SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes.
Ren, Hung, and Tan 2017. A Generic Deep-Learning-based Approach for Automated Surface Inspection.
Ronneberger, Fischer, and Brox 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation.
Schonfeld et al. 2019. Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders.
Simonyan and Zisserman 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition.
Sun et al. 2013. Iris Image Classification Based on Hierarchical Visual Codebook.
Tong et al. 2019. Hierarchical Disentanglement of Discriminative Latent Features for Zero-Shot Learning.
Tschannen et al. 2019. On Mutual Information Maximization for Representation Learning.
Van der Maaten and Hinton 2008. Visualizing Data Using t-SNE.
Vyas, Venkateswara, and Panchanathan 2020. Leveraging Seen and Unseen Semantic Relationships for Generative Zero-Shot Learning.
Welinder et al. 2010. Caltech-UCSD Birds.
Xian et al. 2018. Feature Generating Networks for Zero-Shot Learning.
Xian, Schiele, and Akata 2017. Zero-Shot Learning - the Good, the Bad and the Ugly.
Xian et al. 2019. f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning.
Yuan et al. 2021. Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning.