title: A Review of Generalized Zero-Shot Learning Methods
authors: Pourpanah, Farhad; Abdar, Moloud; Luo, Yuxuan; Zhou, Xinlei; Wang, Ran; Lim, Chee Peng; Wang, Xi-Zhao; Wu, Q. M. Jonathan
date: 2020-11-17
Generalized zero-shot learning (GZSL) aims to train a model for classifying data samples under the condition that some output classes are unknown during supervised learning. To address this challenging task, GZSL leverages semantic information of the seen (source) and unseen (target) classes to bridge the gap between both seen and unseen classes. Since its introduction, many GZSL models have been formulated. In this review paper, we present a comprehensive review of GZSL. Firstly, we provide an overview of GZSL, including its problems and challenges. Then, we introduce a hierarchical categorization of the GZSL methods and discuss the representative methods in each category. In addition, we discuss the available benchmark data sets and applications of GZSL, along with a discussion on the research gaps and directions for future investigations. With recent advances in image processing and computer vision, deep learning (DL) models have achieved extensive popularity due to their capability of providing an end-to-end solution from feature extraction to classification. Despite their success, traditional DL models require training on a massive amount of labeled data for each class, and collecting such large-scale labeled samples is a challenging issue. As an example, ImageNet [1] , which is a large data set, contains 14 million images in 21,814 classes, many of which contain only a few images. In addition, standard DL models can only recognize samples belonging to the classes that have been seen during the training phase, and they are not able to handle samples from unseen classes [2] . In many real-world scenarios, however, there may not be a significant amount of labeled samples for all classes. On one hand, fine-grained annotation of a large number of samples is laborious and requires expert domain knowledge. On the other hand, many categories lack sufficient labeled samples, e.g., endangered birds, or are still being observed, e.g., COVID-19, or are not covered during training but appear in the test phase [3] - [6] . Several techniques for various learning configurations have been developed. One-shot [7] and few-shot [8] learning techniques can learn from classes with only a few training samples. These techniques use the knowledge obtained from data samples of other classes and formulate a classification model for handling classes with few samples. While open set recognition (OSR) [9] techniques can identify whether a test sample belongs to an unseen class, they are not able to predict an exact class label. Out-of-distribution [10] techniques attempt to identify test samples that are different from the training samples. However, none of the above-mentioned techniques can classify samples from unseen classes. In contrast, humans can recognize around 30,000 categories [11] without needing to learn all of them in advance. As an example, a child can easily recognize a zebra if he/she has seen horses previously and knows that a zebra looks like a horse with black and white stripes.
Zero-shot learning (ZSL) [12] , [13] techniques offer a good solution to address this challenge. ZSL aims to train a model that can classify objects of unseen classes (target domain) via transferring knowledge obtained from other seen classes (source domain) with the help of semantic information. The semantic information embeds the names of both seen and unseen classes in high-dimensional vectors. Semantic information can be manually defined attribute vectors [14] , automatically extracted word vectors [15] , context-based embeddings [16] , or their combinations [17] , [18] . In other words, ZSL uses semantic information to bridge the gap between the seen and unseen classes. This learning paradigm resembles how a human recognizes a new object, by measuring the likelihood between its descriptions and previously learned notions [19] . In conventional ZSL techniques, the test set only contains samples from the unseen classes, which is an unrealistic setting that does not reflect real-world recognition conditions. In practice, data samples of the seen classes are more common than those from the unseen ones, and it is important to recognize samples from both sets of classes simultaneously rather than classifying only data samples of the unseen classes. This setting is called generalized zero-shot learning (GZSL) [20] . Indeed, GZSL is a pragmatic version of ZSL. The main motivation of GZSL is to imitate human recognition capabilities, which can recognize samples from both seen and unseen classes. Fig. 1 presents a schematic diagram of GZSL and ZSL. Scheirer et al. [8] were the first to introduce the concept of GZSL in 2013. However, GZSL did not gain traction until 2016, when Chao et al. [20] empirically showed that techniques designed under the ZSL setting cannot perform well under the GZSL setting. This is because ZSL models easily overfit to the seen classes, i.e., they classify test samples from unseen classes as one of the seen classes. Later, Xian et al. [21] , [22] and Liu et al. [23] obtained similar findings with ZSL on image and web-scale video data, respectively. This is mainly because of the strong bias of the existing techniques towards the seen classes, whereby almost all test samples belonging to unseen classes are classified as one of the seen classes. To alleviate this issue, Chao et al. [20] introduced an effective calibration technique, called calibrated stacking, to balance the trade-off between recognizing samples from the seen and unseen classes, which allows learning knowledge about the unseen classes. Since then, the number of techniques proposed under the GZSL setting has increased rapidly, and GZSL has attracted the attention of many researchers. Several comprehensive reviews of ZSL models can be found in the literature [3] , [22] , [24] , [25] . The main differences between our review paper and the previous ZSL surveys [3] , [22] , [24] , [25] are as follows. In [3] , the main focus is on ZSL, and only a few GZSL methods are reviewed. In [22] , the impacts of various ZSL and GZSL methods in different case studies are investigated; several state-of-the-art ZSL and GZSL methods are selected and evaluated using different data sets. However, the work [22] is focused on empirical research rather than being a review of ZSL and GZSL methods. The study in [24] is focused on ZSL, with only a brief discussion (a few paragraphs) on GZSL.
Rezaei and Shahidi [25] studied the importance of ZSL methods for COVID-19 diagnosis (a medical application). None of these works includes an in-depth survey and analysis of GZSL; unlike the aforementioned review papers, in this manuscript we focus on GZSL, rather than ZSL, methods. To fill this gap, we aim to provide a comprehensive review of GZSL in this paper, including the problem formulation, challenging issues, hierarchical categorization, and applications. We review published articles, conference papers, book chapters, and high-quality preprints (e.g., arXiv) related to GZSL, from its rise in popularity in 2016 until early 2021. Inevitably, some recently published studies may have been missed. In summary, the main contributions of this review paper include:
• a comprehensive review of the GZSL methods; to the best of our knowledge, this is the first paper that attempts to provide an in-depth analysis of the GZSL methods;
• a hierarchical categorization of the GZSL methods along with their corresponding representative models and real-world applications;
• an elucidation of the main research gaps and suggestions for future research directions.
This review paper contains six sections. Section 2 gives an overview of GZSL, which includes the problem formulation, semantic information, embedding spaces and challenging issues. Section 3 reviews inductive and semantic transductive GZSL methods, in which a hierarchical categorization of the GZSL methods is provided; each category is further divided into several constituents. Section 4 focuses on the transductive GZSL methods. Section 5 presents the applications of GZSL in various domains, including computer vision and natural language processing (NLP). A discussion of the research gaps and trends for future research, along with concluding remarks, is presented in Section 6. The training phase of GZSL methods can be divided into two broad settings: inductive learning and transductive learning [22] . The inductive setting uses only the seen class information to build a model for recognition. As such, the training set for inductive GZSL can be denoted by $D^{tr} = \{(x_i^s, a_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s \in \mathbb{R}^D$ represents the D-dimensional image (visual) features of the seen classes in the feature space $\mathcal{X}$, which can be obtained using a pre-trained deep model such as ResNet [26] , VGG-19 [27] or GoogLeNet [28] ; $a_i^s \in \mathbb{R}^K$ indicates the K-dimensional semantic representation of the seen classes (i.e., attributes or word vectors) in the semantic space $\mathcal{A}$; $Y^s = \{y_1^s, ..., y_{C_s}^s\}$ indicates the label set of the seen classes in the label space $\mathcal{Y}$, and $C_s$ is the number of seen classes. The transductive setting, in addition to the seen class information $D^{tr}$, has access to the semantic representations and unlabeled visual features $X^u$ of the unseen classes [22] , [29] . If only the semantic representations of the unseen classes are available during the training phase, as is the case in most of the generative-based methods, the setting is called semantic transductive GZSL. Fig. 2 illustrates the main differences among transductive GZSL, semantic transductive GZSL, and inductive GZSL [30] , [31] . Although several frameworks have been developed under transductive learning [32] - [39] , this learning paradigm is less practical. On one hand, it violates the unseen assumption and reduces the challenge of the task. On the other hand, the unseen classes are usually rare, and it is not practical to assume that unlabeled data for all unseen classes are available.
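To make the formulation above concrete, the sketch below shows one way the inductive and transductive training data might be organized in code; the container name GZSLData, the array shapes, and the toy dimensions are illustrative assumptions rather than part of any benchmark protocol.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GZSLData:
    """Illustrative container mirroring the GZSL formulation (names are assumptions)."""
    X_seen: np.ndarray                      # (N_s, D) visual features of seen-class samples
    y_seen: np.ndarray                      # (N_s,)   labels in {0, ..., C_s - 1}
    A_seen: np.ndarray                      # (C_s, K) semantic vectors of seen classes
    A_unseen: np.ndarray                    # (C_u, K) semantic vectors of unseen classes
    X_unseen_unlabeled: Optional[np.ndarray] = None  # (N_u, D), transductive setting only

def make_toy_split(D=2048, K=85, C_s=40, C_u=10, n_per_class=50, transductive=False):
    """Builds a random toy split that mirrors the inductive / transductive setups."""
    rng = np.random.default_rng(0)
    X_seen = rng.normal(size=(C_s * n_per_class, D)).astype(np.float32)
    y_seen = np.repeat(np.arange(C_s), n_per_class)
    A_seen = rng.uniform(size=(C_s, K)).astype(np.float32)
    A_unseen = rng.uniform(size=(C_u, K)).astype(np.float32)
    # In the semantic transductive setting only A_unseen is available; the fully
    # transductive setting additionally exposes unlabeled unseen-class features.
    X_u = rng.normal(size=(C_u * n_per_class, D)).astype(np.float32) if transductive else None
    return GZSLData(X_seen, y_seen, A_seen, A_unseen, X_unseen_unlabeled=X_u)
```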
The label space of the unseen classes is denoted by $Y^u = \{y_1^u, ..., y_{C_u}^u\}$, where $C_u$ is the number of unseen classes, and $\mathcal{Y} = Y^s \cup Y^u$ denotes the union of both seen and unseen classes, where $Y^s \cap Y^u = \emptyset$. Both the inductive and transductive GZSL settings aim to learn a model $f_{GZSL}: \mathcal{X} \rightarrow \mathcal{Y}$ to classify $N_t$ test samples, i.e., $D^{ts} = \{(x_m, y_m)\}_{m=1}^{N_t}$, where $x_m \in \mathbb{R}^D$ and $y_m \in \mathcal{Y}$. To evaluate the performance of the GZSL methods, several indicators have been used in the literature. The accuracy of the seen classes ($Acc_s$) and the accuracy of the unseen classes ($Acc_u$) are two common performance indicators. Chao et al. [20] introduced the area under the seen-unseen accuracy curve (AUSUC) to balance the trade-off between recognizing the seen and unseen classes for calibration-based techniques. This curve can be obtained by varying the calibration factor $\gamma$ in equation (1). Techniques with higher AUSUC values achieve a more balanced performance in GZSL tasks. The harmonic mean (HM) is another performance indicator that is able to measure the inherent bias of GZSL-based methods towards the seen classes: $HM = \frac{2 \times Acc_s \times Acc_u}{Acc_s + Acc_u}$. If a GZSL method is biased towards the seen classes, its $Acc_s$ is higher than $Acc_u$, and consequently the HM score drops [40] . Semantic information is the key to GZSL. Since there are no labelled samples from the unseen classes, semantic information is used to build a relationship between the seen and unseen classes, thus making it possible to perform generalized zero-shot recognition. The semantic information must contain the recognition properties of all unseen classes to guarantee that enough semantic information is provided for each unseen class. It should also be related to the samples in the feature space to guarantee the usability of the semantic information. The idea of using semantic information is inspired by the human recognition capability: humans can recognize samples from unseen classes with the help of semantic information. As an example, a child can easily recognize a zebra if he/she has seen horses previously and knows that a zebra looks like a horse with black and white stripes. Semantic information builds a space that includes both seen and unseen classes, which can be used to perform ZSL and GZSL. The most widely used semantic information for GZSL can be grouped into manually defined attributes [13] , word vectors [41] , or their combinations. Manually defined attributes describe the high-level characteristics of a class (category), such as shape (e.g., circle) and color (e.g., blue), which enable the GZSL model to recognize classes in the world. The attributes are accurate, but they require human annotation effort, which makes them unsuitable for large-scale problems [42] . Wu et al. [43] proposed a global semantic consistency network (GSC-Net) to exploit the semantic attributes of both seen and unseen classes. Lou et al. [44] developed a data-specific feature extractor according to the attribute label tree. Word vectors, in contrast, are automatically extracted from a large text corpus (such as Wikipedia) to represent the similarities and differences between various words and describe the properties of each object. Word vectors require less human labor and are therefore suitable for large-scale data sets. However, they contain noise, which compromises the model performance. As an example, Wang et al. [45] applied Node2Vec to produce conceptualized word vectors. The studies in [46] - [49] attempted to extract semantic representations from noisy text descriptions, and Akata et al.
[50] proposed to extract semantic representations from multiple textual sources. Most GZSL methods learn an embedding/mapping function to associate the low-level visual features of the seen classes with their corresponding semantic vectors. This function can be optimized either via a ridge regression loss [52] , [53] or ranking loss with respect to the compatibility scores of two spaces [40] . Then, the learned function is used to recognize novel classes by measuring the similarity level between the prototype representations and predicted representations of the data samples in the embedding space. As every entry of the attribute vector represents a description of the class, it is expected that the classes with similar descriptions contain a similar attribute vector in the semantic space. However, in the visual space, the classes with similar attributes may have large variations. Therefore, finding such an embedding space is a challenging task, causing visual semantic ambiguity problems. On the one hand, the embedding space can be divided into either Euclidean or non-Euclidean spaces. While the Euclidean space is simpler, it is subject to information loss. The non-Euclidean space, which is commonly based on graph networks, manifold learning, or clusters, usually uses the geometrical relation between spaces to preserve the relationships among the data samples [25] . On the other hand, the embedding space can be categorized into: semantic embedding, visual embedding and latent space embedding [51] . Each of these categories are discussed in the following subsections. Semantic embedding (Fig 3 (a) ) learns a (forward) projection function from the visual space to the semantic space using different constraints or loss functions, and perform classification in the semantic space. The aim is to force semantic embedding of all images belonging to a class to be mapped to some ground-truth label embedding [40] , [54] . Once the best projection function is obtained, the nearest neighbor search can be performed for recognition of a given test image. Visual embedding ( Fig. 3 (b) ) learns a (reverse) projection function to map the semantic representations (back) into the visual space, and perform classification in the visual space. The goal is to make the semantic representations close to their corresponding visual features [52] . After obtaining the best projection function, the nearest neighbor search can be used to recognize a given test image. Both semantic and visual embedding models learn a projection/embedding function from the space of one modality, i.e., visual or semantic, to the space of other modality. However, it is a challenging issue to learn an explicit projection function between two spaces due to the distinctive properties of different modalities. In this respect, latent space embedding (Fig. 3 (c)) projects both visual features and semantic representations into a common space L, i.e., a latent space, to explore some common semantic properties across different modalities [38] , [56] , [57] . The aim is to project visual and semantic features of each class nearby into the latent space. An ideal latent space should fulfill two conditions: (i) intra-class compactness, and (ii) interclass separability [38] . Introduced by Zhang et al. [58] , this mapping aims to overcome the hubness problem of ZSL models, which is discussed in Sub-section 2.4. In GZSL, several challenging issues must be addressed. 
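Before turning to these challenges, the embedding route described above can be illustrated with a minimal sketch: a closed-form ridge regression maps visual features into the semantic (attribute) space, test samples are classified by nearest class prototype over the union of seen and unseen classes, and performance is summarized with the per-class accuracies and harmonic mean introduced earlier. The function names and the choice of cosine similarity are illustrative assumptions, not a specific published method.

```python
import numpy as np

def fit_ridge_embedding(X, A_targets, lam=1.0):
    """Closed-form ridge regression W mapping visual features X (N, D) to the
    attribute vectors A_targets (N, K) of their classes:
    minimizes ||X W - A_targets||^2 + lam * ||W||^2."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ A_targets)  # (D, K)

def predict_gzsl(X_test, W, A_all):
    """Nearest-prototype search in the semantic space over ALL classes
    (seen + unseen), i.e., the GZSL search space."""
    P = X_test @ W                                            # predicted attributes (N, K)
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-12)
    A = A_all / (np.linalg.norm(A_all, axis=1, keepdims=True) + 1e-12)
    return (P @ A.T).argmax(axis=1)                           # cosine-similarity argmax

def per_class_accuracy(y_true, y_pred, classes):
    """Average of per-class accuracies (Acc_s or Acc_u, depending on 'classes')."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

def harmonic_mean(acc_s, acc_u):
    """HM = 2 * Acc_s * Acc_u / (Acc_s + Acc_u)."""
    return 2 * acc_s * acc_u / (acc_s + acc_u + 1e-12)
```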
The hubness problem [58] , [59] is one of the challenging issues of early ZSL and GZSL methods that learn a semantic embedding space and utilize the nearest neighbor search to perform recognition. Hubness is an aspect of the curse of dimensionality that affects nearest neighbor methods: a few samples ("hubs") appear among the k-nearest neighbors of a disproportionately large number of other samples [60] . Dinu et al. [61] observed that a large number of different mapped vectors are surrounded by the same common items, and the presence of such hubs causes problems in high-dimensional spaces. The domain shift problem is another challenging issue of ZSL and GZSL methods. On one hand, the visual space and the semantic space are two different spaces. On the other hand, the data samples of the seen and unseen classes are disjoint, unrelated for some classes, and differently distributed, resulting in a large domain gap. Thus, learning an embedding space using data samples from the seen classes without any adaptation to the unseen classes causes the domain shift problem [55] , [56] , [62] . This problem is more challenging in GZSL, due to the existence of the seen classes during prediction. Fig. 4 shows some examples of ideal and practical mappings. This problem is more common in inductive-based methods, as they have no access to the unseen class data during training. To overcome this problem, inductive-based methods incorporate additional constraints or information from the seen classes. Besides that, several transductive-based methods have been developed to alleviate the domain shift problem [33] - [36] ; these methods use the manifold information of the unseen classes for learning. Since GZSL methods use the data samples from the seen classes to learn a model to perform recognition for both seen and unseen classes, they are usually biased towards the seen classes, leading to misclassification of data from the unseen classes into the seen classes (see Fig. 5 ), and most ZSL methods cannot effectively solve this problem [63] . To mitigate this issue, several strategies have been proposed, such as calibrated stacking [20] , [64] and novelty detectors [15] , [65] - [67] . The calibrated stacking [20] method balances the trade-off between recognizing data samples from both seen and unseen classes using the following formulation:

$\hat{y} = \arg\max_{c \in \mathcal{Y}} \; f_c(x) - \gamma \, \mathbb{I}[c \in Y^s], \qquad (1)$

where $f_c(x)$ is the score of class c for sample x, $\gamma$ is a calibration factor and $\mathbb{I}[\cdot] \in \{0, 1\}$ indicates whether c is from the seen classes or otherwise. In fact, $\gamma$ can be interpreted as the prior likelihood of a sample belonging to the unseen classes. When $\gamma \rightarrow -\infty$, the classifier will classify all data samples into one of the seen classes, and vice versa. Le et al. [68] proposed to find an optimal $\gamma$ that balances the trade-off between the accuracy of the seen and unseen classes. Later, several studies used the calibrated stacking technique to solve the GZSL problem [64] , [69] - [72] . Similar to calibrated stacking, scaled calibration [73] and probabilistic representation [74] , [75] have been proposed to balance the trade-off between the seen and unseen classes. The studies in [2] , [76] make the unseen classes more confident and the seen classes less confident using temperature scaling [77] .
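Equation (1) translates directly into a few lines of code. The sketch below assumes a score matrix with one column per class and a boolean mask marking the seen classes; sweeping γ and recording the resulting seen/unseen accuracies also gives a rough approximation of the AUSUC measure mentioned earlier. Function names and the trapezoidal approximation are illustrative choices.

```python
import numpy as np

def calibrated_stacking(scores, seen_mask, gamma):
    """Calibrated stacking (Chao et al. [20]): subtract the calibration factor
    gamma from the scores of the seen classes before taking the argmax.

    scores:    (N, C) compatibility scores over all classes (seen + unseen)
    seen_mask: (C,) boolean array, True where the class is a seen class
    gamma:     calibration factor; larger values favor the unseen classes
    """
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

def ausuc(scores, y_true, seen_mask, seen_classes, unseen_classes, gammas):
    """Approximates AUSUC by sweeping gamma and integrating the resulting
    (Acc_u, Acc_s) curve with the trapezoidal rule."""
    pairs = []
    for g in gammas:
        pred = calibrated_stacking(scores, seen_mask, g)
        acc_s = np.mean([np.mean(pred[y_true == c] == c) for c in seen_classes])
        acc_u = np.mean([np.mean(pred[y_true == c] == c) for c in unseen_classes])
        pairs.append((acc_u, acc_s))
    pairs.sort()
    xs, ys = zip(*pairs)
    return np.trapz(ys, xs)
```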
Detectors aim to identify whether a test sample belongs to the seen or the unseen classes. This strategy limits the set of possible classes by providing information about which set (seen or unseen) a test sample belongs to. Socher et al. [15] considered that the unseen classes are projected out-of-distribution (OOD) with respect to the seen ones. Then, data samples from the unseen classes are treated as outliers with respect to the distribution of the seen classes. Bhattacharjee [65] developed an auto-encoder-based framework to identify the set of possible classes. To achieve this, additional information, i.e., the correct class information, is imposed on the decoder to reconstruct the input samples. Later, entropy-based [67] , probabilistic-based [66] , [78] , distance-based [79] , cluster-based [80] and parametric novelty detection [43] approaches have been developed to detect OOD samples, i.e., samples from the unseen classes. Felix et al. [81] learned a discriminative model in the latent space to identify whether a test sample belongs to a seen or an unseen class. Geng et al. [82] decomposed GZSL into open set recognition (OSR) [9] and ZSL tasks.

Fig. 6: The embedding-based methods (a) learn an embedding space to project the visual and semantic features of the seen classes into a common space; the learned embedding space is then used to perform recognition. In contrast, the generative-based methods (b) learn a generative model based on samples of the seen classes conditioned on their semantic features; the learned model is then used to generate visual features for the unseen classes using the semantic features of the unseen classes.

The main idea of GZSL is to classify objects of both seen and unseen classes by transferring knowledge from the seen classes to the unseen ones through semantic representations. To achieve this, two key issues must be addressed: (i) how to transfer knowledge from the seen classes to the unseen ones; and (ii) how to learn a model to recognize images from both seen and unseen classes without having access to labeled samples of the unseen classes [2] . In this regard, many methods have been proposed, which can be broadly categorized into:
• Embedding-based methods: learn an embedding space to associate the low-level visual features of the seen classes with their corresponding semantic vectors. The learned projection function is used to recognize novel classes by measuring the similarity level between the prototype representations and the predicted representations of the data samples in the embedding space (see Fig. 6 (a)).
• Generative-based methods: learn a model to generate images or visual features for the unseen classes based on the samples of the seen classes and the semantic representations of both classes. By generating samples for the unseen classes, a GZSL problem can be converted into a conventional supervised learning problem (see Fig. 6 (b)). Based on a single homogeneous process, a model can then be trained to classify test samples belonging to both seen and unseen classes and alleviate the bias problem. Although these methods perform recognition in the visual space and could be categorized as visual embedding models, we treat them separately from the embedding-based methods.
A hierarchical categorization of both methods, together with their sub-categories, is provided in Fig. 7.

Fig. 7: The taxonomy of GZSL models.

In recent years, various embedding-based methods have been used to formulate frameworks to tackle GZSL problems. These methods can be divided into graph-based, attention-based, autoencoder-based, meta learning, compositional learning and bidirectional learning methods, as shown in Fig. 7. In the following subsections, we review each of these categories and provide a summary of these methods in Table 4. Graphs are useful for modelling a set of objects with a data structure consisting of nodes and their relationships (edges) [83] .
Graph learning leverages machine learning techniques to extract relevant features, mapping the properties of a graph into a feature vector with the same dimensions in the embedding space. Machine learning techniques convert graph-based properties into a set of features without projecting the extracted information into a lower-dimensional space [84] . Generally, each class is represented as a node in a graph-based method. Each node is connected to other nodes (i.e., classes) through edges that encode their relationships. The geometric structure of the features in the latent space is preserved in a graph, leading to a compact representation of richer information as compared with other techniques [56] . Nonetheless, learning a classifier using structured information and complex relationships without visual examples for the unseen classes is a challenging issue, and the use of graph-based information increases the model complexity. Recently, graph learning techniques have been shown to be an effective paradigm for GZSL [56] , [85] - [89] . For example, the shared reconstructed graph (SRG) [56] (Fig. 8) uses the cluster center of each class to represent the class prototype in the image feature space, and reconstructs each semantic prototype as a linear combination of the other semantic prototypes, where $b_k \in \mathbb{R}^{K \times 1}$ contains the reconstruction coefficients and $k = 1, ..., C_s + C_u$. After learning the relationships among the classes, the shared reconstruction coefficients between the two spaces are used to synthesize image prototypes for the unseen classes. SRG contains a sparsity constraint, which enables the model to divide the classes into many clusters of different subspaces. A regularization term is adopted to select fewer and more relevant classes during the reconstruction process. The reconstruction coefficients are shared, in order to transfer knowledge from the semantic prototypes to the image prototypes. In addition, the unseen semantic embedding method is used to mitigate the domain shift problem, in which the graph of the seen image prototypes is adopted to alleviate the space shift problem.

Fig. 8: A schematic of SRG [56]. Firstly, the image prototype f_k and the semantic prototype e_k of each class are reconstructed using the cluster center and equation (3), respectively. Then, the shared reconstruction coefficients between the two spaces are learned. Finally, the learned SRG is used to synthesize class prototypes for the unseen classes to perform prediction.

AGZSL [85] , i.e., asymmetric graph-based ZSL, combines the class-level semantic manifold with the instance-level visual manifold by constructing an asymmetric graph. In addition, a constraint is imposed to project the visual and attribute features orthogonally when they belong to different classes. The studies in [45] , [90] - [92] exploit the graph convolutional network (GCN) [93] to transfer knowledge among different categories. In [45] , [90] , a GCN is applied to generate super-classes in the semantic space. The cosine distance is used to minimize the distance between the visual features of the seen classes and the corresponding semantic representations, and a triplet margin loss function is optimized to avoid the hubness problem. Discriminative anchor generation and distribution alignment (DAGDA) [91] uses a diffusion-based GCN to generate anchors for each category. Specifically, a semantic relation regularization is derived to refine the distribution in the anchor space. To mitigate the hubness problem, two auto-encoders are employed to keep the original information of both features in the latent space. Besides that, Xie et al. [92] devised an attention technique to find the most important regions of the image and then used these regions as nodes to represent the graph.
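As a minimal illustration of the graph-based idea, the sketch below treats classes as nodes, builds an adjacency matrix from the similarity of their semantic vectors, and propagates class embeddings with GCN-style normalized aggregation. The similarity threshold, random weights and two propagation layers are illustrative simplifications, not the recipe of any particular method cited above.

```python
import numpy as np

def class_graph_propagation(A_all, sim_threshold=0.8, n_layers=2):
    """Propagates class semantic vectors over a class graph, GCN-style.

    A_all: (C, K) semantic vectors of all classes (seen + unseen), used both
           to build the graph and as the initial node features.
    """
    # Build an adjacency matrix from cosine similarity between class vectors.
    A_norm = A_all / (np.linalg.norm(A_all, axis=1, keepdims=True) + 1e-12)
    sim = A_norm @ A_norm.T
    adj = (sim >= sim_threshold).astype(np.float32)
    np.fill_diagonal(adj, 1.0)                       # self-loops

    # Symmetric normalization D^{-1/2} A D^{-1/2}, as in GCNs [93].
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    adj_hat = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Linear propagation layers (random weights stand in for learned ones).
    rng = np.random.default_rng(0)
    H = A_all.astype(np.float32)
    for _ in range(n_layers):
        W = rng.normal(scale=0.1, size=(H.shape[1], H.shape[1])).astype(np.float32)
        H = np.maximum(adj_hat @ H @ W, 0.0)         # ReLU
    return H  # propagated class embeddings, usable as class prototypes
```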
Meta learning, which is also known as learning to learn, is a family of learning paradigms that learns from other learning algorithms. It aims to extract transferable knowledge from a set of auxiliary tasks, in order to devise a model while avoiding the overfitting problem. The underlying principle of a meta learning method helps identify the best learning algorithm for a specific data set. Meta learning improves the performance of learning algorithms by changing some aspects according to the experimental results and by optimizing the number of experiments. Several studies [94] - [100] utilized the meta learning strategy to solve GZSL problems. Meta learning based GZSL methods divide the training classes into two sets, i.e., support and query, which correspond to the seen and unseen classes. Different tasks are trained by randomly selecting the classes from both the support and query sets. This mechanism helps meta learning methods to transfer knowledge from the seen to the unseen classes, therefore alleviating the bias problem [99] . Sung et al. [94] trained an auxiliary parameterized network using a meta learning method. The aim is to parameterize a feedforward neural network for GZSL. Specifically, a relation module is devised to compute the similarity metric between the output of a cooperation module and the feature vector of a data sample. Then, the learned function is used for recognition. Introduced in [95] , the meta-learning method consists of a task module to provide an initial prediction and a correction module to update the initial prediction. Various task modules are designed to learn different subsets of training samples. Then, the prediction from a task module is updated through training of the correction module. Other examples of generative-based meta-learning modules for GZSL are also available, e.g., [97] - [100] . Unlike other methods that learn an embedding space between global visual features and semantic vectors, attention-based methods focus on learning the most important image regions. In other words, the attention mechanism seeks to add weights into deep learning models as trainable parameters to augment the most important parts of the input, e.g., sentences and images. In general, attention-based methods are effective for identifying fine-grained classes because these classes contain discriminative information only in a few regions. Following the general principle of an attention-based method, an image is divided into many regions, and the most important regions are identified by applying an attention technique [74] , [101] . One of the major advantages of the attention mechanism is its capability to recognize the information in an input that is pertinent to performing a task, leading to improved results. On the other hand, the attention mechanism generally increases the computational load, affecting the real-time implementation of attention-based methods. The dense attention zero-shot learning (DAZLE) model [74] (Fig. 9) obtains visual features by focusing on the most relevant regions pertaining to each attribute. Then, an attribute embedding technique is devised to align each obtained attribute feature with its corresponding attribute semantic vector. A similar approach is applied to solve multi-label GZSL problems in [102] .

Fig. 9: A schematic of DAZLE [74]. After extracting the image features of the R regions, the attention features of all attributes are computed using a dense attention mechanism. Then, the attention features are aligned with the attribute semantic vectors, in order to compute the scores of the attributes in the image.
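The dense-attention computation described for DAZLE-style models can be sketched as follows: for every attribute, attention weights over R region features produce an attribute-focused visual feature, which is then scored against the attribute's semantic vector. The dimensions, the projection matrix W_align and the dot-product scoring are illustrative assumptions rather than the exact design of [74].

```python
import torch

def attribute_dense_attention(region_feats, attr_vecs, W_align):
    """Generic dense attention over image regions, one attention map per attribute.

    region_feats: (R, D)  features of R image regions
    attr_vecs:    (A, K)  semantic vectors of A attributes
    W_align:      (K, D)  projects attribute vectors into the visual space
    Returns per-attribute presence scores of shape (A,).
    """
    queries = attr_vecs @ W_align                           # (A, D)
    attn = torch.softmax(queries @ region_feats.T, dim=1)   # (A, R) attention over regions
    attr_visual = attn @ region_feats                        # (A, D) attribute-focused features
    return (attr_visual * queries).sum(dim=1)                # (A,) attribute scores

# Class scores can then be obtained by matching the attribute scores against each
# class's attribute signature (e.g., a dot product with the class attribute vector).
R, D, A, K = 49, 2048, 85, 300
scores = attribute_dense_attention(torch.randn(R, D), torch.randn(A, K), torch.randn(K, D))
```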
The attentive region embedding network (AREN) [101] incorporates an attentive compressed second-order embedding mechanism to automatically discover the discriminative regions and capture second-order appearance differences. In [103] , semantic representations are used to guide the visual features to generate an attention map. In [31] , an embedding space is formulated by measuring the focus ratio vector for each dimension using a self-focus mechanism. Zhu et al. [29] used a multi-attention model to map the visual features into a finite set of feature vectors. Each feature vector is modeled as a C-dimensional Gaussian mixture model with isotropic components. Using these low-dimensional embedding mechanisms allows the model to focus on the most important regions of the image as well as remove irrelevant features, therefore reducing the semantic-visual gap. In addition, a visual oracle is proposed for GZSL to reduce noise and provide information on the presence/absence of classes. The gaze estimation module (GEM) [104] , which is inspired by the gaze behavior of humans, i.e., paying attention to the parts of an object with discriminative attributes, aims to improve the localization of discriminative attributes. In [105] , the localized attributes are used for projecting the local features into the semantic space. It exploits a global average pooling scheme as an aggregation mechanism to further reduce the bias (since the obtained patterns for the unseen classes are similar to those of the seen ones) and improve localization. In contrast, the semantic-guided multi-attention (SGMA) localization model learns from both global and local features to provide a richer visual expression. Liu et al. [107] used a graph attention network [108] to generate an optimized attribute vector for each class. Ji et al. [109] proposed a semantic attention network that selects the same number of visual samples from each training class to ensure that each class contributes equally during each training iteration. In addition, the model proposed in [110] searches for a combination of semantic representations to separate one class from the others. On the other hand, Paz et al. [49] designed visually relevant language [111] to extract sentences relevant to the objects from noisy texts. Compositional learning (CL) aims to learn a model that can recognize unseen compositions of known objects, e.g., fish and cat, and primitive states, e.g., cute and old [112] , [113] . Recently, the concept of CL has been applied to ZSL, known as compositional ZSL. Kato et al. [114] introduced a framework to recognize zero-shot human actions. A GCN is used to build an external knowledge graph based on subject, verb and object (SVO) triplets [115] extracted from knowledge bases, in order to record a large range of human-object interactions. Each node in the graph indicates a noun (object) or a verb (motion), with a word embedding as its feature. Each action node, which is represented by an SVO triplet, propagates information along the graph to learn its representations. Finally, both the visual features and the learned graph are jointly projected into a latent space for zero-shot recognition of human actions.
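The core compositional idea can be sketched as follows: the word vectors of a primitive state and an object are composed by a small network into a visual-space embedding and scored against the image feature, so that state-object pairs never observed together during training can still be scored. The composition network and dimensions are illustrative assumptions rather than the design of any cited model.

```python
import torch
import torch.nn as nn

class PairComposer(nn.Module):
    """Composes (state, object) word vectors into a visual-space embedding."""
    def __init__(self, word_dim=300, feat_dim=2048, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * word_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, state_vec, object_vec):
        return self.mlp(torch.cat([state_vec, object_vec], dim=-1))

def score_compositions(image_feat, state_vecs, object_vecs, composer):
    """Scores one image feature against every (state, object) pair,
    including pairs never observed together during training.

    image_feat:  (feat_dim,)  state_vecs: (S, word_dim)  object_vecs: (O, word_dim)
    Returns an (S, O) score matrix; the argmax gives the predicted composition.
    """
    S, O = state_vecs.size(0), object_vecs.size(0)
    states = state_vecs.unsqueeze(1).expand(S, O, -1)      # (S, O, word_dim)
    objects = object_vecs.unsqueeze(0).expand(S, O, -1)    # (S, O, word_dim)
    pair_emb = composer(states, objects)                    # (S, O, feat_dim)
    return (pair_emb * image_feat).sum(dim=-1)              # dot-product compatibility
```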
The task-driven modular network (TMN) [116] employs a modular structure to transfer concepts in the high-level semantic spaces of CNNs and extracts features that are related to all members of the input triplet to determine the jointcompatibility among visual features and object attributes. Sylvain et al. [117] empirically highlighted the importance of focusing on local image regions (via attention) as well as combining local knowledge to recognize the unseen classes. Their work has been further supported by subsequent studies [118] , [119] . This category leverages the bidirectional projections to fully utilize information in data samples and learn more generalizable projections, in order to differentiate between the seen and unseen classes [51] , [71] , [120] , [121] . In [120] , the visual and semantic spaces are jointly projected into a shared subspace, and then each space is reconstructed through a bidirectional projection learning. The dual-triplet network (DTNet) [121] uses two triplet modules to construct a more discriminative metric space, i.e., one considers the attribute relationship between categories by learning a mapping from the attribute space to visual space, while another considers the visual features. Guo et al. [71] considered a dual-view ranking by introducing a loss function that jointly minimizes the image view labels and label-view image rankings. This dual-view ranking scheme enables the model to train a better image-label matching model. Specifically, the scheme ranks the correct label before any other labels for a training sample, while the label-view image ranking aims to rank the respective images to their correct classes before considering the images from other classes. In addition, a density adaptive margin is used to set a margin based on the data samples. This is because the density of images varies in different feature spaces, and different images can have different similarity scores. The model proposed in [51] consists of two parts: (i) visual-label activating that learns a embedding space using regression; (ii) semantic-label activating that learns a projection function from the semantic space to the label space. In addition, a bidirectional reconstruction constraint between the semantic and labels is added to alleviate the projection shift problem. Zhang et al. [30] explained a class level overfitting problem. It is related to parameter fitting during training without prior knowledge about the unseen classes. To solve this problem, a triple verification network (TVN) is used for addressing GZSL as a verification task. The verification procedure aims to predict whether a pair of given samples belongs to the same class or otherwise. The TVN model projects the seen classes into an orthogonal space, in order to obtain a better performance and a faster convergence speed. Then, a dual regression (DR) method is proposed to regress both visual features and semantic representations to be compatible with the unseen classes. Autoencoders (AEs) are unsupervised learning techniques that leverage NNs for representation learning. They learn how to compress/encode the data firstly. Then, they learn how to reconstruct the data back into a representation as close to the original data as possible. In other words, AEs exploit an encoder to learn an embedding space and then employ a decoder to reconstruct the inputs. The main advantage of AEs is that they can be trained in an unsupervised manner. AEs have been widely used to solve GZSL problem. 
To achieve this, a decoder can be imposed by an additional constraint for learning different mappings [122] . The framework proposed by Biswas and Annadani [123] integrates the similarity level, e.g., cosine distance, into an objective function to preserve the relations. Latent space encoding (LSE) [124] explores some common semantic characteristics between different modalities and connects them together. For each modality, an encoder-decoder framework is exploited, in which an encoder is used to decompose the inputs as a latent representation while a decoder is employed to reconstruct the inputs. Product quantization ZSL (PQZSL) [125] defines an orthogonal common space, which learns a codebook, to project the visual features and semantic representations into a common space using a center loss function [126] and an auto-encoder, respectively. The orthogonal common space enables the classes to be more discriminative, thus the model can achieve better performances. In addition, PQZSL compresses the visual features into compact codes using quantizers to approximate the nearest neighbors. In adition, study [127] adopts two variational auto-encoder (VAE) models to learn the cross-modal latent features from the visual and semantic spaces, respectively. Cross alignment and distribution alignment strategies are devised to match the features from different spaces. Besides the aforementiond categories, several studies employ various strategies to tackle GZSL problems. In [73] , [106] , [128] , [129] , the devised methods take into account the inter-class and intra-class relations among different classes. Das and Lee [73] minimized the discrepancy between the semantic representations and visual features using the least square loss method. Rational matrices are constructed for each space, in order to minimize the inter-class pairwise relations between two spaces. To avoid the domain shift problem, a point-to-point correspondence between the semantic representations and test samples is found. The transferable constructive network (TCN) [129] consists of two parts: information fusion and constructive learning. For information fusion, both visual features and semantic representations are encoded into a latent space. Constructive learning checks how well an image is consistent with respect to a class by considering two aspects. The first is whether learning is discriminative enough to recognize different classes, which uses semantic representations of the seen classes for supervision. The second is whether learning is transferable to the unseen classes based on the class similarity score between visual features of the seen and unseen classes. In [130] , the joint optimization of center loss and softmax loss functions are adopted to learn more discriminative visual features for different classes while minimizing the intra-class variations. Besides that, the performance of GZSL can be improved by rectifying the model output. The studies in [131] , [132] devise a dictionary-based framework for GZSL. The model proposed in [131] jointly aligns the visual-semantic structure to constructs a class structure between the visual and semantic spaces by obtaining the class prototypes in both spaces. The aim is to explore some bases in each space to represent each class. A domain adaptation is proposed to learn the prototypes from both seen and unseen classes in the visual space, in order to alleviate the problem of domain shift. 
On the other hand, the study in [132] incorporates a sparse coding technique to construct a dictionary for each projection, i.e., visual-latent and semantic-latent, and applies an orthogonal projection to make the model discriminative. Since the information obtained from the semantic representations, in particular human-defined attributes, is limited and less discriminative, recognizing data samples from different classes in a specific domain is difficult. The studies in [63] , [131] , [133] , [134] aim to address this issue. The study in [63] decomposes the semantic vectors of the training set into K subsets using the k-means clustering algorithm and projects them to the visual space. This decomposition allows the model to construct a uniform embedding space with a large local relative distance. Similar to [63] , Li et al. [89] used super-classes, which are generated by data-driven clustering across the seen and unseen class domains, to align the two domains. Jin et al. [133] projected the semantic vector of each class to a high-order attribute space by applying a Gaussian random projection. The model proposed in [134] defines a discriminative label space and projects the visual features into that space via a linear projection matrix. Specifically, it fixes the labels of the seen classes and uses the attributes to map the labels of the unseen classes into the label space. Generative-based methods were originally designed to generate examples from existing ones to train DL models [135] and to compensate for imbalanced classification problems [136] . Recently, these methods have been adopted to generate samples (i.e., images or visual features) for the unseen classes by leveraging their semantic representations. The generated data samples are required to satisfy two conflicting conditions: (i) being semantically related to the real samples, and (ii) being discriminative, so that the classification algorithm can classify the test samples easily. To satisfy the former condition, some underlying parametric representations can be used, while a classification loss function can be used to satisfy the second condition [137] . Generative adversarial networks (GANs) [138] and variational autoencoders (VAEs) [139] are two prominent families of generative models that have achieved good results in GZSL. In the following sub-sections, a review of generative-based methods under the semantic transductive learning setting is presented. A summary of this category is provided in Table 5. GANs generate new data samples by modelling the joint distribution $p(y, x)$ of the samples through the class-conditional density $p(x|y)$ and the class prior probability $p(y)$. For GZSL, a GAN consists of a generator $G_{SV}: \mathcal{Z} \times \mathcal{A} \rightarrow \mathcal{X}$, which uses the semantic attributes $\mathcal{A}$ and random uniform or Gaussian noise $z \in \mathcal{Z}$ to generate a visual feature $\tilde{x} \in \mathcal{X}$, and a discriminator $D_V: \mathcal{X} \times \mathcal{A} \rightarrow [0, 1]$, which distinguishes real visual features from the generated ones. When a generator learns to synthesize data samples for the seen classes conditioned on their semantic representations $A^s$, it can be used to generate data samples for the unseen classes through their semantic representations $A^u$. However, the original GAN models are difficult to train, and there is a lack of variety in the generated samples. In addition, mode collapse is a common issue in GANs, as there are no explicit constraints in the learning objective. To overcome these issues and stabilize the training procedure, many GAN models with alternative objective functions have been developed.
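The generator and discriminator described above can be written as small conditional networks. The following PyTorch sketch shows a generic conditional pair; the layer sizes, activations and default dimensions are illustrative choices only.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """G(z, a): noise + class semantic vector -> synthetic visual feature."""
    def __init__(self, z_dim=128, attr_dim=85, feat_dim=2048, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + attr_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim), nn.ReLU(),  # non-negative, like CNN features
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

class ConditionalDiscriminator(nn.Module):
    """D(x, a): visual feature + class semantic vector -> realness score."""
    def __init__(self, attr_dim=85, feat_dim=2048, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + attr_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),   # unbounded critic output (WGAN-style)
        )

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))
```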
Wasserstein GAN (WGAN) [140] mitigates the mode collapse issue by using the Wasserstein distance as the objective function. It applies a weight clipping method on the discriminator to enforce the Lipschitz constraint. The improved WGAN model [141] reduces the negative effects of weight clipping by utilizing a gradient penalty instead. Xian et al. [142] devised a conditional WGAN model with a classification loss, known as f-CLSWGAN (see Fig. 10), to synthesize visual features for the unseen classes. To synthesize the related features, the semantic feature is integrated into both the generator and the discriminator by minimizing the following loss function:

$L_{WGAN} = \mathbb{E}[D(x^s, a^s)] - \mathbb{E}[D(\tilde{x}^s, a^s)] - \lambda \, \mathbb{E}\big[(\|\nabla_{\hat{x}} D(\hat{x}, a^s)\|_2 - 1)^2\big], \qquad (4)$

where $\tilde{x}^s = G(z, a^s)$, $\hat{x} = \alpha x^s + (1 - \alpha)\tilde{x}^s$ with $\alpha \sim U(0, 1)$, $\lambda$ is the penalty factor, and the discriminator $D: \mathcal{X} \times \mathcal{C} \rightarrow \mathbb{R}$ is a multi-layer perceptron. In (4), the first and second terms approximate the Wasserstein distance, while the last term is the gradient penalty. In addition, to generate discriminative features, the negative log-likelihood is used as the classification loss, i.e.,

$L_{CLS} = -\mathbb{E}_{\tilde{x}^s}\big[\log P(y^s \mid \tilde{x}^s; \theta)\big],$

where $P(y^s|\tilde{x}^s; \theta)$ is the probability that $\tilde{x}^s$ is predicted as class $y^s$, estimated by a softmax classifier parameterized by $\theta$ and optimized on the visual features of the seen classes. The final objective function can be written as

$\min_G \max_D \; L_{WGAN} + \beta L_{CLS},$

where $\beta$ is a weighting factor.
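The objective above corresponds to the following training-step sketch, which assumes conditional generator/critic modules such as those sketched earlier and a softmax classifier pre-trained on real seen-class features; the optimizer loop, the detaching conventions and the hyper-parameter values are illustrative simplifications of f-CLSWGAN-style training.

```python
import torch
import torch.nn.functional as F

def fclswgan_style_losses(G, D, clf, x_real, attrs, labels, z_dim=128, lam=10.0, beta=0.01):
    """Evaluates WGAN-GP critic and generator losses plus the classification term.

    G, D:    conditional generator and critic (e.g., the modules sketched above)
    clf:     softmax classifier over the seen classes, pre-trained on real features
    x_real:  (B, feat_dim) real visual features of seen-class samples
    attrs:   (B, attr_dim) semantic vectors of the corresponding classes
    labels:  (B,) class indices used by the classification loss
    """
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)
    x_fake = G(z, attrs)

    # Wasserstein terms: the critic scores real features high and generated ones low.
    d_real = D(x_real, attrs).mean()
    d_fake_detached = D(x_fake.detach(), attrs).mean()

    # Gradient penalty on interpolates x_hat = alpha * x_real + (1 - alpha) * x_fake.
    alpha = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (alpha * x_real + (1.0 - alpha) * x_fake.detach()).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat, attrs).sum(), x_hat, create_graph=True)[0]
    grad_penalty = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

    # Critic minimizes the negated Wasserstein estimate plus the penalty; the
    # generator fools the critic while keeping generated features classifiable.
    critic_loss = -(d_real - d_fake_detached) + lam * grad_penalty
    cls_loss = F.cross_entropy(clf(x_fake), labels)        # negative log-likelihood term
    gen_loss = -D(x_fake, attrs).mean() + beta * cls_loss
    return critic_loss, gen_loss
```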
Since then, the conditional GAN (CGAN) has been combined with different strategies to generate discriminative visual features for the unseen classes, including an optimal transport-based approach [143] ; TFGNSCS [144] , an extended version of f-CLSWGAN that considers transfer information; a meta-learning approach [97] based on model-agnostic meta-learning [145] ; LisGAN [146] , which focuses on extracting information from soul samples; SP-GAN [147] , which devises a similarity-preserving loss together with a classification loss; semantic rectifying GAN (SR-GAN) [148] , which employs the semantic rectifying network (SRN) [149] to rectify features; CIZSL [150] , [151] , which is inspired by the human creativity process [152] ; MGA-GAN [153] , which uses multi-graph similarity; and MKFNet-NFG [154] , which includes an adaptive fusion module based on the attention mechanism. Xie et al. [155] proposed cross knowledge learning and taxonomy regularization to train more relevant semantic features and more generalized visual features, respectively. In addition, Bucher et al. [137] developed four different conditional generative models, i.e., the generative moment matching network (GMMN) [156] , AC-GAN [157] , the denoising auto-encoder [158] , and the adversarial auto-encoder [159] , to generate data samples for the unseen classes. The empirical results indicate that GMMN outperforms the other methods. However, generative-based models synthesize highly unconstrained visual features for the unseen classes, which can produce synthetic samples that are far from the actual distribution of real visual features. To alleviate this issue and address the unpaired training issue during visual feature generation, Verma et al. [59] proposed a cycle consistency loss (discussed further in the next subsection) to ensure that the generated visual features map back into their respective semantic space. Using this feedback mechanism allows the model to generate more discriminative visual features. In addition, reconstructing the generated features, which are unlabeled, enables the model to operate in a semi-supervised setting. Later, the studies in [160] - [165] devised a cycle consistency loss term in their objective functions. Specifically, DASCN [160] , which is a dual learning model, combines a classification loss function with a semantics-consistency adversarial loss function. In the cycle-consistent adversarial network for ZSL (CANZSL) [162] , visual features are first synthesized from noisy text; an inverse adversarial network is then adopted to convert the generated features back into text, in order to ensure that the synthesized visual features accurately reflect the semantic representations. In addition, in the studies [163] - [165] , a multi-modal cycle consistency loss function is developed to preserve the semantic consistency of the generated visual features. AFC-GAN [164] introduces a boundary loss function to maximize the decision boundary between the seen and unseen features. Boomerang-GAN [165] uses a bidirectional auto-encoder to assess the reconstruction loss between the generated features and the semantic embeddings. Moreover, two-level adversarial visual-semantic coupling (TACO) [166] maximizes the joint likelihood of visual and semantic features by augmenting a generative network with an inference network. This joint learning allows the model to better capture the underlying modes of the data distribution. VAE models identify the relationship between a data sample x and the distribution of its latent representation z. A parameterized distribution $q_\Phi(z|x)$ is derived to approximate the posterior probability $p(z|x)$. Similar to GANs, VAE models consist of two components: (i) an encoder $q_\Phi(z|x)$ with parameters $\Phi$, and (ii) a decoder $p_\theta(x|z)$ with parameters $\theta$. The encoder maps a sample x to the latent space z with respect to its class c, while the decoder maps the latent code back to the sample space. The conditional VAE (CVAE) [167] synthesizes a sample $\tilde{x}$ with certain properties by maximizing the lower bound of the conditional likelihood $p(x|a)$, i.e.,

$\log p(x|a) \geq \mathbb{E}_{q_\Phi(z|x,a)}\big[\log p_\theta(x|z,a)\big] - KL\big(q_\Phi(z|x,a) \,\|\, p(z|a)\big).$

The studies in [168] , [169] use CVAE-based frameworks to generate visual features for GZSL. CVAE-ZSL [168] (Fig. 11) adopts a neural network to model both the encoder and the decoder. During training, the encoder computes $q(z|x_i, a_{y_i}) = N(\mu_{x_i}, \Sigma_{x_i})$ for each training sample $x_i$. Then, a latent code z is sampled from $N(\mu_{x_i}, \Sigma_{x_i})$. Next, the decoder is employed to reconstruct x from z and $a_y$, with a loss in which the $L_2$ norm between x and the reconstructed sample $\hat{x}$ measures the reconstruction error.

Fig. 11: A schematic view of CVAE-ZSL [168]. The concatenated visual features and semantic representations are passed through dense, dropout and dense layers. Another dense layer is then used to output $\mu_z$ and $\Sigma_z$. Next, a latent code z is sampled from $N(\mu_{x_i}, \Sigma_{x_i})$ and projected back to the image space for reconstruction.

In addition, several bi-directional CVAE methods map the generated visual features back to their semantic features to preserve the consistency of both features and produce high-quality examples for the unseen classes. In this regard, Verma et al. [59] devised a cycle consistency loss function. Their method, SE-GZSL, is equipped with a discriminator-driven feedback mechanism that maps a real sample x or a generated sample $\tilde{x}$ back to the corresponding semantic representation. The overall loss of this discriminator combines a supervised loss $L_{Sup}$, which learns from the labeled examples, with an unsupervised loss $L_{Unsup}$, which learns from the unlabeled generated features $\tilde{x}$. Another CVAE-based bidirectional learning model is GDAN [170] .
It contains a discriminator to estimate the similarity level with respect to each visual-textual feature pair. The discriminator communicates with two other networks using a dual adversarial loss function. CADA-VAE [171] learns the shared cross-modal latent features of image and class attributes through a set of distribution alignment (DA) and cross alignment (CA) loss functions. A VAE model [139] is exploited to learn the latent features of each data modality, i.e., image and semantic representations. Then, the CA loss function, which can be estimated through decoding the latent feature of a data sample from other modality of the same class, is used for reconstruction. In addition, the DA loss function is used to match generated image and class representations by minimizing their distance. Chen et al. [172] attempted to minimize the uncertainty in the overlapped areas of both seen and unseen classes using an entropy-based calibration. Li et al. [173] employed a multivariate regressor to map back the output the VAE decoder to the class attributes. On one hand, VAEs generate blurry images due to the use of the element-wise distance in their structures [174] . On the other hand, the training process of GANs is not stable [175] . To address these limitations, Larsen et al. [174] proposed VAEGAN, in which the discriminator in GAN is used to measure the similarity metric. VAEGAN with similarity measure is able to generate better samples than models with element-wise metrics. Later, Gao et al. [176] , [177] proposed a joint generative model (known as Zero-VAE-GAN) to generate high-quality visual features for the unseen classes. More specifically, CVAE conditioned on semantic attributes is combined with CGAN conditioned on both categories and attributes. Besides that, a categorization network and a perceptual reconstruction loss [178] are incorporated to generate high-quality features. Verma et al. [100] leveraged the meta-learning strategy to train a generative model that integrates CVAE and CGAN to generate visual features from data sets that contain few samples for each class. Moreover, Xu et al. [179] proposed a dual learning framework based on CVAE and CGAN with an additional classifier to generate more discriminative visual features for unseen classes. Apart from GANs and VAEs, several studies have attempted to generate visual features for the unseen classes using other approaches. As an example, the unseen visual data synthesis (UVDS) [180] method exploits a diffusion regularization (DR) to synthesize visual features for the unseen classes using embedding matrices. The class-specific synthesized dictionary (CSSD) [181] learns a class-specific encoding matrix in a latent space for each class and consequently a dictionary matrix within a dictionary framework. Then, the encoding matrices with the affinity seen classes, i.e., seen classes similar to unseen ones, are used to generate visual features for the unseen classes. Feng and Zhao [182] built a generative model by extracting two types of knowledge, i.e., local rational knowledge and global rational knowledge, from the visual and semantic representations, respectively. Li et al. [183] leveraged the most similar seen classes to the unseen ones in the semantic space to generate visual features for the unseen classes. In [99] , [184] - [187] , autoencoders are used to generate visual features for the unseen classes. Shi and Wei [184] developed an encoder to map the visual features to the semantic embedding space. 
The regressor's feedback is imposed on the decoder to reconstruct truthful visual features. Then, the learned decoder generates visual features for the unseen classes to train a classifier. The bi-adversarial auto-encoder (BAAE) [185] pairs an auto-encoder with two adversarial networks. On one hand, the encoder, which operates as a generator, feeds the visual and synthesized features into an adversarial network to capture the real distribution. On the other hand, the decoder, which acts as semantic inference, formulates real class semantics for inference toward another adversarial network, to enforce both real and synthesized visual features to be related to the semantic representation. Similar to BAAE, the multi-modality adversarial auto-encoder (MAAE) [186] pairs an auto-encoder with multi-modality adversarial networks. The encoder generates visual features, while the decoder aims to relate both the generated and the real features to the class semantics. Both BAAE and MAAE integrate classification networks to ensure that the generated and real features are discriminative. Moreover, Liu et al. [99] tackled the limitations caused by diverse data distributions in GZSL by proposing a meta-learning framework based on autoencoders. In contrast, the studies in [69] , [98] , [188] attempt to synthesize classifiers instead of generating visual features for the unseen classes. EXEM [69] , [188] learns a function to predict the locations of the visual features with respect to the unseen classes. The visual exemplar of each class is created by averaging the Principal Component Analysis (PCA) projections of the data samples belonging to that class. Then, D regressors are learned to map the semantic representations to the visual exemplars. Finally, the similarity level between the test samples and the predicted exemplars is computed to produce a prediction. E-PGN [98] , which is an episode-based framework, generates class-level visual samples conditioned on semantic representations. As explained earlier, GZSL methods under the transductive learning setting can alleviate the domain shift problem by taking unlabeled data samples from the unseen classes into consideration. Accessing the unlabeled data allows the model to know the distribution of the unseen classes and consequently learn a discriminative projection function. It also permits the synthesis of related visual features for the unseen classes using generative-based methods. Since the number of publications is limited, we categorize transductive-based GZSL methods into two groups: embedding-based and generative-based methods. Table 6 summarizes the transductive-based GZSL methods. The embedding-based category of transductive GZSL methods leverages the unlabeled data samples from the unseen classes to learn a projection function between the visual and semantic spaces. Using these unlabeled data samples, the geometric structure of the unseen classes can be estimated. Then, a discriminative projection function in the common space, i.e., semantic, visual, or latent, can be formulated, in order to alleviate the domain shift problem. To map the visual features of the seen classes to their corresponding semantic representations, a classification loss function is derived, and an additional constraint is required to extract useful information from the data samples of the unseen classes. An example is quasi-fully supervised learning (QFSL) [189] (Fig. 12),
As explained earlier, GZSL methods under the transductive learning setting can alleviate the domain shift problem by taking unlabeled data samples from the unseen classes into consideration. Accessing the unlabeled data allows the model to know the distribution of the unseen classes and, consequently, to learn a discriminative projection function. It also permits the synthesis of related visual features for the unseen classes using generative-based methods. Since the number of publications is limited, we categorize transductive-based GZSL methods into two groups: embedding-based and generative-based methods. Table 6 summarizes the transductive-based GZSL methods. The embedding-based category of transductive GZSL methods leverages the unlabeled data samples from the unseen classes to learn a projection function between the visual and semantic spaces. Using these unlabeled data samples, the geometric structure of the unseen classes can be estimated. Then, a discriminative projection function in the common space, i.e., semantic, visual, or latent, can be formulated in order to alleviate the domain shift problem. To map the visual features of the seen classes to their corresponding semantic representations, a classification loss function is derived, and an additional constraint is required to extract useful information from the data samples of the unseen classes. An example is quasi-fully supervised learning (QFSL) [189] (Fig. 12), which projects the visual features of the seen classes onto a number of fixed points in a semantic space using a fully connected network with the ReLU activation function. The unlabeled visual features from the unseen classes are mapped onto other points with the following loss function: $L = L_p + \gamma L_b + \lambda \Omega$, where $L_p$ and $\Omega$ are the classification loss function and the regularization factor, $\gamma$ and $\lambda$ are weighting coefficients, and $L_b$ is the bias loss, i.e., $L_b = -\ln \sum_{u \in \mathcal{Y}^u} p_u$, where $p_u$ indicates the predicted probability of the $u$-th unseen class. Other semantic embedding models have been proposed, e.g., [16], [72] and [190]. Fu et al. [16] developed a semantic manifold structure of the class prototypes distributed in the embedding space. The dual visual-semantic mapping paths (DMaP) model [190] is formulated to learn the connection between the semantic manifold structure and the visual-semantic mapping of the seen classes. It extracts the inter-class relations between the seen and unseen classes in both spaces. For a given test sample, a prediction is produced if the inter-class relationship consistency is satisfied. On the other hand, adaptive embedding ZSL (AEZSL) [72] learns a visual-semantic mapping for each unseen class by assigning higher weights to the recognition tasks pertaining to more relevant seen classes. Several studies [35], [36], [191] focus on the development of visual embedding-based models under the transductive setting. Hu et al. [36] leveraged super-class prototypes, instead of the seen/unseen class prototypes, to align the seen and unseen class domains. The K-means clustering algorithm is used to group the semantic representations of all seen and unseen classes into r super-classes. DTN [191], which is a probabilistic method, decomposes the training phase into two independent parts. One part adopts the cross-entropy loss function to learn from the labeled seen classes, while the other part applies a combination of the cross-entropy loss function and the Kullback-Leibler divergence to learn from the unlabeled samples of the unseen classes. In the second part, the cross-entropy term prevents samples from the unseen classes from being transferred to the seen classes. MFMR [192] employs a matrix tri-factorization [193] framework to construct a projection function in the latent space. Two manifold regularization terms are formulated to preserve the geometric structure in both spaces, while a test-time manifold structure is developed to alleviate the shift problem. In [37], [38], the pseudo-labeling technique is used to address the bias problem by incorporating the test samples into the training process. The generative-based category of transductive GZSL methods uses the unlabeled unseen data samples to generate the related visual features for the unseen classes. A generative model is usually formulated using the data samples of the seen classes, while the unlabeled samples from the unseen classes are used to fine-tune the model based on unsupervised strategies [176], [194]. As an example, Gao et al. [176] developed two self-training strategies based on a pseudo-labeling procedure. The first strategy uses the K-nearest neighbour (K-NN) algorithm to provide pseudo-labels for the unseen visual features. Specifically, the semantic representations of the unseen classes are used to generate N fake unseen visual features from the Gaussian distribution. Then, for each unseen class, the average of the N fake features is used as an anchor. At the same time, K-NN is used to update the anchors. Finally, the top M percent of the unseen features are selected to fine-tune the generative model. The second strategy obtains the pseudo-labels directly through the classification probability.
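A highly simplified sketch of the first (anchor-based) strategy described above is given below, using nearest-anchor assignment as a stand-in for the K-NN step; the generator interface, the parameter names and the per-class top-M-percent selection are illustrative assumptions, not the authors' code.

```python
import numpy as np

def self_train_pseudo_labels(generate, S_unseen, X_unlabeled, n_fake=300, top_m=0.2):
    # 1) For every unseen class, generate n_fake visual features from its
    #    semantic vector and use their mean as the class anchor.
    anchors = np.stack([generate(s, n_fake).mean(axis=0) for s in S_unseen])

    # 2) Assign each unlabeled unseen-class feature to its nearest anchor
    #    (pseudo-label), then refresh the anchors with the assigned features.
    dists = ((X_unlabeled[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    pseudo = dists.argmin(axis=1)
    for c in range(len(S_unseen)):
        members = X_unlabeled[pseudo == c]
        if len(members) > 0:
            anchors[c] = members.mean(axis=0)

    # 3) Keep only the top M percent most confident samples per class
    #    (smallest distance to their anchor); these are used to fine-tune
    #    the generative model in the next round.
    conf = dists[np.arange(len(pseudo)), pseudo]
    keep = np.zeros(len(pseudo), dtype=bool)
    for c in range(len(S_unseen)):
        idx = np.where(pseudo == c)[0]
        if len(idx) > 0:
            n_keep = max(1, int(top_m * len(idx)))
            keep[idx[np.argsort(conf[idx])[:n_keep]]] = True
    return pseudo, keep
```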
SABR-T [32] uses a WGAN to generate a latent space for the unseen classes by minimizing the marginal difference between the true latent-space representation of the unlabeled samples of the unseen classes and the generated space. On the other hand, VAEGAN-D2 [195] learns the marginal feature distribution of the unlabeled samples using an additional unconditional discriminator. Owing to the ubiquitous demand for machine learning techniques that can exploit large-scale data sets and to rapid digital advances in recent years, GZSL methods are now being applied to a variety of applications, particularly in computer vision and natural language processing (NLP). We discuss these applications in the following subsections, which are divided into image classification (the most popular application of GZSL), object detection, video processing, and NLP. In computer vision, GZSL is applied to solve problems related to both images and videos.
Image processing: Image classification is among the most popular applications of GZSL methods. The aim is to classify images from both seen and unseen classes. In this regard, a number of standard benchmark data sets have been introduced. ImageNet [1], which is organized according to the WordNet hierarchy [196], is one of the well-known image data sets widely used, under different settings, in various studies in the computer vision and image processing domains. The original ImageNet includes over 15 million labeled high-resolution images related to approximately 22,000 classes (categories). Several researchers have used the full ImageNet data set in their studies, e.g., [197], [198]. On the other hand, there exist several attribute-based data sets. CUB-200-2011 (CUB) [199] is an extended form of the original CUB-200 data set [200]. It contains approximately twice as many images in each category as the original CUB-200 version. All images in CUB-200-2011 are annotated with part locations, attribute labels, and bounding boxes. The aPascal-aYahoo data set [201] contains 12,695 images pertaining to the original PASCAL VOC 2008 data samples, categorized into 20 different classes (aPascal). It also includes 2,644 images collected using the Yahoo image search engine (aYahoo), related to 12 other classes. All images are annotated with 64 binary attributes that specify the visible objects. The Animals with Attributes (AWA) data set [14] contains over 30,000 animal images belonging to 50 classes. AWA1 [202] is a coarse-grained version of AWA, which includes 30,475 images in 40 classes for training and 10 additional classes for testing. AWA2 [21] is an extended version of AWA1 that includes more images. Scene UNderstanding (SUN) [203], which is a well-known scene-related data set, contains 130,519 images in 899 classes.
The SUN attribute data set [204], [205] is a subset of the original SUN version for fine-grained scene classification. It has 14,340 images related to 717 classes. North America Birds (NAB) [206] is another bird data set, including 48,562 images belonging to 1,011 categories. Among the aforementioned data sets, the attributes of CUB are focused on local information, such as "has wing pattern spotted" and "has throat color orange". The attributes of SUN cover multiple broad areas, such as "trees" and "man-made", although SUN has a limited number of images per class. Note that the AWA2, SUN and CUB data sets have even distributions of samples per class. In contrast, DeepFashion [207], which is a large-scale clothing data set of over 800,000 images with massive attribute annotations, follows a more realistic long-tail distribution. Other available data sets pertaining to image classification tasks include the Large-Scale Attribute data set (LAD) [208], Stanford Dogs [209], and Oxford Flowers (FLO) [210]. Table 1 summarizes the details of these data sets. Object detection is another popular application in computer vision that has increasingly gained importance for large-scale tasks. It aims to locate objects, in addition to recognizing them. Over the years, many CNN-based models have been developed to recognize the seen classes. However, collecting sufficient annotated samples with ground-truth bounding boxes is not scalable, and new methods have emerged. Currently, advances in zero-shot detection allow conventional object detection models to detect classes that do not match previously learned ones. The methods reported in the literature range from detecting general categories with single labels [211]-[216] to multi-label [217], [218] and multi-view (CT and X-ray) [219] detection, as well as tongue constitution recognition [220]. In addition, GZSL has been applied to segment both seen and unseen categories [221], [222], to retrieve images from large-scale data sets [223], [224], and to perform image annotation [45].
Video processing: Recognizing human actions and gestures from videos is one of the most challenging tasks in computer vision, due to the variety of actions that are not available among the seen action categories. In this regard, GZSL-based frameworks have been employed to recognize single-label [23], [168], [225], [226] and multi-label [227] human actions. As an example, CLASTER [226] is a clustering-based method that uses reinforcement learning. In [227], a multi-label ZSL (MZSL) framework using joint latent ranking embedding (JLRE) is proposed. The relation scores of various action labels are measured for the test video clips in the semantic embedding and joint latent visual spaces. In addition, multi-modal frameworks using audio, video, and text were introduced in [228], [229]. NLP [230], [231] and text analysis [232] are two important application areas of machine learning and deep learning methods. In this regard, GZSL-based frameworks have been developed for various NLP applications, such as single-label [90], [162] and multi-label [102], [233] text classification, as well as learning from noisy text descriptions [46], [47]. Wang et al. [90] proposed a method based on semantic embeddings and categorical relationships, in which a knowledge graph is exploited to provide supervision for learning meaningful classifiers on top of the semantic embedding mechanism. The application of GZSL to multi-label text classification plays a key role in generating latent features from text data.
Song et al. [233] proposed a new GZSL model in a study on the International Classification of Diseases (ICD), to identify classification codes for the diagnosis of various diseases. The model improves the prediction of unseen ICD codes without compromising its performance on seen ICD codes. On the other hand, for multi-label ZSL, Huynh and Elhamifar [102] trained all the applied models on the seen labels and then tested them on the unseen labels, whereas for multi-label GZSL, they tested all models on both seen and unseen labels. In this regard, a new shared multi-attention mechanism with a novel loss function is devised to predict all labels of an image, which may contain multiple unseen categories. In this section, we discuss the main findings of our review. Several research gaps that lead to future research directions are highlighted. We categorize the GZSL methods into two groups: (i) embedding-based and (ii) generative-based methods. Embedding-based methods learn an embedding space, whether visual-semantic, semantic-visual, common/latent, or a combination of them (bidirectional), to link the visual space of the seen classes to their corresponding semantic space. They use the learned embedding space to recognize data samples from both seen and unseen classes. Various strategies have been developed to learn the embedding space. We categorize these strategies into graph-based, autoencoder-based, meta-learning-based, attention-based, compositional learning-based, bidirectional learning-based and other methods. In contrast, the generative-based methods convert GZSL into a conventional supervised learning problem by generating visual features for the unseen classes. Since the visual features of the unseen classes are not available during training, learning a projection function or a generative model using data samples from the seen classes cannot guarantee generalization to the unseen classes. Although low-level attributes are useful for the classification of the seen classes, it is challenging to extract useful information from structured data samples due to the difficulty in separating different classes. Thus, the main challenge of both groups of methods is the lack of unseen visual samples for training, leading to the bias and domain shift problems. Table 2 summarizes the main properties of the embedding- and generative-based methods:
Embedding-based methods: less complex structure and easy to implement; various projection functions, such as linear and non-linear ones, can be selected according to the properties of the data set. However, these methods suffer from the hubness, shift and bias problems when a semantic embedding space is learned, and from the shift and bias problems when a visual or latent space is learned.
Generative-based methods: a large number of samples can be generated for the unseen classes, and a variety of supervised learning models can then be used to perform recognition. However, these methods are complex in structure, difficult to train, unstable during training, and susceptible to the mode collapse issue.
In addition, Table 3 describes the different embedding- and generative-based methods:
Graph-based methods: able to preserve the geometric structure of features in the latent space, which leads to a compact representation; however, learning a classifier using graph information is challenging, and these methods have a complex structure.
Meta-learning-based methods: able to learn discriminative features without labeled samples from novel classes, even without fine-tuning.
Attention-based methods: able to identify fine-grained classes and recognize important information pertinent to performing GZSL tasks; however, the attention mechanism increases the computational load, affecting real-time implementation.
Bidirectional learning-based methods: able to learn more generalizable projections.
Autoencoder-based methods: able to learn in an unsupervised manner with a simple structure, which leads to more discriminative features and reduced structural complexity.
GANs: able to alleviate the bias and domain shift problems; however, GANs are difficult to train, there is a lack of variety in the generated samples, and they are susceptible to mode collapse and an unstable training process.
VAEs: able to alleviate the bias and domain shift problems; however, VAEs generate blurry images due to the use of element-wise distances in their structures.
Embedding-based vs. Generative-based methods: Embedding-based models have been proposed to address both the ZSL and GZSL problems. These methods are less complex, and they are easy to implement. However, their capability to transfer knowledge is restricted by semantic loss, while the lack of visual samples for the unseen classes causes the bias problem. This results in poor performance of such models under the GZSL setting, as can be seen in Table 4, where the accuracy on the seen classes (Acc_s) is higher than the accuracy on the unseen classes (Acc_u). (Table 4 summarizes the embedding-based methods; "S", "V" and "L" represent the semantic, visual and latent embedding spaces, respectively, and "-" indicates that the item is not reported by the corresponding study.) In semantic embedding models, the projection of visual features to a low-dimensional semantic space shrinks their variance and restricts their discriminability [58]. The compatibility scores are unbounded, and ranking may not be able to learn certain semantic structures because of the fixed margin. Moreover, these models usually perform the search in the shared space, which causes the hubness problem [43], [51], [54]. Although visual embedding models are able to alleviate the hubness problem [51], they suffer from several issues. Firstly, the visual features and semantic representations are obtained independently and are heterogeneous, i.e., they come from different spaces. As such, the data distributions of the two spaces can be different, and two close categories in one space can be located far away from each other in the other space [62]. In addition, regression-based methods cannot explicitly discover the intrinsic topological structure between the two spaces. Therefore, learning a projection function directly from the space of one modality to the space of another modality may cause information loss, increased complexity, and overfitting toward the seen classes (see Table 4). To mitigate this issue, several studies have attempted to project the visual features and semantic representations into a latent space. Such a projection can reconcile the structural differences between the two spaces. Besides that, bidirectional learning models aim to learn a better projection and to adjust the seen-unseen class domains. However, these models still suffer from the domain shift and bias problems. Generative-based methods are more effective in solving the bias problem, owing to the synthesis of visual features for the unseen classes. This leads to a more balanced Acc_s and Acc_u performance and, consequently, a higher harmonic mean (H), as compared with those of the embedding-based methods (see Tables 4 and 5). However, the generative-based methods are still prone to the bias problem. While the availability of visual samples for the unseen classes allows these models to perform recognition of both seen and unseen classes in a single process, they generate visual features by learning a model conditioned on the seen classes; they do not learn a generic model that generalizes to the generation of both seen and unseen classes [97]. They are complex in structure and difficult to train (owing to instability). Their performance is restricted either by estimating the distribution of visual features using semantic representations or by using the Euclidean distance as the constraint to retain the information between the generated visual features and the real semantic representations. In addition, the unconstrained generation of data samples for the unseen classes may produce samples far from their actual distribution. Moreover, some of these methods have access to the semantic representations of the unseen classes (under the semantic transductive setting) or to unlabeled unseen data (under the transductive setting), which violates the strict GZSL setting.
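For reference, the quantities compared above and in the following paragraph are the standard GZSL measures: Acc_s and Acc_u denote the (typically per-class averaged) accuracies on the seen and unseen test classes, and the harmonic mean H is commonly computed as

$$ H = \frac{2 \times \mathrm{Acc}_s \times \mathrm{Acc}_u}{\mathrm{Acc}_s + \mathrm{Acc}_u}, $$

so that H is high only when both Acc_s and Acc_u are high, penalizing models that are strongly biased towards the seen classes.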
Inductive setting vs. Transductive setting: As stated earlier, GZSL methods under the transductive setting know the distribution of the unseen classes, as they have access to the unlabeled samples of the unseen classes. Therefore, they can alleviate the bias and shift problems. This results in a more balanced Acc_s and Acc_u performance with a higher H score, as compared with those of GZSL methods under the inductive and semantic transductive settings (Table 6 vs. Tables 4 and 5; in Table 6, "S", "V" and "L" represent the semantic, visual and latent embedding spaces, respectively, and "-" indicates that the score is not reported by the corresponding study). Despite considerable progress in GZSL methodologies, challenges pertaining to the unavailability of visual samples for the unseen classes remain. Based on our findings, the main research gaps that require further investigation are as follows:
• Methodology: Although many studies aiming to solve the domain shift problem are available in the literature, it remains a key challenge for the existing models. Most of the existing GZSL methods are evaluated on ideal data sets, which is unrealistic for real-world scenarios. In practice, the ideal setting is affected by uncertain disturbances, e.g., a few samples in each category may differ from the other samples. Therefore, developing robust GZSL models is crucial. To achieve this, new frameworks that incorporate domain classification without relying on latent space learning are required. On the other hand, techniques that are capable of solving supervised classification problems can be adopted to solve the GZSL problem, e.g., ensemble models [236] and meta-learning strategies [97]. Ensemble models employ a number of individual classifiers to produce multiple predictions, and the final decision is reached by combining these predictions [237], [238]. Recently, Felix et al. [64] introduced an ensemble of visual and semantic classifiers to explore the multi-modality aspect of GZSL. Meta-learning aims to improve the learning ability of a model based on the experience of several learning episodes [97], [107]. GZSL can also be combined with reinforcement learning [226], [239], [240] to better tackle new tasks. In addition, GZSL can be extended on several fronts, including multi-modal learning [76], [228], [229], multi-label learning [218], multi-view learning [219], weakly supervised learning to progressively incorporate training instances from easy to hard [86], [124], continual learning [96], long-tail learning [241], online learning, and few-shot learning, where a small portion of labeled samples from some classes is available [59]. Recently, transformer-based language models have shown superior performance on various NLP tasks. As an example, generative pre-training (GPT) [242] is a semi-supervised technique that combines unsupervised pre-training with supervised fine-tuning. It uses a large corpus of unlabelled text along with a limited number of manually annotated samples in its operation.
GPT-2 [243] and GPT-3 [244] are improved versions of GPT, which are trained on large-scale data sets for solving tasks under the ZSL and few-shot settings, respectively. Moreover, CLIP [245] and DALL-E [246] are transformer-based techniques for zero-shot image-text matching and text-to-image generation, respectively. However, this field requires further investigation.
• Data: Semantic representations play an important role in bridging the gap between the seen and unseen classes. The existing data sets use human-defined attributes or word vectors. The former shares the same attributes among different classes, and human labor is required for annotation, which is not suitable for large-scale data sets. The latter automatically extracts information from a large text corpus, which is usually noisy. Therefore, extracting useful knowledge from such data sets is difficult, especially for fine-grained data sets [228], [229]. In this regard, using other modalities, such as audio, can improve the quality of the data samples. As such, it is crucial to focus on developing new techniques to automatically generate discriminative semantic attribute vectors, and on exploring other semantic spaces or combinations of various semantic embeddings that can accurately formulate the relationship between the seen and unseen classes, and consequently alleviate the domain shift and bias problems. In addition, there are many unforeseen scenarios in real-world applications, such as driverless cars or action recognition, that may not be covered by the seen classes. While GZSL can be applied to tackle such problems, no suitable data sets for such applications are available at the moment. This constitutes a key focus area for future research.
Building models that can simultaneously perform recognition for both seen (source) and unseen (target) classes is vital for advancing intelligent data-based learning models. This paper, to the best of our knowledge, presents the first comprehensive review of GZSL methods. Specifically, a hierarchical categorization of GZSL methods along with their representative models has been provided. In addition, evaluation protocols, including benchmark problems and performance indicators, together with future research directions, have been presented. This review aims to promote a better understanding of GZSL.
ImageNet: A large-scale hierarchical image database
Generalized zero-shot learning with deep calibration network
A survey of zero-shot learning: Settings, methods, and applications
Recent advances in deep learning
SpinalNet: Deep neural network with gradual input
A review of uncertainty quantification in deep learning: Techniques, applications and challenges
One-shot learning of object categories
Toward open set recognition
Toward open set recognition
Generalized out-of-distribution detection: A survey
Recognition-by-components: A theory of human image understanding
Zero-data learning of new tasks
Attribute-based classification for zero-shot visual object categorization
Learning to detect unseen object classes by between-class attribute transfer
Zero-shot learning through cross-modal transfer
Zero-shot learning on semantic class prototype graph
Generalized zero-shot learning with multi-source semantic embeddings for scene recognition
Pseudo distribution on unseen classes for generalized zero shot learning
An embarrassingly simple approach to zero-shot learning
An empirical study and analysis of generalized zero-shot learning for object recognition in the wild
Zero-shot learning - the good, the bad and the ugly
Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly
Generalized zero-shot learning for action recognition with web-scale video data
Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content
Zero-shot learning and its applications from autonomous vehicles to COVID-19 diagnosis: A review
Deep residual learning for image recognition
Very deep convolutional networks for large-scale image recognition
Going deeper with convolutions
Generalized zero-shot recognition based on visually semantic embedding
Triple verification network for generalized zero-shot learning
Self-focus deep embedding model for coarse-grained zero-shot classification
Semantically aligned bias reducing zero shot learning
Transductive zero-shot learning for 3D point cloud classification
Transductive learning for zero-shot object detection
Extreme reverse projection learning for zero-shot recognition
Zero-shot learning with superclasses
Pseudo transfer with marginalized corrupted attribute for zero-shot learning
Towards effective deep embedding for zero-shot learning
Holistically-associated transductive zero-shot learning
A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning
Distributed representations of words and phrases and their compositionality
Semantic autoencoder for zero-shot learning
Simple is better: A global semantic consistency based end-to-end framework for effective zero-shot learning
A novel dataset-specific feature extractor for zero-shot learning
Inductive zero-shot image annotation via embedding graph
Link the head to the "beak": Zero shot learning from noisy text description at part precision
A generative adversarial approach for zero-shot learning from noisy texts
Webly supervised semantic embeddings for large scale zero-shot learning
ZEST: Zero-shot learning from text descriptions using textual similarity and visual summarization
Multi-cue zero-shot learning with strong supervision
Label-activating framework for zero-shot learning
Ridge regression, hubness, and zero-shot learning
Semantic similarity based softmax classifier for zero-shot learning
Zero-shot visual recognition using semantics-preserving adversarial embedding networks
Deep unbiased embedding transfer for zero-shot learning
Zero-shot learning via shared-reconstruction-graph pursuit
Dissimilarity representation learning for generalized zero-shot recognition
Learning a deep embedding model for zero-shot learning
Generalized zero-shot learning via synthesized examples
Hubs in space: Popular nearest neighbors in high-dimensional data
Improving zero-shot learning by mitigating the hubness problem
Transductive multi-view zero-shot learning
Co-representation network for generalized zero-shot learning
Multi-modal ensemble classification for generalized zero shot learning
Autoencoder based novelty detection for generalized zero shot learning
Adaptive confidence smoothing for generalized zero-shot learning
Domain-aware visual bias eliminating for generalized zero-shot learning
From classical to generalized zero-shot learning: A simple adaptation process
Classifier and exemplar synthesis for zero-shot learning
Attribute prototype network for zero-shot learning
Dual-view ranking with hardness assessment for zero-shot learning
Zero-shot learning via category-specific visual-semantic mapping and label refinement
Zero-shot image recognition using relational matching, adaptation and calibration
Fine-grained generalized zero-shot learning via dense attribute-based attention
CLAREL: Classification via retrieval loss for zero-shot learning
Augmentation network for generalised zero-shot learning
Distilling the knowledge in a neural network
Domain segmentation and adjustment for generalized zero-shot learning
Improving generalized zero-shot learning by semantic discriminator
Cluster-based zero-shot learning for multivariate data
Generalised zero-shot learning with domain classification in a joint semantic and visual space
Guided CNN for generalized zero-shot and open-set recognition using visual and semantic prototypes
Graph neural networks: A review of methods and applications
Graph learning: A survey
Asymmetric graph based zero shot learning
Semi-supervised low-rank semantics grouping for zero-shot learning
A novel approach based on fully connected weighted bipartite graph for zero-shot learning problems
Zero-shot learning in the presence of hierarchically coarsened labels
Transferrable feature and projection learning with class hierarchy for zero-shot learning
Zero-shot recognition via semantic embeddings and knowledge graphs
From anchor generation to distribution alignment: Learning a discriminative embedding space for zero-shot recognition
Region graph embedding network for zero-shot learning
Semi-supervised classification with graph convolutional networks
Learning to compare: Relation network for few-shot learning
Correction networks: Meta-learning for zero-shot learning
Meta-learned attribute self-gating for continual generalized zero-shot learning
Meta-learning for generalized zero-shot learning
Episode-based prototype generating network for zero-shot learning
Task aligned generative meta-learning for zero-shot learning
Towards zero-shot learning with fewer seen class examples
Attentive region embedding network for zero-shot learning
A shared multi-attention framework for multi-label zero-shot learning
Stacked semantics-guided attention model for fine-grained zero-shot learning
Goal-oriented gaze estimation for zero-shot learning
Simple and effective localized attribute representations for zero-shot learning
Semantic-guided multi-attention localization for zero-shot learning
Attribute propagation network for graph zero-shot learning
DiSAN: Directional self-attention network for RNN/CNN-free language understanding
A semantics-guided class imbalance learning model for zero-shot classification
Zero-shot recognition through image-guided semantic classification
Detecting visually relevant sentences for fine-grained classification
From red wine to red tomato: Composition with context
Learning graph embeddings for compositional zero-shot learning
Compositional learning for human object interaction
NEIL: Extracting visual knowledge from web data
Task-driven modular networks for zero-shot compositional learning
Locality and compositionality in zero-shot learning
Compositional zero-shot learning via fine-grained dense feature composition
A causal view of compositional zero-shot recognition
Joint projection and subspace learning for zero-shot recognition
Dual triplet network for image zero-shot learning
Zero shot learning via low-rank embedded semantic autoencoder
Preserving semantic relations for zero-shot learning
Zero-shot learning via latent space encoding
Compressing unknown images with product quantizer for efficient zero-shot classification
SitNet: Discrete similarity transfer network for zero-shot hashing
Generalized zero-shot learning with multi-channel Gaussian mixture VAE
Modeling inter and intra-class relations in the triplet loss for zero-shot learning
Transferable contrastive network for generalized zero-shot learning
Discriminant zero-shot learning with center loss
Learning class prototypes via structure alignment for zero-shot recognition
Learning discriminative domain-invariant prototypes for generalized zero shot learning
Beyond attributes: High-order attribute features for zero-shot learning
A joint label space for generalized zero-shot classification
Deep generative image models using a Laplacian pyramid of adversarial networks
Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach
Generating visual representations for zero-shot classification
Generative adversarial nets
Auto-encoding variational Bayes
Wasserstein generative adversarial networks
Improved training of Wasserstein GANs
Feature generating networks for zero-shot learning
Zero-shot recognition via optimal transport
Transfer feature generating networks with semantic classes structure for zero-shot learning
Model-agnostic meta-learning for fast adaptation of deep networks
Leveraging the invariant side of generative zero-shot learning
Similarity preserving feature generating networks for zero-shot learning
SR-GAN: Semantic rectifying generative adversarial network for zero-shot learning
Rectifier nonlinearities improve neural network acoustic models
Creativity inspired zero-shot learning
CIZSL++: Creativity inspired generative zero-shot learning
The Clockwork Muse: The Predictability of Artistic Change
Generalized zero-shot learning with multiple graph adaptive generative networks
Multi-knowledge fusion for new feature generation in generalized zero-shot learning
Cross knowledge-based generative zero-shot learning approach with taxonomy regularization
Generative moment matching networks
Conditional image synthesis with auxiliary classifier GANs
Generalized denoising auto-encoders as generative models
Adversarial autoencoders
Dual adversarial semantics-consistent network for generalized zero-shot learning
Unpaired image-to-image translation using cycle-consistent adversarial networks
CANZSL: Cycle-consistent adversarial networks for zero-shot learning from natural language
Multi-modal cycle-consistent generalized zero-shot learning
Alleviating feature confusion for generative zero-shot learning
Investigating the bilateral connections in generative zero-shot learning
Two-level adversarial visual-semantic coupling for generalized zero-shot learning
Learning structured output representation using deep conditional generative models
A generative model for zero shot learning using conditional variational autoencoders
Don't even look once: Synthesizing features for zero-shot detection
Generative dual adversarial network for generalized zero-shot learning
Generalized zero- and few-shot learning via aligned variational autoencoders
Entropy-based uncertainty calibration for generalized zero-shot learning
Generalized zero shot learning via synthesis pseudo features
Autoencoding beyond pixels using a learned similarity metric
Improved techniques for training GANs
Zero-VAE-GAN: Generating unseen features for generalized and transductive zero-shot learning
A joint generative model for zero-shot learning
Perceptual losses for real-time style transfer and super-resolution
Dual generative network with discriminative information for generalized zero-shot learning
Zero-shot learning using synthesised unseen visual data with diffusion regularisation
Class-specific synthesized dictionary model for zero-shot learning
Transfer increment for generalized zero-shot learning
Learning domain invariant unseen features for generalized zero-shot classification
Discriminative embedding autoencoder with a regressor feedback for zero-shot learning
Bi-adversarial auto-encoder for zero-shot learning
Multi-modality adversarial auto-encoder for zero-shot learning
A generative model for zero-shot learning via Wasserstein auto-encoder
Predicting visual exemplars of unseen classes for zero-shot learning
Transductive unbiased embedding for zero-shot learning
Zero-shot recognition using dual visual-semantic mapping paths
Deep transductive network for generalized zero shot learning
Matrix tri-factorization with manifold regularizations for zero-shot learning
Orthogonal nonnegative matrix t-factorizations for clustering
Self-supervised domain-aware generative network for generalized zero-shot learning
f-VAEGAN-D2: A feature generating framework for any-shot learning
WordNet: A lexical database for English
Zero-shot learning by convex combination of semantic embeddings
Synthesized classifiers for zero-shot learning
The Caltech-UCSD Birds-200-2011 dataset
Caltech-UCSD Birds 200
Describing objects by their attributes
Attribute-based classification for zero-shot visual object categorization
SUN database: Large-scale scene recognition from abbey to zoo
SUN attribute database: Discovering, annotating, and recognizing scene attributes
The SUN attribute database: Beyond categories for deeper scene understanding
Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection
DeepFashion: Powering robust clothes recognition and retrieval with rich annotations
A large-scale attribute dataset for zero-shot learning
Novel dataset for fine-grained image categorization: Stanford Dogs
Automated flower classification over a large number of classes
Don't even look once: Synthesizing features for zero-shot detection
Zero shot learning for visual object recognition with generative models
Hierarchical novelty detection for visual object recognition
Context-aware zero-shot learning for object recognition
Mitigating the hubness problem for zero-shot learning of 3D objects
Zero-shot learning with deep neural networks for object recognition
Multi-label zero-shot learning with structured knowledge graphs
Generative multi-label zero-shot learning
Generalized zero-shot chest X-ray diagnosis through trait-guided multi-view semantic embedding with self-training
Grouping attributes zero-shot learning for tongue constitution recognition
Zero-shot semantic segmentation
Zero-shot semantic segmentation using relation network
Generalized zero-shot cross-modal retrieval
OCEAN: A dual learning approach for generalized zero-shot sketch-based image retrieval
Out-of-distribution detection for generalized zero-shot action recognition
CLASTER: Clustering with reinforcement learning for zero-shot action recognition
Multi-label zero-shot human action recognition via joint latent ranking embedding
Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos
AVGZSLNet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings
A novel method for sentiment classification of drug reviews using fusion of deep and machine learning techniques
ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis
Energy choices in Alaska: Mining people's perception and attitudes from geotagged tweets
Generalized zero-shot text classification for ICD coding
SAN: Sampling adversarial networks for zero-shot learning
Generative model with semantic embedding and integrated classifier for generalized zero-shot learning
Neural network ensembles
An improved fuzzy ARTMAP and Q-learning agent model for pattern classification
A Q-learning-based multi-agent system for data classification
Introduction to reinforcement learning
A reinforced fuzzy ARTMAP model for data classification
From generalized zero-shot learning to long-tail with class descriptors
Improving language understanding by generative pre-training
Language models are unsupervised multitask learners
Language models are few-shot learners
Learning transferable visual models from natural language supervision
Zero-shot text-to-image generation