title: Zero-Shot Learning and its Applications from Autonomous Vehicles to COVID-19 Diagnosis: A Review
authors: Rezaei, Mahdi; Shahidi, Mahsa
date: 2020-10-02
journal: Intell Based Med
DOI: 10.1016/j.ibmed.2020.100005

The challenge of learning to recognise a new concept, object, or medical disease without receiving any examples beforehand is called Zero-Shot Learning (ZSL). One of the major issues in deep-learning-based methodologies, such as in medical imaging and other real-world applications, is the requirement for large annotated datasets prepared by clinicians or experts to train the model. ZSL is known for requiring minimal human intervention by relying only on previously known or trained concepts plus currently existing auxiliary information. It is an ever-growing research area for cases where very limited or no annotated datasets are available and the detection/recognition system should exhibit human-like characteristics in learning new concepts. This makes ZSL applicable to many real-world scenarios, from unknown object detection in autonomous vehicles to medical imaging and unforeseen diseases such as COVID-19 Chest X-Ray (CXR) based diagnosis. In this review paper, we first introduce the broader family of few/one-shot learning solutions, and present the definition of the ZSL problem as an extreme case of few-shot learning. We review the fundamentals and the challenging steps of Zero-Shot Learning, including the state-of-the-art categories of solutions, our recommended solution, the motivations behind each approach, and their advantages over the other categories, to guide both clinicians and AI researchers towards the best techniques and practices for their applications. Inspired by the different settings and extensions, we then review the datasets, including medical and non-medical images, the variety of splits, and the evaluation protocols proposed so far. Finally, we discuss the recent applications and future directions of ZSL. We aim to convey a useful intuition through this paper towards the goal of handling complex learning tasks more similarly to the way humans learn. We mainly focus on two applications in the current modern yet challenging era: coping with an early and fast diagnosis of COVID-19 cases, and encouraging the readers to develop other similar AI-based automated detection/recognition systems using ZSL.

Object recognition is one of the most highly researched areas of computer vision. Recent recognition models have achieved great performance through established techniques and large annotated datasets. After several years of research, the attention on this topic has not dimmed; on the contrary, it has been shown that there is still room to refine models and eliminate existing issues in this area. The number of newly emerging unknown objects is growing. Some examples of these unseen or rarely-seen objects are futuristic object designs like the next generation of concept cars, existing concepts with restricted access (such as licensed or private medical imaging datasets), rarely seen objects (such as traffic signs with graffiti on them), or fine-grained categories of objects (such as the detection of COVID-19 in comparison with the easier task of detecting a common pneumonia).
This brings the necessity of developing a fresh way of solving object recognition problems that requires less human supervision and fewer annotated datasets. Several approaches have tried to gather web images to train deep learning models, but aside from the problem of noisy images, the searched keywords are still a form of human supervision. One-Shot Learning (OSL) and Few-Shot Learning (FSL) are two solutions that are able to learn new categories from one or a few images, respectively [104], [76], [70]. Natural language processing (NLP) is another major area of research in AI, and the application of few-shot learning at the intersection of NLP and object recognition has recently become a hot topic; [165] was the first FSL-based model to improve the performance of an NLP system. Zero-shot learning (ZSL) [80], [7], [188], [38], [178] is an emerging research direction which is completely free of the laborious tasks of data collection and annotation by experts. Zero-shot learning is a learning technique that does not access any exemplars of the unseen categories during training, yet is able to build recognition models with the help of knowledge transferred from previously seen categories and auxiliary information. The auxiliary information may include textual descriptions, attributes, or vectors of word labels. This means that ZSL is interdisciplinary by nature, with two inseparable components of visual and textual data. One of the interesting facts about ZSL is its similarity to the way humans learn and recognise a new concept without seeing it beforehand. For example, a ZSL-based model would be able to automatically learn and diagnose COVID-19 patients based on existing chest X-ray images of patients with asthma and inflammatory lung diseases, which are already recognised and labelled by clinicians, plus some new auxiliary information about the COVID-19 attributes. Here, the auxiliary data can be the descriptions of physicians and clinicians about the unique types of visual patterns, features, damage, or differences they have noticed on the chest X-rays of COVID-19-positive patients compared to asthma X-ray images. A similar concept or approach is applicable in autonomous vehicles [132], where a self-driving car is responsible for the automatic detection of surrounding cars, including e.g. an unseen Tesla concept car, based on the subgroup of labelled classic sedan cars plus auxiliary information about the common differences between concept cars and classic cars; or recognising a Persian deer based on the auxiliary information available for it and its appearance similarities to or differences from other previously known deer. For instance, it belongs to a subgroup of the fallow deer, but with a larger body, bigger antlers, white spots around the neck, and also flat antlers for the male type. Figure 1(a) shows three examples of Posterior-Anterior (PA) and Anterior-Posterior (AP) projection chest X-rays of positive cases of COVID-19, and Figure 1(b) represents their corresponding axial CT scans, taken from the COVID-ChestXRay dataset [27]. As can be seen in the images, common evident anomalies may include unilateral or bilateral patchy ground-glass opacities (GGOs), patchy consolidations, and parenchymal thickening. The goal of this research is to build an artificial-intelligence-based model that can diagnose COVID-19 without being provided any visual exemplars in the training phase.
In that case, the side (auxiliary) information should be provided to assist diagnosis in the test phase. In Figure 2, the auxiliary information is provided in the form of textual descriptions for two examples: concept cars and COVID-19 X-rays. In Figure 2(a) we aim at distinguishing new unseen concept cars (bottom row), using a description of the exterior of the target and how it differs from an already learned car in an existing classic vehicle classification system such as [133]. Similarly, visual differences and similarities between healthy chest X-rays, asthma cases, and COVID-19-positive cases are described in Figure 2(b) as the auxiliary information. Let's assume our pre-trained AI-based medical imaging system is capable of detecting asthma cases, based on common deep learning techniques using a previously collected large dataset of labelled asthma chest X-ray images. However, these days we are facing an unknown COVID-19 pandemic with very limited annotated chest X-rays. Obviously, we cannot proceed in the same way as training traditional deep-learning methods, due to the very sparse labelled images for COVID-19. The good point is that our medical experts and clinicians can provide some auxiliary information (textual descriptions) about common features and similarities among the COVID-19-positive chest X-rays to convey their findings. In Figure 3, the side information is provided in the form of "attributes", such as foggy effects, white-spot features, blurred edges, and white/low-intensity pixel dominance in various areas of the chest X-ray images of COVID-19 patients. Our idea behind the utilisation of ZSL models is to detect, understand, and recognise new concepts using an existing similar deep-learning-based classifier, plus the integration of auxiliary information. This turns it into a completely new and efficient detector/recogniser or diagnosing system without the requirement of collecting a new dataset and a vast amount of costly and time-consuming labelling, especially when a speedy solution is crucial and life-saving, such as in the recent global pandemic.

Figure 2 auxiliary information examples. (a) Concept cars: "The body of the car has a singular and unified shape with smoother curves. The wheels' colour, curves, and design match the body as a singular integrated piece. LED lights are omnipresent all around the car." (b) COVID-19 X-ray: "Bilateral multifocal patchy GGOs and consolidation can be seen. Edges are blurred and the intensity sharpness of both lungs has decreased."

In this research we make four main contributions, as follows:
• We propose to categorise the reviewed approaches based on the embedding spaces that each model uses to learn/infer unseen objects/concepts, as well as describing the variations of the data embedding inside those embedding spaces (Figure 3 and Table 1).
• We evaluate the performance of the state-of-the-art models on famous benchmark datasets (Tables 3-5, Fig. 4). To the best of our knowledge, we are the first to include the evaluation of data-synthesising methods in the research field of applied Zero-shot learning.
• We study the motivation behind leveraging each space as a way to solve the ZSL challenge by reviewing current issues and their solutions.
• We provide sufficient technical justifications to support the idea of using the proposed ZSL model as one of the best practices for COVID-19 diagnosis and other similar applications.
The rest of the article is organised as follows.
In Section 2, we introduce the problems of few-shot, one-shot and Zero-shot learning. In Section 3, we discuss the test and training phases of Zero-shot learning and generalised Zero-shot learning systems. Section 4 presents the embedding approaches, followed by the evaluation protocols in Section 5. In Section 6, we analyse the outcome of the experiments performed on different state-of-the-art methodologies. Further discussion about the applications of ZSL is presented in Section 7. In Section 8, we discuss the outcome of this research, and finally, the concluding remarks are given in Section 9.

Few-shot learning (FSL) is the challenge of learning novel classes with a tiny training dataset of one or a few images per category. FSL is closely related to knowledge transfer, where a model previously trained on large data is used for a similar task with less training data. The more accurate the transferred knowledge is, the better FSL will generalise. Moreover, many approaches employ meta-learning to address the challenge of few-shot or few-example learning [156], [64]. The main challenge is to improve the generalisation ability, as such models often face the overfitting problem. In this type of learning, there is an auxiliary dataset that contains N classes, each having K annotated samples of the new examples in the training phase. This makes the problem an N-way-K-shot classification over the support set

D_support = {(x_i, y_i)}_{i=1}^{N_t},   (1)

where x_i is the i-th training example and y_i is its corresponding label. N_t = K × N denotes the total number of support examples, with N categories and K examples per category; few-shot learning has K > 1 samples (a minimal episode-sampling sketch is given at the end of this subsection). Among the relevant research works, [163] use the features shared among classes to compensate for the requirement for large data, and follow a learning procedure based on boosted decision stumps. HDP-DBM [141] develops a compound of a deep Boltzmann machine and a hierarchical Dirichlet process to learn abstract knowledge at different hierarchies of the concept categories. [156] proposes prototypical networks that compute the Euclidean distance between prototype representations of each class. It was not until recently that few-shot learning was introduced in computer-aided diagnosis. For the first time, the idea of using additional information (attributes) in FSL was introduced in [165]. [121] proposes a model to classify skin lesions. [68] use FSL for glaucoma diagnosis from fundus images. [127] study the problem of chest X-ray classification of five symptoms, including consolidation. In the case of one-shot learning, there is only K = 1 example per class in the support set, thus it faces more challenges in comparison to FSL. The Bayesian Program Learning (BPL) framework [77] presents each concept of handwritten characters as a simple probabilistic program. [14] proposes a cross-generalisation algorithm: it replaces the features from the previously learned classes with similar features of the novel classes to adapt to the target task. In Bayesian learning, [41] depicts prior knowledge in the form of a probability density function on the parameters of the model, and updates it to compute the posterior model. Matching Nets (MN) [172] use non-parametric attentional memory mechanisms and "episodes" during training. [25] capture salient features of general lung datasets using an encoder and augment multiple views of the images, then use a prototypical network for 2-way, 1-shot classification.
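To make the N-way-K-shot setting of Eq. (1) concrete, the following minimal Python sketch samples one training "episode" (a support set of N × K labelled examples plus a query set) from a pool of labelled data, in the spirit of episode-based meta-learning such as matching or prototypical networks; the toy data and function names are illustrative assumptions rather than code from any cited work.

```python
# Minimal N-way-K-shot episode sampler (illustrative sketch, toy data).
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=5):
    """dataset: list of (feature, label) pairs; returns a support set and a query set."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    # Pick N classes, then K support and Q query examples per class.
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for c in classes:
        examples = random.sample(by_class[c], k_shot + q_queries)
        support += [(x, c) for x in examples[:k_shot]]
        query += [(x, c) for x in examples[k_shot:]]
    return support, query  # |support| = N * K = N_t, as in Eq. (1)

# Toy usage: 20 classes with 30 dummy scalar "features" each.
toy_data = [(float(i), c) for c in range(20) for i in range(30)]
s, q = sample_episode(toy_data, n_way=5, k_shot=1, q_queries=5)
print(len(s), len(q))  # 5 and 25
```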
Zero-shot learning is the extreme case of FSL where K = 0. In other words, the difference between the two is the absence of any visual examples of the target classes in the training phase of ZSL, while in few-shot learning the support set contains a few labelled samples of the novel categories. Also, auxiliary information in the form of class embeddings is one of the main components of Zero-shot learning. ZSL approaches might extend their solutions to one-shot or few-shot learning by either updating the training data with one or a few generated samples from augmentation techniques, or by having access to a few of the unseen images during training [145], [199], [147], [5], [59], [17], [171], [23], [164], [189]. [189] and [145] both use auxiliary text-based information. ZSL models can be seen from two points of view in terms of the training and test phases: the classic ZSL and the generalised ZSL (GZSL) settings. In the classic ZSL setting, the model only detects the presence of new classes at the test phase, while in the GZSL setting, the model predicts both unseen and seen classes at test time; hence, GZSL is more applicable to real-world scenarios [94], [75], [210], [86], [145]. The same idea can be applied to FSL to train a generalised model, called generalised few-shot learning (GFSL), which detects both known and novel classes at test time. In the next paragraphs, we discuss two types of training approaches: inductive vs. transductive training.

Inductive training: this setting only uses seen-class information to learn a new concept. The training data for the inductive setting is

D_tr = {(x, y, c(y)) | x ∈ X^S, y ∈ Y^S},

where x represents image features, y is the class label, and c(y) denotes the class embedding. Moreover, X^S and Y^S indicate seen-class images and seen-class labels, respectively. Inductive learning accounts for the majority of the settings used in ZSL and Generalised Zero-Shot Learning (GZSL), e.g. in [7], [43], [113], [137], [52], [204], [22], [207], [90], [171], [189].

Transductive training: although the original idea of zero-shot learning is more related to the inductive setting, in many scenarios the transductive setting is used, where either unlabelled visual or textual information, or both, for the unseen classes is used together with the seen-class data, e.g. in [134], [71], [44], [6], [205], [192], [52], [207], [90], [171], [159], [174], [189], [143]. The training data for transductive learning is

D_tr = {(x, y, c(y)) | x ∈ X^{S∪U}, y ∈ Y^{S∪U}, c(y) ∈ C^{S∪U}},

where X^{S∪U} denotes that images come from the union of seen and unseen classes. Similarly, Y^{S∪U} and C^{S∪U} indicate that the training labels and class embeddings belong to both seen and novel categories. According to [197], any approach that relies on label propagation falls into the category of transductive learning. Feature-generating networks with labelled source data and unlabelled target data [189] are also considered transductive methods. The transductive setting is seen as one of the solutions to the domain shift problem, since the unseen information provided during training reduces the discrepancy between the two domains. There is a slight nuance between transductive learning and semi-supervised learning: in the transductive setting, the unlabelled data solely belong to the unseen test classes, while in the semi-supervised setting, unseen test classes might not be present in the unlabelled data. Furthermore, the difference between FSL and transductive ZSL is the existence of a few labelled examples of the unseen classes alongside annotated seen-class examples in few-shot learning, while in the transductive ZSL setting the examples of the unseen classes are all unlabelled.
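The following tiny Python sketch contrasts what a learner is given under the two training-data definitions above; the class names, embeddings and helper names are toy assumptions used purely for illustration.

```python
# Illustrative sketch of inductive vs. transductive ZSL training data.
seen_classes = {"normal", "asthma"}
unseen_classes = {"covid19"}

# c(y): auxiliary class embeddings exist for ALL classes, seen and unseen.
class_embedding = {
    "normal":  [0.1, 0.0, 0.9],
    "asthma":  [0.7, 0.2, 0.1],
    "covid19": [0.8, 0.9, 0.2],
}

def inductive_train_data(labelled_images):
    """Only labelled seen-class images with their class embeddings."""
    return [(x, y, class_embedding[y]) for x, y in labelled_images if y in seen_classes]

def transductive_train_data(labelled_images, unlabelled_images):
    """Seen-class data plus *unlabelled* images that may depict unseen classes."""
    data = inductive_train_data(labelled_images)
    data += [(x, None, None) for x in unlabelled_images]  # labels withheld
    return data
```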
ZSL models are developed based on two high-level strategies: a) defining the "Embedding Space" used to combine visual and non-visual auxiliary data, and b) choosing an appropriate "Auxiliary Data Collection" technique.

a) Embedding spaces. Figure 3 demonstrates the overall structure of a ZSL system in terms of embedding spaces and auxiliary data collection techniques. Such systems either map the visual data to the semantic space (Figure 3a), embed both visual and semantic data into a common latent space (Figure 3b), or see the task as a missing-data problem and map the semantic information to the visual space (Figure 3c). Two or all of these approaches can also be combined to accumulate the benefits of each individual category. From a different point of view, semantic spaces can also be sub-categorised into Euclidean and non-Euclidean spaces. The intrinsic relationship between data points is better preserved when the geometrical relation between them is considered; such spaces are commonly based on clusters or graph networks. Some researchers therefore prefer manifold learning for the ZSL challenge, e.g. [134], [175], [207], [192], [193], [91], [181], [83], [63], [210]. Euclidean spaces are more conventional and simpler, as the data has a flat representation in such spaces; however, loss of information is a common issue of these spaces as well. Examples of methods using Euclidean spaces are [80], [43], [137], [187], [106], and [145].

b) Auxiliary data collection. As mentioned before, Zero-shot learning is the challenge of learning novel classes without seeing their exemplars during training. Instead, freely available auxiliary information is used to compensate for the lack of visually labelled data. Such information can be categorised into two groups:

Human-annotated attributes. The supervised way of annotating each image with its related attributes is an arduous process and requires time and expertise, but since the attributes are manual, they yield the noiseless and important attributes needed for learning and inference. There are several datasets in which side information in the form of attributes is available for each image, e.g. aPY [40], AWA1 [80], AWA2 [188], CUB [173], and SUN [118]. Several ZSL methods leverage attributes as the side information [7], [137], [97], or visual attributes [79], [40].

Unsupervised auxiliary information. There are several forms of auxiliary information that require minimal supervision and are widely used in the ZSL setting, such as human gazes [66], WordNet, a large-scale lexical database of 117,000 English words [136], [135], [5], [7], [185], [100], [4], [123], [102], [181], [83], and textual descriptions such as web search results [135], Wikipedia articles [43], [113], [37], [7], [84], [4], [123], [38], [112], [211], and sentence descriptions [129]. Textual side information needs to be transformed into class embeddings in order to be used at the training and testing stages. Word embedding and language embedding are the two representation techniques used for textual side information.
As we proceed, we also review the different embedding classes. In this section, we first provide the task definition of ZSL and GZSL. Then we review the four recent categories of approaches to the problem. In the standard inductive setting, as mentioned earlier in Section 3, the training set is S = {(x_n, y_n), n = 1, ..., N}, and the objective function to be minimised is

(1/N) Σ_{n=1}^{N} L(y_n, f(x_n; W)) + Ω(W),

where L is a loss function, Ω(W) is a regulariser, and f(x; W) = argmax_{y∈Y} F(x, y; W) is the mapping function. Through the training phase, the classifier f : X → Y^U is learned for ZSL to predict only the novel classes at test time, or f : X → Y^U ∪ Y^S for the GZSL challenge to estimate both the novel classes and the previously learned seen classes. For instance, the classifier f can be a COVID-19 diagnoser. We categorise the embedding methodologies into four categories based on the space in which they learn/infer the target classes (like COVID-19 detection in Figure 3). The majority of methods focus on general tasks; however, they are scalable to disease classification. Semantic embedding itself can be sub-categorised into the two tasks of attribute classification and label embedding, which are discussed here.

Primitive approaches to Zero-Shot learning leverage manually annotated attributes in a two-stage learning schema: the attributes of an image are predicted in the first stage, and the labels of unseen classes are chosen using similarity measures in the second stage. [79] use a probabilistic classifier to learn the attributes and then estimate posteriors for the test classes. [136] propose a method that avoids manual supervision by mining the attributes in an unsupervised manner. [135] adopt DAP together with hierarchy-based knowledge transfer for large-scale settings. The method of [65] is based on IAP and uses Self-Organising and Incremental Neural Networks (SOINN) to learn and update attributes online; later, in IAP-SS, [65] use an online incremental learning approach for faster learning of the new attributes. Direct Attribute Prediction (DAP) [80] first learns the posteriors of the attributes, then estimates the posteriors of the test classes. On the other hand, Indirect Attribute Prediction (IAP) [80] first learns the posteriors of the seen classes and then uses them to compute the posteriors of the attributes. [179] use a unified probabilistic model based on the Bayesian Network (BN) [110] that discovers and captures both object-dependent and object-independent relationships to overcome the problem of relating the attributes. CONSE [113] learns the probabilities of the training samples and then predicts an unseen class by a convex combination of the class label embedding vectors. [59] use a random forest approach for learning more discriminative attributes. Hierarchy and Exclusion (HEX) [31] considers relations between objects and attributes and maps the visual features [161], [130] of the images to a set of scores to estimate labels for unseen categories. [8] take an unsupervised approach in which they capture the relations between the classes and attributes with a three-dimensional tensor while using a DAP-based scoring function to infer the labels. LAGO by [12] also follows the DAP model: it learns soft and-or logical relations between attributes. Using soft-OR, the attributes are divided into groups, and the label of an unseen sample is predicted via a soft-AND within these groups; if each attribute comes from a singleton group, all-AND is used. A minimal DAP-style sketch is given below.
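The sketch below illustrates the two-stage, DAP-style inference just described, under simplifying assumptions (independent attributes, toy binary signatures, no attribute priors); it is an illustrative re-implementation, not the code of [80].

```python
# Two-stage attribute-based zero-shot prediction (DAP-style sketch, toy values).
import numpy as np

# Binary attribute signatures of unseen classes (one row per class).
unseen_signatures = {
    "covid19": np.array([1, 1, 0, 1]),   # e.g. GGO, consolidation, ..., blurred edges
    "asthma":  np.array([0, 0, 1, 0]),
}

def dap_predict(attribute_probs):
    """attribute_probs: p(a_m = 1 | x) for each attribute, e.g. from M attribute classifiers."""
    scores = {}
    for cls, sig in unseen_signatures.items():
        # Likelihood of the class signature under independent attribute predictions.
        p = np.where(sig == 1, attribute_probs, 1.0 - attribute_probs)
        scores[cls] = float(np.prod(p))
    return max(scores, key=scores.get), scores

pred, scores = dap_predict(np.array([0.9, 0.8, 0.2, 0.7]))
print(pred, scores)   # "covid19" obtains the higher likelihood here
```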
Instead of using an intermediate step, more recent approaches learn to map images to the structured Euclidean semantic space automatically, which is an implicit way of representing knowledge. In the simplest case the compatibility function is linear in a parameter vector w to be learned; in the more common bilinear case, w takes the form of a matrix W and the compatibility is

F(x, y; W) = θ(x)^T W φ(y),

where θ(x) is the image embedding and φ(y) is the class embedding. SOC [114] first maps the image features to the semantic embedding space and then estimates the correct class using a nearest-neighbour search. DeViSE [43] uses a linear corresponding function with a combination of the dot-product similarity and the hinge rank loss used in [183]. ALE [6] optimises the ranking loss of [167] alongside the bilinear compatibility function. SJE [7] learns a bilinear compatibility function using the structural SVM objective function [166]. ESZSL [137] introduces a better regulariser and optimises a closed-form-solution objective function in a linear manner. ZSLNS [123] proposes an l_{1,2}-norm-based loss function. [17] take a metric learning approach and linearly embed the visual features into the attribute space. LAGO [12] is a probabilistic model that depicts soft and-or relations between groups of attributes; in the case where all attributes form an all-OR group, it becomes similar to ESZSL [137] and learns a bilinear compatibility function. AREN [190] uses attentive region embedding while learning the bilinear mapping to the semantic space in order to enhance the semantic transfer. ZSLPP [38] combines two networks: VPDE-net, which detects bird parts from images, and PZSC-net, which trains a part-based Zero-Shot classifier from the noisy text of Wikipedia. DSRL [197] uses non-negative sparse matrix factorisation to align vector representations with the attribute-based label representation vectors so that more relevant visual features are passed to the semantic space. Some approaches to ZSL use non-linear compatibility functions. CMT [157] uses a two-layer neural network, similar to common MLP networks [131], alongside the compatibility function. In UDA [71] a non-linear projection from the feature space to the semantic space (word vectors and attributes) is proposed for an unsupervised domain adaptation problem based on regularised sparse coding. [84] use deep neural network [161] regression to generate pseudo attributes for each visual category via Wikipedia. LATEM [185] constructs a piece-wise non-linear compatibility function alongside a ranking loss. [23] regularise the model using the structural relations of the clusters, whereby cluster centres characterise the visual features. QFSL [159] solves the problem in a transductive setting and projects both source and target images onto several specified points to fight the bias problem. GFZSL [171] uses both linear and non-linear regression models and generates a probability distribution for each class; for the transductive setting, it uses Expectation-Maximisation (EM) to estimate a Gaussian Mixture Model (GMM) of the unlabelled data in an iterative manner. A minimal sketch of a bilinear compatibility model trained with a ranking loss follows.
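The following minimal numpy sketch illustrates the bilinear compatibility F(x, y; W) = θ(x)^T W φ(y) together with one pairwise hinge-ranking update in the spirit of DeViSE/ALE; the dimensions, learning rate and margin are illustrative assumptions, not values from the cited papers.

```python
# Bilinear compatibility with a pairwise hinge ranking update (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d_img, d_cls, n_cls = 2048, 85, 40            # e.g. ResNet features, attribute vectors
W = rng.normal(scale=0.01, size=(d_img, d_cls))
phi = rng.normal(size=(n_cls, d_cls))         # class embeddings phi(y)

def scores(theta_x):                          # compatibility F(x, y; W) for every class y
    return theta_x @ W @ phi.T

def ranking_update(theta_x, y_true, lr=1e-3, margin=0.1):
    """One SGD step of the pairwise hinge ranking loss."""
    global W
    s = scores(theta_x)
    for y in range(n_cls):
        if y == y_true:
            continue
        if margin + s[y] - s[y_true] > 0:     # violating class: push it away from y_true
            grad = np.outer(theta_x, phi[y] - phi[y_true])
            W -= lr * grad

x = rng.normal(size=d_img)                    # a toy image embedding theta(x)
ranking_update(x, y_true=3)
print(int(np.argmax(scores(x))))              # prediction = argmax_y F(x, y; W)
```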
Leveraging non-Euclidean spaces to capture the manifold structure of the data is another approach to the problem. Together with knowledge graphs, the explicit relations between the labels can be exploited; in this setting, the side information mainly comes from a hierarchical ontology such as WordNet. The mapping function takes the graph-convolution form

f(X, A) = σ(D̂^{-1/2} Â D̂^{-1/2} X W),

where X is the n × k feature matrix, A is the adjacency matrix of the graph (with Â its self-connected version and D̂ the corresponding degree matrix), W the trainable weights, and σ a non-linearity. Propagated Semantic Transfer (PST) [134] first uses the DAP model to transfer knowledge to novel categories and then, following a graph-based learning schema, improves the local neighbourhood of these categories. DMaP [91] jointly optimises the projection of the visual features and the semantic space to improve the transferability of the visual features to the semantic space manifold. MFMR [193] decomposes the visual feature matrix into three matrices to further facilitate the mapping of visual features to the semantic spaces; to improve the representation of the geometrical manifold structure of the visual and semantic features, manifold regularisation is used. In [83] a Graph Search Neural Network (GSNN) [102] is used in the semantic space, based on the WordNet knowledge graph, to predict multiple labels per image using the relations between them. [181] distil auxiliary information in the forms of word embeddings and a knowledge graph to learn novel categories. DGP [63] proposes dense graph propagation to propagate knowledge directly through dense connections. In [210] a graphical model with a low-dimensional visually semantic space is utilised, which has a chain-like structure to close the gap between the high-dimensional features and the semantic domain.

Another family of methods measures the similarity between the visual and semantic features in a joint space. Considering unseen classes as a fusion of previously learned seen concepts is called hybrid learning; a standard scoring function for hybrid models can be written as

f(x) = argmax_{u ∈ U} s( Σ_{c ∈ S} p(c | x) φ(c), φ(u) ),

where p(c | x) are the seen-class posteriors, φ(·) the class embeddings, and s(·,·) a similarity measure such as the cosine. SSE [204] considers the histogram similarity between the seen-class auxiliary information and the seen visual data. SYNC [22] uses two spaces, a semantic space and a model space, and the alignment between them is conducted with phantom classes; the final classifier is learned as a sparse linear combination of the classifiers of the phantom classes. TVSE [192] learns a latent space using collective matrix factorisation with graph regularisation to incorporate the manifold structure between source and target instances; moreover, it represents each sample as a mixture of seen-class scores. LDF [93] combines the prototypes of seen classes and jointly learns embeddings for both user-defined attributes and latent attributes. A minimal sketch of this fusion-style scoring is given below.
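Below is a minimal numpy sketch of such fusion-based scoring in the spirit of CONSE [113]: the image is represented as a convex combination of seen-class embeddings weighted by classifier posteriors and matched to unseen-class embeddings by cosine similarity. All vectors and dimensions are toy assumptions.

```python
# Fusion-based (hybrid) zero-shot scoring via a convex combination of seen-class embeddings.
import numpy as np

def fusion_predict(seen_probs, seen_embeddings, unseen_embeddings):
    """seen_probs: p(c|x) over seen classes; *_embeddings: one row per class phi(c)."""
    z = seen_probs @ seen_embeddings                       # convex combination of phi(c)
    z /= np.linalg.norm(z)
    sims = unseen_embeddings @ z / np.linalg.norm(unseen_embeddings, axis=1)
    return int(np.argmax(sims)), sims                      # best-matching unseen class

rng = np.random.default_rng(1)
seen_emb, unseen_emb = rng.normal(size=(5, 16)), rng.normal(size=(3, 16))
p_seen = np.array([0.7, 0.1, 0.1, 0.05, 0.05])             # toy seen-class posteriors
print(fusion_predict(p_seen, seen_emb, unseen_emb))
```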
Inferring unseen labels by measuring the similarity between cross-modal data in a shared latent space is another workaround for the ZSL challenge. The first term in the objective function of standard cross-modal alignment approaches is a least-squares alignment term of the form

‖ θ(X)^T W − Y ‖²_F,

with Y being the one-hot matrix of the corresponding class labels and ‖·‖²_F the Frobenius norm. Approaches to joint-space learning are grouped into two categories: parametric approaches, which follow slow learning by optimising a problem, and non-parametric approaches, which leverage data points extracted from neural networks in a shared space. Among the parametric methods, [44] propose a multi-view alignment space for embedding low-level visual features; the learning procedure is based on multi-view Canonical Correlation Analysis (CCA) [47]. [100] apply PCA and ICA embeddings to reveal the visual similarity across the classes and obtain the semantic similarity with the WordNet graph, followed by embedding the two outputs into a common space. MCZSL [4] uses visual part and multi-cue language embeddings in a joint space. In [108] both images and words are represented by Gaussian distribution embeddings. JLSE [205] adopts a dictionary learning approach to learn the parameters of the source and target domains across two separate latent spaces, where the similarity is computed by a likelihood measure independent of the class label. CDL [61] uses a coupled dictionary to align the structure of the visual-semantic space using discriminative information from the visual space. In [73] and [138] a coupled sparse dictionary is leveraged to relate visual and attribute features, and entropy regularisation is used to alleviate the domain shift problem. There are also several non-parametric methods. ReViSE [164] combines auto-encoders with a Maximum Mean Discrepancy (MMD) loss [49] in order to align the visual and textual features. DMAE [109] introduces a latent alignment matrix with representations from auto-encoders optimised by kernel target alignment (KTA) [29] and squared-loss mutual information (SMI) [195]. DCN [94] proposes a novel Deep Calibration Network in which an entropy-minimisation principle is used to calibrate the uncertainty of unseen classes as well as seen classes. To narrow the semantic gap, BiDiLEL [176] introduces a sequential bidirectional learning strategy and creates a latent space from the visual data; the semantic representations of the unseen classes are then embedded in the previously created latent space. This method comprises both parametric and non-parametric models.

Visual embedding is the other type of ZSL method; it performs classification in the original feature space and is orthogonal to semantic-space projection. This is done by learning a linear or non-linear projection function. For linear corresponding functions, WAC-Linear [37] uses textual descriptions of the seen and unseen categories and projects them to the visual feature space with a linear classifier. [207] follows a transductive setting in which it refines the unseen data distributions using unseen image data; to approximate the manifold structure of the data, a global linear mapping is used to synthesise virtual cluster centres. [52] assigns pseudo labels to samples using reliability (with a robust SVM) and diversity (via diversity regularisation). For learning a non-linear corresponding function, WAC-Kernel [36] proposes a kernel method, able to leverage any kind of side information, that predicts a kernel based on the representer theorem [144]. DEM [202] uses a least-squares embedding loss to minimise the discrepancy between the visual features and their class representation embedding vectors in the visual feature space. OSVE [96] reversely maps from the attribute space to the visual space and then trains the classifier using an SVM [11]. In [60] the authors introduce a stacked attention network that incorporates both global and local visual features, weighted by relevance, along with the semantic features. In [174] a visual constraint is applied to the class centres in the visual space to avoid the domain shift problem. There are a variety of generative networks that augment unseen data. Taking the GAN [48] as an example, the adversarial term in the objective function would be

E[log D(x, c(y))] + E[log(1 − D(x̃, c(y)))],

where x̃ = G(z, c(y)) is the data synthesised by the generator and z ∈ R^{d_z} is random Gaussian noise. The roles of the discriminator D and the generator G are opposed in the loss function, as the former attempts to maximise this term while the latter tries to minimise it.
Another widely used generative neural network is the Variational AutoEncoder (VAE) [69], whose objective is

L_VAE = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL( q_φ(z|x) ‖ p(z) ).

The first term is the reconstruction loss, and the latter is the Kullback-Leibler divergence, which works as a regulariser. RKT [175] leverages relational knowledge of the manifold structure in the semantic space and generates virtually labelled data for the unseen classes from Gaussian distributions obtained by sparse coding; it then projects them, alongside the seen data, to the semantic space via a linear mapping. GLaP [90] generates virtual instances of an unseen class under the assumption that each representation obeys a prior distribution from which one can draw samples. To ease the embedding into the semantic space, GANZrl [162] proposes to increase the visual diversity by generating samples with specified semantics using GAN models. SE-GZSL [75] uses a feedback-driven mechanism for its discriminator, which learns to map the produced images to the corresponding class attribute vectors; to enforce the similarity between the distributions of the real and generated samples, a loss component is added to the VAE objective [69]. Synthesised images often look unrealistic since they lack intricate details; a way around this issue is to generate features instead. [18] use a GMMN model [89] to generate visual features for unseen classes. In [42] a multi-modal cycle-consistency loss is used in training the generator for better reconstruction of the original semantic features. CVAE-ZSL [106] takes attributes and generates features for the unseen categories via a Conditional Variational AutoEncoder (CVAE) [158]; the L2 norm is used as the reconstruction loss. GAZSL [211] utilises noisy textual descriptions from Wikipedia to generate visual features, and a visual pivot regulariser is introduced to help generate features of better quality. f-CLSWGAN [187] combines three conditional GAN variants for better data generation. f-VAEGAN-D2 [189] combines the architectures of a conditional VAE [158], a GAN [48] and a non-conditional discriminator for the transductive setting. LisGAN [87] generates unseen features from random noise using conditional Wasserstein GANs [9]; for regularisation, semantically meaningful "soul samples" are introduced for each class and the generated features are forced to be close to at least one of them. The Gradient Matching Network (GMN) [143] trains an improved version of the conditional WGAN [51] to produce image features for the novel classes and introduces a Gradient Matching (GM) loss to improve the quality of the synthesised features. In order to synthesise unseen features, SPF-GZSL [86] selects similar instances and combines them to form pseudo features using a centre loss function [182]. In Don't Even Look Once (DELO) [209], a detection algorithm is used to synthesise unseen visual features so as to obtain high-confidence predictions for unseen concepts while maintaining low confidence for backgrounds with vanilla detectors. The overall feature-generating pipeline shared by these methods is sketched below.
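A minimal, assumption-laden Python sketch of that pipeline: a conditional generator G(z, c(y)) synthesises visual features for unseen classes from their attribute vectors, after which an ordinary supervised classifier is trained on the synthetic features, as in methods such as f-CLSWGAN [187] or CVAE-ZSL [106]. The "generator" below is a placeholder linear map with Gaussian noise standing in for a trained GAN/VAE, and all names and dimensions are illustrative.

```python
# Feature-generating ZSL pipeline (illustrative sketch with a placeholder generator).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_attr, d_feat = 85, 256

# Placeholder for a trained conditional generator G(z, c(y)).
A = rng.normal(scale=0.5, size=(d_attr, d_feat))
def generate_features(class_attr, n_samples=200, noise=0.3):
    z = rng.normal(scale=noise, size=(n_samples, d_feat))
    return class_attr @ A + z                       # synthetic visual features

# Attribute vectors c(y) for two unseen classes (toy values).
unseen_attrs = {0: rng.uniform(size=d_attr), 1: rng.uniform(size=d_attr)}

X, y = [], []
for label, attr in unseen_attrs.items():
    feats = generate_features(attr)
    X.append(feats)
    y += [label] * len(feats)
X = np.vstack(X)

clf = LogisticRegression(max_iter=1000).fit(X, np.array(y))   # standard supervised classifier
test_feat = generate_features(unseen_attrs[1], n_samples=1)   # stands in for a real test feature
print(clf.predict(test_feat))                                  # expected to predict class 1
```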
Instead of augmenting data using synthesising methods, data can also be acquired by gathering web images. [112] jointly use web data, considered as weakly-supervised categories, alongside the fully-supervised auxiliary labelled categories, and then learn a dictionary over the two categories.

Several works make use of both visual and semantic projections to reconstruct better semantics and confront the domain shift issue by alleviating the contradiction between the two domains. The Semantic AutoEncoder (SAE) [72] adds a visual feature reconstruction constraint; it combines a linear visual-to-semantic mapping (encoder) and a linear semantic-to-visual mapping (decoder). SP-AEN [24] is a supervised Adversarial AutoEncoder [101] which improves the preservation of semantics by reconstructing the images from the raw 256×256×3 RGB colour space. BSR [153] uses two different semantic reconstructing regressors to reconstruct the generated samples into semantic descriptions. CANZSL [26] combines feature synthesis with semantic embedding by using a GAN for generating visual features and an inverse GAN to project them into the semantic space; in this way, the produced features are consistent with their corresponding semantics. Some of the synthesising approaches utilise a common latent space to align the generated feature space with the semantic space and facilitate capturing the relations between the two. [97] introduce a latent-structure-preserving space in which features synthesised from given attributes suffer less from bias and variance decay with the help of Diffusion Regularisation. CADA-VAE [145] generates a latent space in which both visual and semantic features are embedded by a VAE [69]; it uses a Distribution Alignment (DA) loss and a Cross-Alignment (CA) loss to align the cross-modal latent distributions. GDAN [58] combines all three approaches and designs a dual adversarial loss; in this way, the regressor and the discriminator learn from each other. A summary of the different approaches is reported in Table 1.

Table 1. Common ZSL and GZSL methods categorised based on their embedding space model, with further divisions in a top-down manner.
• Semantic Embedding
  – Two-step learning (attribute classifiers): DAP-based [79], [136], [135], [80], [8], [12]; IAP-based [79], [65], [80], [113]; Bayesian Network (BN) [179]; Random Forest model [59]; HEX graph [31]
  – Direct learning (compatibility functions): Linear [114], [43], [6], [7], [137], [123], [17], [12], [190], [38], [197], [171]; Non-linear [71], [84], [185], [23], [159], [171]
  – Explicit knowledge representation: Graph Convolutional Networks (GCN) [181]; Knowledge graphs [83], [134], [91], [63]; 3-node chains [210]; Matrix tri-factorisation with manifold regularisation [193]
• Cross-Modal Latent Embedding
  – Fusion-based models (fusion of seen-class data): Combination of seen-class properties [204], [22], [93]; Combination of seen-class scores [192]
  – Common representation space models (mapping of the visual and semantic spaces into a joint intermediate space): Parametric [44], [100], [4], [108], [205], [61], [73], [138]; Non-parametric [164], [109], [94]; Both [176]
• Visual Embedding
  – Learning the semantic-to-visual projection (projection functions): Linear [37], [207], [52]; Non-linear [36], [202], [96], [60], [174]
  – Data augmentation: Gaussian distribution [175], [90]; GAN [162]; VAE [75]
  – Visual feature generation: GAN [42], [211], [87]; WGAN [187], [143]; CVAE [106], [209]; VAE+GAN [189]; GMMN [18]; Similar feature combination [86]
  – Leveraging web data (web image crawling): Dictionary learning [112]
• Hybrid
  – Visual + semantic embedding: AutoEncoder [72]; Adversarial AutoEncoder [24]; GAN with two reconstructing regressors [153]; GAN and an inverse GAN [26]
  – Visual + cross-modal embedding (feature generation with aligned semantic features): Semantic-to-visual mapping [97]; VAE [145]
  – All: Generator and discriminator together with a regressor (GAN + dual learning) [58]
The number of methods is growing with time, and we can see that some areas, such as direct learning, common-space learning and visual data synthesising, are more popular for solving the task, while models that combine different approaches are fairly new and thus have fewer reported works here.

In this section, we review some of the standard evaluation techniques used to analyse the performance of ZSL techniques on the common benchmark datasets in the field, in terms of dataset splits, class embeddings, image embeddings, and various evaluation metrics. First, the benchmark datasets. There are several well-known benchmark datasets for Zero-shot learning that are frequently used. North America Birds (NAB) [168] is a fine-grained dataset of birds consisting of 1,011 classes and 48,562 images; images are categorised based on their visual attributes. A new version of this dataset is proposed by [38], in which identical leaf nodes whose only difference was gender are merged into their parent nodes, resulting in a final 404 classes.

Attribute datasets. SUN Attribute [118] is a medium-scale and fine-grained attribute dataset consisting of 102 attributes, 717 categories and a total of 14,340 images of different scenes. CUB-200-2011 Birds (CUB) [173] is a 200-category fine-grained attribute dataset with 11,788 images of bird species and 312 attributes. Animals with Attributes (AWA1) [80] is another attribute dataset of 30,475 images with 50 categories and 85 attributes; the images in this dataset are licensed and not publicly available. Later, Animals with Attributes 2 (AWA2) was presented by [188], a free version of AWA1 with more images than its predecessor (37,322 images), the same number of classes and attributes, but different images. aPascal and Yahoo (aPY) [40] is a dataset with a combination of 32 classes, including 20 Pascal and 12 Yahoo attribute classes, with 15,339 images and 64 attributes in total. A summary of the statistics of the attribute datasets is gathered in Table 2.

ImageNet. ImageNet [32] is a large-scale dataset that contains 14 million images shared between 21K categories, each image having one label, which makes it a popular benchmark for evaluating models in real-world scenarios. Its organisation is based on the WordNet hierarchy [105]. ImageNet is imbalanced between classes, as the number of samples in each class varies greatly, and it is partially fine-grained. A more balanced version has 1K classes with 1,000 images in each category.

There are several approaches in the FSL setting for COVID-19 diagnosis; however, ZSL is still new in the field of disease recognition, so we introduce a dataset suited to the ZSL/GZSL task that contains the required images and textual descriptions in one place. [27] is a small public dataset of CXR and CT scans suitable for ZSL and few-shot learning experiments. At the time of this research, it had 444 unique clinical notes for a total of 16 categories, from no finding (normal cases) to other pneumonic cases like COVID-19, MERS, and SARS. A sketch of preparing this dataset for a zero-shot split is given below.
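The following minimal pandas sketch shows how such a dataset could be arranged into seen and unseen (zero-shot) classes. The file name metadata.csv and the column and label names ("finding", "view", "COVID-19", etc.) reflect the public covid-chestxray-dataset repository at the time of writing and should be treated as assumptions to verify against the release you download.

```python
# Preparing the COVID-ChestXRay data [27] for a ZSL/GZSL split (assumed file/column names).
import pandas as pd

meta = pd.read_csv("covid-chestxray-dataset/metadata.csv")
meta = meta[meta["view"].isin(["PA", "AP"])]          # keep frontal CXR projections only

seen_findings   = {"No Finding", "SARS", "MERS"}      # classes assumed seen during training
unseen_findings = {"COVID-19"}                        # held out entirely (zero-shot)

seen_df   = meta[meta["finding"].isin(seen_findings)]
unseen_df = meta[meta["finding"].isin(unseen_findings)]
print(len(seen_df), "seen images,", len(unseen_df), "unseen (test-only) images")
```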
Here we discuss the original splits of the datasets as well as the other splits proposed for the Zero-shot problem. In ZSL problems, unseen classes should be disjoint from seen classes and test-time samples should be limited to unseen classes; the original splits aim to follow this setting. For SUN, [118] proposed to use 645 of the classes for training, among which 580 classes are used for training and 65 for validation, while the remaining 72 classes are used for testing. For CUB, [6] introduces the split of 150 training classes (including 50 validation classes) and 50 test classes. As for AWA1, [80] introduced the standard split of 40 classes for training (13 of them validation classes) and 10 classes for testing; the same splits are used for AWA2. In aPY, the 20 classes of Pascal are used for training (15 classes for training and 5 for validation), while the 12 classes of Yahoo are used for testing.

Proposed Splits (PS). Some of the standard-split test images of SUN, CUB, AWA1 and aPY overlap with the ImageNet images used to pre-train the ResNet-101 model. To solve this problem, the Proposed Splits (PS) were introduced by [186], in which no test images are contained in the ImageNet-1K dataset.

ImageNet. [186] proposes 9 ZSL splits for the ImageNet dataset, two of which evaluate the semantic hierarchy at distance-wise scales of 2-hops (1,509 classes) and 3-hops (7,678 classes) from the 1K training classes. The remaining splits consider the imbalanced size of the classes with increasing granularity, from the 500, 1K and 5K least-populated classes to the 500, 1K and 5K most-populated classes, plus "All", which denotes a subset of 20K other classes for testing. To measure the relatedness of seen samples to unseen classes, [38] introduces two splits: Super-Category-Shared (SCS) and Super-Category-Exclusive (SCE). SCS is the easy split, since it considers the relatedness to the parent category, while SCE is harder and measures the closeness of an unseen sample to a particular child node.

There exist several types of class embeddings, each suitable for a specific scenario. Class embeddings are vectors of real numbers that can be used to make predictions based on the similarity between them, and they can be obtained in three ways: attributes, word embeddings, and hierarchical ontologies. The last two are obtained in an unsupervised manner and thus do not require human labour. Human-annotated attributes are produced under the supervision of experts with a great amount of effort. Binary, relative and real-valued attributes are the three types of attribute embeddings. Binary attributes depict the presence of an attribute in an image and thus take the value 0 or 1; they are the easiest type and are provided in the benchmark attribute datasets AWA1, AWA2, CUB, SUN and aPY. Relative attributes [115], on the other hand, show the strength of an attribute in a given image compared to other images. Real-valued attributes are in continuous form and thus have the best quality [7]; in the SUN attribute dataset [118], confidence is achieved by averaging the binary labels from multiple annotators.

Word embeddings are also known as textual corpora embeddings. Bag of Words (BOW) [54] is a one-hot encoding approach: it simply counts the occurrences of words in a representation called a bag and is negligent of word order and grammar. One-hot encoding approaches have the drawback of giving stop words (like "a", "the" and "of") high relevancy counts. Later, Term Frequency-Inverse Document Frequency (TF-IDF) [142] used term weighting to alleviate this problem by filtering the stop words and keeping meaningful words; a minimal TF-IDF class-embedding sketch is given below.
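The sketch below shows how TF-IDF class embeddings could be built from per-class textual descriptions (e.g. Wikipedia articles), the representation adopted later in this paper for the seen-unseen relatedness experiments on CUB and NAB; the example documents are illustrative stand-ins, not the actual corpora.

```python
# Building TF-IDF class embeddings from per-class textual descriptions (toy documents).
from sklearn.feature_extraction.text import TfidfVectorizer

class_documents = {
    "painted_bunting": "A brightly coloured songbird with a blue head and red underparts ...",
    "indigo_bunting":  "A small seed-eating bird; the breeding male is almost entirely blue ...",
}

vectorizer = TfidfVectorizer(stop_words="english")      # stop words filtered, as in TF-IDF
tfidf = vectorizer.fit_transform(class_documents.values())
class_embeddings = dict(zip(class_documents.keys(), tfidf.toarray()))
print({name: vec.shape for name, vec in class_embeddings.items()})
```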
Word2Vec [103] is a widely used two-layer neural embedding model with two variants, CBOW and skip-gram. CBOW predicts a target word in the centre of a context using its surrounding words, while the skip-gram model predicts the surrounding words using a target word. CBOW is faster to train and usually results in better accuracy for frequent words, while skip-gram is preferred for rare words and works well with sparse training data. Global Vectors (GloVe) [119] is trained on Wikipedia; it combines local context-window methods and global matrix factorisation, and learns the word embeddings by considering global word-word co-occurrence statistics. WordNet [105] is a large-scale public lexical database of 117,000 synsets. Synsets are groups of semantically related words, i.e. synonyms, homonyms and meronyms of English words, organised in a graph structure using hierarchy distances; thus, approaches based on knowledge graphs often follow WordNet to measure the similarity between word meanings [136], [135], [5], [7], [185], [100], [4], [123], [102], [181], [83].

In general ZSL scenarios, word-by-word representations are considered; however, with the advent of transfer learning in natural language processing (NLP) and the introduction of contextual word embeddings, the boundaries of the capabilities of the embeddings have been pushed further. Unlike traditional word embeddings, language models can capture the meaning of words based on the context in which they appear. Several contextual representations have been introduced recently and have shown great results; these pre-trained models can be fine-tuned on various ZSL tasks. ELMo [120] is a contextual embedding model that learns its representations using morphological clues together with a deep bidirectional language model (biLM). Bidirectional Encoder Representations from Transformers (BERT) [33] is a multi-layer bidirectional Transformer encoder [170] trained on the BooksCorpus [212] dataset and English Wikipedia; it outperforms ELMo by having more parameters and layers. The pre-trained BERT model can be fine-tuned with just one additional output layer; however, BERT suffers from a fine-tuning discrepancy due to ignoring the relations between the masked positions. XLNet [196] uses an autoregressive model to overcome this shortcoming of BERT; in addition to the datasets used by BERT, XLNet pre-trains the model on Giga5 [116], ClueWeb 2012-B extended by [20], and Common Crawl. ALBERT [81] allows increasing the model size while lowering memory usage through two parameter-reduction techniques: a factorised embedding parameterisation and cross-layer parameter sharing. These two techniques result in lower memory usage and higher training speed than BERT; the data used for pre-training are the same as for XLNet. In this article, we report the results of ZSL and GZSL using the same class embeddings as [186], that is, Word2Vec trained on Wikipedia for ImageNet and per-class attributes for the attribute datasets; for the seen-unseen relatedness task, we follow [38] and consider TF-IDF for the CUB and NAB datasets.

Existing models use either shallow or deep feature representations. Examples of shallow features are SIFT [99], PHOG [16], SURF [15] and local self-similarity histograms [148]; among these, SIFT is the most commonly used feature in ZSL models such as [6], [22] and [44]. Deep features are obtained from deep CNN architectures [161] and contain higher-level features; a minimal feature-extraction sketch is given below.
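A minimal sketch (assuming PyTorch and torchvision are available) of extracting such deep features, here the 2,048-dim pooled output of a ResNet-101 pre-trained on ImageNet-1K and used without fine-tuning, as in the experiments reported below; the preprocessing values are standard torchvision defaults rather than settings prescribed by this paper.

```python
# Extracting 2048-dim ResNet-101 image embeddings (illustrative sketch).
import torch
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet101(pretrained=True)
backbone.fc = torch.nn.Identity()          # drop the classifier, keep the pooled features
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze(0)    # shape: (2048,)
```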
The extracted features are typically one of the following: the 4,096-dim top-layer hidden unit activations (fc7) of AlexNet [74]; the 1,000-dim last fully-connected layer (fc8) of VGG-16 [155]; the 4,096-dim fc6 and 4,096-dim fc7 features of VGG-19 [155]; the 1,024-dim top-layer pooling units of GoogLeNet [160]; and the 2,048-dim last-layer pooling units of ResNet-101 [55]. In this paper, we consider the ResNet-101 network pre-trained on ImageNet-1K without any fine-tuning, which is the same image embedding used in [186]. Features are extracted from the whole images of SUN, CUB, AWA1, AWA2 and ImageNet, and from the cropped bounding boxes of aPY. For the seen-unseen relatedness task, VGG-16 is used for CUB and NAB, as proposed in [38].

Common evaluation criteria used for the ZSL challenge are the following.

Classification accuracy. One of the simplest metrics is classification accuracy, which measures the ratio of the number of correct predictions to the number of samples in class c. However, it results in a bias towards the populated classes.

Average per-class accuracy. To reduce the bias towards populated classes, the average per-class accuracy is computed by averaging the per-class ratios of correct predictions over the set of classes Y:

acc_Y = (1/|Y|) Σ_{y ∈ Y} (#correct predictions in class y / #samples in class y)  [188]   (13)

Harmonic mean. For performance evaluation on both seen and unseen classes (i.e. the GZSL setting), the Top-1 accuracies of the seen and unseen classes are used to compute the harmonic mean:

H = (2 × acc_{y^S} × acc_{y^U}) / (acc_{y^S} + acc_{y^U}).

In this paper, we designate the Top-1 accuracies and the harmonic mean as the evaluation protocols; a minimal computation sketch of both is given below.
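A minimal Python sketch of these two protocols, the average per-class Top-1 accuracy of Eq. (13) and the seen/unseen harmonic mean; the labels below are toy values for illustration only.

```python
# Average per-class accuracy (Eq. 13) and seen/unseen harmonic mean (illustrative sketch).
import numpy as np

def per_class_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(accs))            # average of the per-class accuracies

def harmonic_mean(acc_seen, acc_unseen):
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Toy usage: accuracy is averaged per class, so the populated class no longer dominates.
print(per_class_accuracy([0, 0, 0, 1], [0, 0, 1, 1]))   # (2/3 + 1.0) / 2 ≈ 0.833
print(round(harmonic_mean(0.70, 0.40), 3))               # 0.509
```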
As one of the main contributions of this research, and for the first time, we provide comprehensive experiments on 21 state-of-the-art models in the ZSL/GZSL domain, including the evaluation and comparison of data-synthesising methods. In this section, we first provide the results for ZSL, GZSL and seen-unseen relatedness on the attribute datasets, and then we present the experimental results on the ImageNet dataset. A minor part of the results is reported from [188] for a more comprehensive comparison.

For the original ZSL task, where only unseen classes are estimated at test time, we compare 21 state-of-the-art models in Table 3. Among them, DAP [80], IAP [80] and CONSE [113] belong to the attribute classifiers; CMT [157], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], GFZSL [171] and DSRL [197] are compatibility learning approaches; SSE [204] and SYNC [22] are representative models of cross-modal embedding; and DEM [202], GAZSL [211], f-CLSWGAN [187], CVAE-ZSL [106] and SE-ZSL [75] are visual embedding models. From the hybrid (combination) category, we compare the results of SAE [72]. Three transductive approaches, ALE-tran [6], GFZSL-tran [171] and DSRL [197], are also presented among the selected models. Due to the intrinsic nature of the transductive setting, their results are competitive and in some cases better than the inductive methods; e.g. for GFZSL-tran [171] the accuracy is 9.9% higher than CVAE-ZSL [106] on the PS split of the AWA1 dataset. However, in comparison with the inductive form of the same model, there are cases where the inductive model has better accuracy: e.g. on the PS split of the aPY dataset the performance is 38.4% (inductive) vs 37.1% (transductive); for the ALE-tran [6] model on the PS split of SUN it is 58.1% vs 55.7% in favour of the inductive type, and on the PS split of CUB it is 54.9% vs 54.5%.

GFZSL [171], a compatibility-based approach, has the best scores compared to the other models of the same category on every dataset except CUB, where SJE [7] tops the results in both splits; this superiority could be due to the generative nature of the model. GFZSL [171] performs the best on AWA1 in both the inductive and transductive settings. Among the cross-modal methods, SYNC [22] performs better than SSE [204] on the SUN and CUB datasets, while for AWA1, AWA2 and aPY it has lower performance than SSE [204] in the SS and proposed splits. Visual generative methods have proved to perform better, as they turn the problem into the traditional supervised form; among them, SE-ZSL [75] has the most outstanding performance. For the proposed split, in one case on the CUB dataset, SE-ZSL [75] even performs better than the transductive ALE-tran [6], with accuracies of 59.6% vs 54.5%. In the PS split of AWA1, CVAE-ZSL [106] stays at the top, with 1.9% higher accuracy than the second-best performing model. The accuracies for the SS splits are higher than for PS in most cases, and the reason could be the test images included in the training samples, especially for AWA1 and AWA2, as reported in [186].

A more realistic scenario, where previously learned concepts are estimated alongside new ones, also needs to be examined. The same 21 state-of-the-art models as in the ZSL challenge are compared: DAP [80], IAP [80], CONSE [113], CMT [157], SSE [204], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], SYNC [22], SAE [72], GFZSL [171], DEM [202], GAZSL [211], f-CLSWGAN [187], CVAE-ZSL [106], SE-GZSL [75], ALE-tran [6], GFZSL-tran [171] and DSRL [197]. CADA-VAE [145] is added to the comparison as a model combining the visual feature augmentation approach with cross-modal alignment. CMT* [157], which adds novelty detection, is included in the report as an alternative version of CMT [157]. The results in Table 4 are for the PS splits.

Table 4. Generalised Zero-Shot Learning results for the Proposed Split (PS) on the SUN, CUB, AWA1, AWA2, and aPY datasets. We measure the Top-1 accuracy in % for seen (y^S) and unseen (y^U) classes and their harmonic mean (H); † and ‡ denote the inductive and transductive settings, respectively.

As shown in the table, the results on y^S are dramatically higher than on y^U, since in GZSL the test search space includes seen as well as unseen classes. This gap is most conspicuous in attribute classifiers like DAP [80], which performs poorly on AWA1 and AWA2, in the hybrid approaches, and in GFZSL [171], which results in 0% accuracy on SUN and CUB when training classes are estimated at test time. However, for three models, f-CLSWGAN [187], SE-GZSL [75] and CADA-VAE [145], the accuracy on the SUN dataset for y^U is higher than for y^S; e.g. for SE-GZSL [75] it is 10.4% higher. For a fair comparison, the weighted average over the training and test classes is also reported. According to the harmonic means, the best model among these on all evaluated datasets is SE-GZSL [75], although its results have not been reported for aPY. In some cases, the attribute classifiers achieve the best results on y^S. Transductive models have fluctuating results in comparison with their inductive versions. CADA-VAE [145], added to the comparison, achieves the best performance in all of the harmonic-mean cases (results for aPY are not reported), higher than all of the transductive methods.

For fine-grained problems, it is sometimes important to measure the closeness of previously known concepts to novel unknown ones. For this purpose, a total of eleven models are compared in Table 5: MCZSL [4], WAC-Linear [37], WAC-Kernel [36], ESZSL [137], SJE [7], ZSLNS [123], SynC-fast [22], SynC-OVO [22], ZSLPP [38], GAZSL [211] and CANZSL [26]. SCE is the hard split and thus has lower results compared to the SCS split.
Two variations are reported for the SYNC [22] model: SynC-fast denotes the setting in which the standard Crammer-Singer loss is used, and SynC-OvO denotes the setting with one-versus-other classifiers; the first setting has better accuracies on CUB. CANZSL [26] outperforms all other models on both datasets and splits; compared with the next best performing model, GAZSL [211], it improves the accuracy by 4% (from 10.3% to 14.3%) on the SCE split of the CUB dataset and from 35.6% to 38.1% on the SCS split of NAB. Similar to the previous experiments, in the seen-unseen relatedness challenge the models that contain a feature-generating step obtain the highest results.

ImageNet is a large-scale, single-labelled dataset with an imbalanced number of images per class that possesses the WordNet hierarchy instead of human-annotated attributes; it is therefore a useful means to measure the performance of various methods in recognition-in-the-wild scenarios. The performances of 12 state-of-the-art models are reported here: CONSE [113], CMT [157], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], SYNC [22], SAE [72], f-CLSWGAN [187], CADA-VAE [145], and f-VAEGAN-D2 [189]. All Top-1 accuracies, except for those of the data-generating models, are reported from the experiments in [186]. As can be seen in Figure 4b, feature-generating methods have outstanding performance compared to the other approaches. Although the results of f-VAEGAN-D2 [189] are only available for the 2H, 3H, and All splits, it still has the highest accuracies among the compared models. SYNC [22] and f-CLSWGAN [187] are the next best performing models, with approximately the same accuracies. CONSE [113], a representative of the attribute-classifier based models, is also superior to the direct compatibility approaches. ESZSL [137], a model with a linear compatibility function, outperforms the other models within its category; however, in one case, SJE [7] has slightly better accuracy in the L500 split setting. It can be seen from the figures that the results are conspicuously better on coarse-grained classes, while fine-grained classes with few images per class remain more challenging. However, if the test search space is too large, the accuracies decrease; e.g., M5K has lower accuracies than the L500 splits, and the 20K split is the lowest. The GZSL results are important in that they depict the models' ability to recognise both seen and unseen classes at test time. The results for the SYNC [22] model are only reported in the L5K setting. As shown in Figure 4b, the trend is similar to ZSL: the most populated classes yield better results than the least populated ones, but the results degrade when the search space becomes too large, as in the decreasing trends in Table 4. (Table 4 reports the Generalised Zero-Shot Learning results for the Proposed Split (PS) on the SUN, CUB, AWA1, AWA2, and aPY datasets, measuring the Top-1 accuracy in % for seen (y_S) and unseen (y_U) classes and their harmonic mean (H); † and ‡ denote the inductive and transductive settings, respectively.) Moreover, data-generating approaches dominate the other strategies. CADA-VAE [145], which has the advantages of both cross-modal alignment and data feature synthesising methods, evidently outperforms the other models; in one case, M500, it has nearly double the accuracy of f-CLSWGAN [187]. For the semantic embedding category, although ESZSL [137] had better results on ZSL, it falls behind approaches like ALE [6], DeViSE [43], and SJE [7].
During recent years, zero-shot learning has proved to be a necessary challenge to solve for many different scenarios and applications, and the demand for learning without access to the unseen target concepts is increasing each year. Zero-shot learning is widely discussed in the computer vision field, for instance in general object recognition, as in [133] and [140], which aim to locate objects besides recognising them. Several other variations of ZSL models have been proposed for the same purpose, such as [13], [126], and [30]. Zero-shot emotion recognition [200] addresses the task of recognising unseen emotions, while zero-shot semantic segmentation aims to segment unseen object categories [19], [177]. Moreover, for the task of retrieving images from large-scale data, zero-shot learning has a growing body of research [98], [194], along with sketch-based image retrieval systems [35], [34], [150]. Zero-shot learning is also applied in visual imitation learning to reduce human supervision by automating the exploration of the agent [117], [82]. Action recognition is the task of recognising a sequence of actions from the frames of a video; if the new actions are not available during training, zero-shot learning can be a solution, as in [45], [124], [107], and [149]. Zero-shot style transfer is the problem of transferring the texture of a source image to a target image when the style is not pre-determined but arbitrary [151]. The zero-shot resolution enhancement problem aims at enhancing the resolution of an image without pre-defined high-resolution training examples [154]. Zero-shot scene classification for HSR images [85] and scene-sketch classification [191] have also been studied as further applications of ZSL in computer vision.

Zero-shot learning has also left its footprint in the area of NLP. Zero-shot entity linking links entity mentions in text using a knowledge base [95]. Many research works focus on translating between languages without predetermined translation pairs [50], [62], [53], [78]. ZSL has also been applied to sentence embedding [10] and to the style transfer of text, where the source is converted into another, arbitrary style, such as the artistic technique discussed in [21]. In the audio processing field, zero-shot voice conversion to another speaker's voice [122] is another applicable scenario of ZSL.

In the era of the COVID-19 pandemic, many researchers have worked on Artificial Intelligence and Machine Learning based methodologies to recognise positive COVID-19 cases based on CT scan images or chest X-rays. Two prominent features in chest CT used for diagnosis are ground-glass opacities (GGO) and consolidation, which have been considered by several researchers, such as [39], [198], [92], and [146]. [111] uses three CNN models to detect COVID-19, among which ResNet50 shows a very high classification performance. [146] introduces a deep-learning based system that segments the infected regions and the entire lung in an automatic manner. [184] shows that elevated procalcitonin, and unilateral or bilateral consolidations with a surrounding halo on chest CT, are prominent findings in paediatric patients. [88] introduces COVNet to extract 2D local and 3D global features from 3D chest CT slices, and claims the ability to distinguish COVID-19 from community-acquired pneumonia (CAP).
[152] shows different imaging patterns of COVID-19 cases depending on the time since infection. [208] classifies the respiratory CT scan changes into four stages and shows the most dramatic changes to occur in the first 10 days from the onset of initial symptoms. [201] introduce a deep learning based anomaly detection model which extracts high-level features from the input chest X-ray image. [56] introduce COVIDX-Net to classify positive COVID-19 cases in X-ray images; it evaluates seven different architectures, among which VGG19 outperforms the others. [3] propose COVID-CAPS, which is based on Capsule Networks [57] to avoid the drawbacks of CNN-based architectures, as it captures spatial information better; it operates on a small dataset of X-ray images. [1] employ a class decomposition mechanism in DeTraC [2], a deep convolutional network that can handle irregularities in X-ray image datasets. Zhang et al. [203] propose a method for X-ray medical image segmentation using task-driven generative adversarial networks. [128] proposes a 121-layer CNN called CheXNet, trained on the ChestX-ray14 dataset [180], to detect pneumonia with localisation of the most infected areas in the X-ray images. [139] shows that a possible diagnostic criterion could be the existence of bilateral pulmonary areas of consolidation in chest X-rays, and [169] use DenseNet-169 for feature extraction followed by an SVM classifier to detect pneumonia from chest X-ray images.

A common weakness among the majority of the above-mentioned research works is that they either conduct their evaluations on a very limited number of cases due to the lack of comprehensive datasets (which puts the validity of the reported results under question), or they suffer from underlying uncertainties due to the unknown nature and characteristics of the novel COVID-19, not only for the medical community but also for machine learning and data analytics experts. In such an uncertain atmosphere with limited training datasets, we strongly recommend the adoption of Zero-shot learning and its variants (as discussed in Figure 4) as an efficient deep learning based solution towards COVID-19 diagnosis. Diagnosis and recognition of the very recent and global challenge of the COVID-19 disease, caused by the severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2), is a perfect real-world application of Zero-shot learning, where we do not have millions of annotated samples available, and the symptoms of the disease and the chest X-rays of infected people may vary significantly from person to person. Such a scenario can truly be considered a novel unseen target or classification challenge. We only know some of the symptoms of people infected with COVID-19, in the form of advice, text notes, and chest X-ray interpretations, all serving as auxiliary data that have partial similarities with other inflammatory lung diseases such as asthma or SARS. So we have to seek a semantic relationship between the training classes and the new unseen classes. Therefore, ZSL can help us significantly in coping with this new challenge, e.g. by inferring SARS-CoV-2 from the previously learned diagnoses of asthma and pneumonia, using written medical documents about the respiratory tract and chest X-ray images.
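To make this recommendation more concrete, the clinician-provided auxiliary information can be encoded as a class-attribute matrix in which the unseen COVID-19 class is described only by textual findings and never by training images. The following is a purely illustrative sketch: the attribute list and all numeric values are hypothetical placeholders, not validated clinical descriptors.

```python
import numpy as np

# hypothetical radiological attributes derived from clinicians' textual descriptions
attributes = ["ground_glass_opacity", "bilateral_involvement",
              "patchy_consolidation", "parenchymal_thickening", "hyperinflation"]

# seen classes (with labelled X-rays) and one unseen class described only by attributes
class_attr = {
    "healthy":   np.array([0.0, 0.0, 0.0, 0.0, 0.0]),
    "asthma":    np.array([0.1, 0.2, 0.1, 0.1, 0.9]),
    "pneumonia": np.array([0.6, 0.4, 0.8, 0.5, 0.1]),
    "covid19":   np.array([0.9, 0.8, 0.7, 0.6, 0.1]),   # unseen: attributes only
}

def predict(attr_scores, candidates=class_attr):
    """Assign the class whose attribute signature is closest to the attribute
    scores predicted from a chest X-ray by classifiers trained on seen classes."""
    return min(candidates, key=lambda c: np.linalg.norm(candidates[c] - attr_scores))

print(predict(np.array([0.85, 0.75, 0.6, 0.55, 0.2])))   # -> 'covid19'
```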
In the case of few-shot learning, a handful of chest CT scans or X-rays of positive COVID-19 cases can also be beneficial as a further support set, alongside the chest X-ray images of SARS, asthma, and pneumonia, to infer the novel COVID-19 cases. As a general rule, and based on the recent successful applications, we can infer that in any scenario where the goal is to reduce supervision and the target of the problem can be learned through side information and its relation to the seen data, Zero-shot learning can be adopted as one of the best learning techniques and practices.

A typical zero-shot learning problem usually faces three well-known issues that need to be solved in order to enhance the performance of the model: bias, hubness, and domain shift; every model revolves around solving one or more of these issues. In this section, we discuss the efforts made by different approaches to alleviate bias, hubness, and domain shift, and explain the logic each approach uses to learn its model.

Bias. The problem with the ZSL and GZSL tasks is that the imbalance between training and test classes causes a bias towards the seen classes at prediction time. Other reasons for bias could be high dimensionality and the lack of manifold structure in the features. Several data-generating approaches have worked on alleviating bias by synthesising visual data for the unseen classes. [187] generate semantically rich CNN features of the unseen classes to make the unseen embedding space better represented. [106] generates pseudo seen- and unseen-class features and then trains an SVM classifier to mitigate bias. [143] improve the quality of the synthesised examples by using a gradient matching loss. Models that combine data generation or reconstruction with other techniques have also proved effective in alleviating bias. [97] use an intermediate space to help discover the geometric structure of the features, which regression-based projections previously failed to capture. [24] use the calibrated stacking rule. [145] generate latent features of size 64, with the idea that low-dimensional representations tend to mitigate bias. [153] uses two reconstruction regressors to diminish the bias. Transductive approaches such as [143] are also used to address the bias issue. [159] forces the unseen classes to be projected onto fixed pre-defined points to avoid biased results.

Hubness [125]. In high-dimensional mapping spaces, some samples (hubs) may falsely end up as the nearest neighbours of many other points in the semantic space, resulting in incorrect predictions. To avoid hubness, [176] propose a stage-wise bidirectional latent embedding framework. When a mapping is performed from a high-dimensional feature space to a low-dimensional semantic space using regressors, the distinctive features partially fade, whereas in the visual feature space the structures are better preserved. Hence, the visual embedding space is well known for mitigating the hubness problem; [174] and [202] use the output of the visual space of the CNN as the embedding space.

Domain shift. The zero-shot learning challenge can be considered a domain adaptation problem, because the labelled source data is disjoint from the unlabelled target-domain data; this is called projection domain shift. Domain adaptation techniques are used to learn the intrinsic relationships among these domains and transfer knowledge between the two.
A considerable amount of work has been done in the transductive setting, which has been successful in overcoming the domain-shift issue. [44], a multi-view embedding framework, performs label propagation on a graph with a heuristic one-stage self-learning approach that assigns points to their nearest data points. [71] introduces a regularised sparse-coding based unsupervised domain adaptation framework that solves the domain-shift problem. [206] use a structured prediction method to solve the problem by visually clustering the unseen data. [174] use a visual constraint on the centre of each class when the mapping is being learned. Since the pure definition of the ZSL challenge is the inaccessibility of unseen data during training, several inductive approaches have tried to solve the problem as well. [72] propose to reconstruct the visual features to alleviate this issue. [197] perform sparse non-negative matrix factorisation for both domains in a common semantic dictionary. MFMR [193] exploits the manifold structure of the test data with a joint prediction scheme to avoid domain shift. [138] use entropy minimisation in the optimisation. [86] preserve the semantic similarity structure of the seen and unseen classes to avoid the occurrence of domain shift. [87] mitigates projection domain shift by generating soul samples that are related to the semantic descriptions.

These three common issues, together with the weaknesses of each method, motivate the choice of a particular approach when solving the ZSL problem. Attribute classifiers are considered customised, since human annotations are used; however, this makes the problem a laborious task with strong supervision. Compatibility learning approaches have the ability to learn directly, eliminating the intermediate step, but they often face the bias and hubness problems. Manifold learning addresses this weakness of the semantic learning approaches by preserving the geometrical structure of the features. Cross-modal latent embedding approaches take a different point of view and leverage both visual and semantic features and the similarities and differences between them; they often propose methods for aligning the structures between the two modes of features, but this category also suffers from the hubness problem when dealing with high-dimensional data. Visual space embedding approaches have the advantage of turning the problem into a supervised one by generating or aggregating visual instances for the unseen classes; they are also favourable for the hubness problem, because the high dimensionality of the visual space preserves the information structure better, and for the bias problem, because generating unseen-class samples alleviates the data imbalance. The main challenge here is generating more realistic-looking data. Another setting is transductive learning, which offers a solution to the bias problem by balancing the data with gathered unseen examples, yet it is not applicable to many real-world problems, since the original definition of ZSL prohibits the use of unseen data during the training phase. Depending on the real-world scenario, each way of solving the problem might be the most appropriate choice, and some approaches improve the solution by combining two or more methods to benefit from each one's strengths.
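As a concrete illustration of the bias-mitigation ideas discussed above, the following minimal sketch applies a calibrated-stacking style rule at GZSL test time, penalising seen-class scores by a constant calibration factor (the factor and scores below are hypothetical; in practice the factor would be tuned on validation data):

```python
import numpy as np

def gzsl_predict(scores, seen_mask, gamma=0.7):
    """Calibrated stacking: subtract a calibration factor gamma from the
    seen-class scores to counter the bias towards seen classes."""
    calibrated = scores - gamma * seen_mask      # seen_mask: 1 for seen classes, 0 for unseen
    return int(np.argmax(calibrated))

scores = np.array([2.1, 1.9, 1.6, 1.8])          # compatibility scores over 4 classes
seen_mask = np.array([1, 1, 0, 0])                # first two classes were seen in training
print(gzsl_predict(scores, seen_mask))            # -> 3 (an unseen class wins after calibration)
```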
In this article, we performed a comprehensive and multifaceted review of the Zero-Shot/Generalised Zero-Shot Learning challenge, its fundamentals, and its variants for different scenarios and applications such as COVID-19 diagnosis, autonomous vehicles, and similar complex real-world applications that involve fully or partially new concepts that have never or rarely been seen before, besides the barrier of limited annotated datasets. We divided the recent state-of-the-art methods into four space-wise embedding categories. We also reviewed different types of side and auxiliary information, and went through the popular datasets and their corresponding splits for the ZSL problem. The paper also contributed experimental results for some of the common baselines and assessed the advantages and disadvantages of each group, as well as the ideas behind the different families of solutions proposed to improve each group. Our evaluation reveals that data-synthesis methods and combinational approaches yield the best performance: by synthesising data, the problem shifts to the classic recognition/diagnosis problem, and by combining other methods, the model utilises the advantages of each embedding technique. These models even outperform compatibility learning models in the transductive setting. This means that models containing a visual data generation step lead to better results than the other approaches and settings. Furthermore, the accuracies improve when the unseen classes have a closer semantic hierarchy and relatedness distance to the seen classes. Finally, we reviewed the current and potential real-world applications of ZSL and GZSL in the near future. To the best of our knowledge, such a comprehensive and detailed technical review and categorisation of the ZSL methodologies, alongside an efficient solution for the recent challenge of the COVID-19 pandemic, has not been done before; hence, we expect it to be helpful in developing new research directions in the AI and health-related research communities.

In this appendix, we provide a concise overview of the main specifications, mathematical formulas, and notations of the 26 state-of-the-art methods that we discussed and compared during this research, in a top-down manner. DAP [80], an acronym for Direct Attribute Prediction, first learns the posteriors of the attributes and then estimates the posteriors of the seen classes. IAP [80] is an indirect approach, as it first learns the posteriors for the seen classes and then uses them to compute the posteriors for the attributes:

p(a_m | x) = \sum_{y_S = 1}^{N_{y_S}} p(a_m | y_S)\, p(y_S | x)

where N_{y_S} is the number of training classes, p(a_m | y_S) is the pre-trained attribute probability of the classes, and p(y_S | x) is the probabilistic multi-class classifier to be learned. CONSE [113] takes a probabilistic approach and predicts an unseen class by a convex combination of the class label embedding vectors. It first learns the probability p(y | x) of the training samples, in which \hat{y} is the most probable label for the training sample. It then combines the semantic embeddings, weighted by their probabilities, to find a label for a given unseen image:

f(x) = \frac{1}{Z} \sum_{t=1}^{T} p(\hat{y}_t | x)\, s(\hat{y}_t)

In this function, Z is the normalisation factor and the sum combines the top-T semantic vectors to infer unseen labels. Linear corresponding functions are the simplest mapping functions and are typically used to map visual features to semantic spaces in vector form. If the mapping parameters are in the form of a matrix, it is called a bi-linear corresponding (compatibility) function. These approaches often include other losses alongside the main mapping function.
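A minimal sketch of such a bi-linear compatibility function, scored against all class embeddings and trained with a simple hinge ranking loss of the kind used by the methods below, is given here (dimensions and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_visual, d_semantic, n_classes = 2048, 312, 150
W = rng.normal(scale=0.01, size=(d_visual, d_semantic))    # bilinear compatibility matrix
class_embeds = rng.normal(size=(n_classes, d_semantic))     # attribute / word vectors per class

def compatibility(x, W, class_embeds):
    """F(x, y) = theta(x)^T W phi(y) for every candidate class y."""
    return (x @ W) @ class_embeds.T

def ranking_loss(x, y_true, W, class_embeds, margin=1.0):
    """Hinge ranking loss: the true class should score higher than
    every other class by at least the margin."""
    scores = compatibility(x, W, class_embeds)
    violations = margin + scores - scores[y_true]
    violations[y_true] = 0.0
    return np.maximum(violations, 0.0).sum()

x = rng.normal(size=d_visual)                                # one image feature (e.g. ResNet)
print(compatibility(x, W, class_embeds).argmax())
print(ranking_loss(x, y_true=3, W=W, class_embeds=class_embeds))
```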
ESZSL [137] introduces a better regulariser and optimises a closed-form objective function in a bi-linear manner. Its regulariser has the form

\Omega(V) = \alpha \|V S\|_F^2 + \beta \|X^T V\|_F^2 + \gamma \|V\|_F^2

where α, β, and γ are the hyper-parameters; the first two terms are the Frobenius norms of the projected attribute features and visual features, respectively, and the third term is a weight-decay penalty on the matrix. ZSLNS [123] proposes an l_{1,2}-norm based loss function and an optimiser based on [137] to help suppress the noise in the textual data. ALE [6] optimises the ranking loss of [167] with a bi-linear compatibility function; the objective used in ALE is similar to the unregularised structured SVM (SSVM) [166]:

F(x, y; W) = \theta(x)^T W \varphi(y)

where F(·) is the compatibility function, W is a matrix whose dimensions are those of the image and label embeddings, and ∆ is the loss of the mapping function. In spite of having different losses, the inspiration comes from the WSABIE algorithm [183]; in ALE, a rank-1 loss with a multi-class objective is used instead of all of the weighted ranks. SJE [7], similar to ALE, learns a bi-linear compatibility function using the unregularised structured SVM objective [166] and trains the model on different supervised and unsupervised class embeddings. DeViSE [43] uses a combination of dot-product similarity and the hinge rank loss of [183] as the objective function:

L(x, y) = \sum_{j \neq y} \max\big(0,\, \alpha - \varphi(y)^T W \theta(x) + c(j)^T W \theta(x)\big)

Here, α is a hyperparameter (margin) and c(j) are randomly selected word embeddings. ZSLPP [38] also uses a bi-linear corresponding function in a part-based cross-modal framework: the visual part detector detects bird parts in the images, while the zero-shot classifier performs prediction on the previously detected visual bird parts based on the textual side information. DSRL [197] uses non-negative sparse matrix factorisation for better feature alignment while learning the compatibility function, and uses label propagation to predict the unseen classes; in its NMF formulation, φ and Z are the dictionary and the latent representation matrix, respectively, and α and β are the hyperparameters. SAE [72], or Semantic AutoEncoder, combines two linear mapping functions in an autoencoder, one for the visual space and the other for the semantic space; in this way, the decoded visual features produce semantically meaningful features after the mapping to the semantic space. The objective to be minimised is

\min_{W} \|X - W^T S\|_F^2 + \lambda \|W X - S\|_F^2

where W^T and W are the decoder and encoder projection matrices and λ is a hyperparameter. WAC-Linear [37] combines a regression function that maps semantic features to the visual space with a knowledge transfer function that maps the textual descriptions to the visual space. Some approaches use non-linear compatibility functions to solve the ZSL challenge. CMT [157] uses a two-layer neural network, similar to common MLP networks [131], that minimises the objective

J(\Theta) = \sum_{y \in Y^S} \sum_{x \in X_y} \|\varphi(y) - \theta^{(2)} \tanh(\theta^{(1)} x)\|^2

over the parameters Θ = (θ^{(1)}, θ^{(2)}). GFZSL [171] introduces both linear and non-linear regression models in a generative approach, as it produces a probability distribution for each class,

p(x | y) = \mathcal{N}(x; \mu_y, \Sigma_y)

where µ_y is the Gaussian mean vector and Σ_y is the diagonal covariance matrix estimated from the attribute vector. It then uses MLE for estimating the seen-class parameters and two regression functions for the unseen categories. In its transductive setting, it uses Expectation-Maximisation (EM), which works like estimating a Gaussian Mixture Model (GMM) of the unlabelled data in an iterative manner; the inferred labels are included in the next iterations. LATEM [185] learns several mappings and selects one to act as the latent variable for a given pair of image and class.
The selected latent embedding learns a piece-wise non-linear compatibility function alongside a ranking loss. Its compatibility function is

F(x, y) = \max_{1 \le i \le K} \theta(x)^T W_i \varphi(y)

where i = 1, ..., K with K ≥ 2 are the indices over the latent choices. DEM [202] and WAC-Kernel [36] learn a non-linear mapping in the inverse direction from different types of class embeddings, e.g. textual data; WAC-Kernel uses a kernel method for the integration of the side information. The objective function of DEM is a least-squares embedding loss with weight decay, which resembles a ridge regression. Some of the methods consider cross-modal feature similarity in a mutual space. SSE [204] learns two embedding functions, ψ, learned from the seen-class auxiliary information, and π, the target-class embedding learned from the seen data, and predicts unseen labels by maximising the similarity between the resulting histograms. SYNC [22] considers the mapping between the semantic space of the external information and the model space, and introduces phantom classes to align the two spaces. The classifier is trained as a sparse linear combination of the classifiers of the phantom classes:

w_c = \sum_{r=1}^{R} s_{cr} v_r \quad (16)

where w_c and v_r are the classifiers of the real and phantom classes, respectively, and s_{cr} are the weights of the bipartite graph connecting the two sets of classes. MCZSL [4] combines compatibility learning with Deep Fragment embeddings [67] in a joint space, defining a visual part embedding and a multi-cue language embedding. In these embeddings, l_m and E_m^{language} are the language token and the language encoder for modality m, f(·) is the ReLU non-linearity, E^{visual} is the visual encoder, and CNN_θ(I_b) is the part descriptor extracted from the bounding box I_b of the image part annotation b. The complete objective function combines the two encoders, where w denotes their parameters and α is the hyperparameter.

Several methods generate images or image features using different visual data synthesis techniques; some of them are VAE-based [69]. CADA-VAE [145] learns latent-space features and class embeddings by training a VAE [69] for both the visual and semantic modalities, and uses a Cross-Alignment (CA) loss to align the latent distributions through cross-modal reconstruction, where i and j denote two different modalities. The Wasserstein distance [46] between the latent distributions i and j is additionally used for Latent Distribution Alignment (LDA), where µ and η are predictions of the encoder. SE-ZSL [75] adds an extra loss term named "feedback loss" to the VAE objective [69], which works like a discriminator to enforce the similarity of the generated samples to the original distribution; in the regressor feedback term, z is random noise. CVAE-ZSL [106] uses a Conditional Variational AutoEncoder [158], conditioned on the attributes, alongside the L2 norm as the reconstruction loss. The objective function of a CVAE is

L_{CVAE} = \mathbb{E}_{q(z|x,c)}[\log p(x|z,c)] - KL\big(q(z|x,c)\,\|\,p(z|c)\big)

where c is the condition. It then trains an SVM classifier [28] for the unseen categories. Other approaches introduce GAN-based [48] methods. GAZSL [211] adds a visual pivot regulariser to the GAN objective function; this regulariser aims to reduce the noise of the Wikipedia articles. Due to the inaccessibility of data, empirical expectations are used instead, where N_S and N_U are the number of samples in class y_S and the number of synthesised features in class y_U, respectively. f-CLSWGAN [187] combines three components, a conditional GAN [48], a conditional WGAN [51], and a classification loss, naming the method f-CLSWGAN; the classification loss acts as a regulariser to enhance the generated features, and β is a hyperparameter.
uses GAN for generating visual features and an inverse GAN to project them back to the semantic space. In this way, the produced features are consistent with their corresponding semantic features. f-VAEGAN-D2 [189] introduces a generative model that integrates VAE and WGAN. In this model, the decoder of VAE and the generator of the WGAN are the same component, and there are two discriminators (D 1 , D 2 ) for this model. The full objective function to be optimised is as follows: Object recognition is one of the highly researched areas of computer vision. Recent recognition models have led to great performance through established techniques and large annotated datasets. After several years of research, the attention over this topic has not only dimmed but it has been proven that there are still ways and rooms to refine models to eliminate existing issues in this area. The number of newly emerging unknown objects are growing. Some examples of these unseen or rarely-seen objects are futuristic object designs like the next generation of concept cars, other existing concepts but with restricted access to them (such as licensed or private medical imaging datasets), or rarely seen objects (such a traffic signs with graffiti on them), or fine-grained categories of objects (such as detection of COVID-19 in comparison with the easier task of detecting a common pneumonia). This brings the necessity of developing a fresh way of solving object recognition problems that concern lesser human supervision and lesser annotated datasets. Several approaches have tried to gather web images to train the developed deep learning models, but aside from the problem of the noisy images, the searched keywords are still a form of human supervision. One-Shot learning (OSL) and Few-shot learning (FSL) are two solutions that are able to learn new categories via one or a few images, respectively [104] , [76] , [70] . Natural language processing (NLP) is another major area of research in AI and the application of Few-shot learning in the integration of NLP and object recognition has become a hot topic recently. [165] was the first FSL-based model to improve the performance of an NLP system. Zero-shot learning (ZSL) [80] , [7] , [188] , [38] , [178] is an emerging research which is completely free of any laborious task of data collection and annotation by experts. Zero-shot learning is a novel concept and learning technique without accessing any exemplars of the unseen categories during training, yet it is able to build recognition models with the help of transferring knowledge from previously seen categories and auxiliary information. The auxiliary information may include textual description, attributes, or vectors of word labels. This means the ZSL is interdisciplinary by nature with two inseparable components of visual and textual data. One of the interesting facts about ZSL is its similarity with the way human learns and recognise a new concept without seeing them beforehand. For example, a ZSL-based model would be able to automatically learn and diagnose COVID-19 patients, based on the existing chest X-ray images of patients with asthma and lungs inflammatory diseases which are already recognised and labelled by clinicians, plus some new auxiliary information about the COVID-19 attributes. 
Here, the auxiliary data can be the description of physicians and clinicians about the unique type of visual patterns, features, damages, or differences they have noticed on the Chest X-ray of patients with positive COVID-19 comparing to asthma X-ray images. A similar concept or approach is applicable in autonomous vehicles, [132] , where a self-driving car is responsible for automatic detection of surrounding cars including e.g. an unseen Tesla concept car based on the subgroup of labelled classic sedan cars plus auxiliary information about the common differences of concept cars than the classic cars; or recognising a Persian deer, based on the auxiliary information available for it and its appearance similarities or differences with other previously known deer. For instance, it belongs to a subgroup of the fallow deer, but with a larger body, bigger antlers, white spots around the neck, and also flat antlers for the male type. Figure 1(a) shows three examples of Posterior-Anterior (PA) and AP projection of chest X-rays of positive cases of COVID-19, and Figure 1(b) represents their corresponding axial CT scans, taken from the COVID-ChestXRay dataset [27] . As it can be seen in the images, common evident anomalies may include unilateral or bilateral patchy ground-glass opacities (GGOs), patchy consolidations and parenchymal thickening. The goal of this research is to build an artificial intelligence based-model that can diagnoses COVID-19 without providing any visual exemplars in the training phase. In that case, the side (auxiliary) information should be provided to assist diagnosis in the test phase. In Figure 2 , the auxiliary information is provided in the form of textual descriptions for two examples of concept cars and COVID-19 X-rays. In Figure 2 (a) we aim at distinguishing new unseen concept cars (bottom row), using description on the exterior of the target and how it differs an already learned car from existing classic vehicle classification system such as in [133] . Similarly, visual differences and similarities between healthy Chest X-rays, Asthma cases, and COVID-19 positive cases are described in Figure 2 (b) as the auxiliary information. Let's assume our pre-trained AI-based medical imaging system is capable of detecting Asthma cases, based on common deep learning techniques using a previously large dataset of labelled Asthma Chest X-ray images. However, these days we are facing an unknown COVID-19 pandemic with very limited annotated Chest X-rays. Obviously, we can not proceed on the same way of training traditional deep-learning methods, due to very sparse labelled images for COVID-19. The good point is that our medical experts and clinicians can provide some auxiliary information (textual descriptions) about common features and similarities among the COVID-19 positive chest X-rays to infer their findings. In Figure 3 , the side information is provided in form of what "attributes": such as foggy effects, white spot features, blurred edges, and white/low-intensity pixel dominance in various areas of the chest X-ray images of COVID-19 patients. Our idea behind the utilisation of ZSL models is to detect, understand, and recognise new concepts using an existing similar deep-learning based classifier, plus the integration of auxiliary information. 
This turns it into a completely new and efficient detection/recognition or diagnosis system without the requirement of collecting a new dataset and a vast amount of costly and time-consuming labelling, especially when a speedy solution is crucial and life-saving, such as in the recent global pandemic. In this research we make four main contributions, as follows:
• We propose to categorise the reviewed approaches based on the embedding spaces that each model uses to learn/infer unseen objects/concepts, as well as describing the variations of the data embedding inside those embedding spaces (Figure 3 and Table 1).
• We evaluate the performance of the state-of-the-art models on famous benchmark datasets (Tables 3-5, Fig. 4). To the best of our knowledge, we are the first to include the evaluation of data-synthesising methods in the research field of applied Zero-shot learning.
• We study the motivation behind leveraging each space as a way to solve the ZSL challenge by reviewing the current issues and their solutions.
• We provide sufficient technical justification to support the idea of using the proposed ZSL model as one of the best practices for COVID-19 diagnosis and other similar applications.
The rest of the article is organised as follows. In Section 2, we introduce the problem of Few-shot, One-shot, and Zero-shot learning. In Section 3, we discuss the test and training phases of Zero-shot learning and generalised Zero-shot learning systems. Section 4 presents the embedding approaches, followed by the evaluation protocols in Section 5. In Section 6, we analyse the outcome of the experiments performed on the different state-of-the-art methodologies. Further discussion of the applications of ZSL is provided in Section 7. In Section 8, we discuss the outcome of this research, and finally, Section 9 gives the concluding remarks.

Few-shot learning (FSL) is the challenge of learning novel classes with a tiny training dataset of one or a few images per category. FSL is closely related to knowledge transfer, where a model previously trained on large data is used for a similar task with fewer training data; the more accurate the transferred knowledge, the better FSL will generalise. Moreover, many approaches employ meta-learning to tackle the challenge of few-shot or few-example learning [156], [64]. The main challenge is to improve the generalisation ability, as FSL often faces the overfitting problem. In this type of learning, there is an auxiliary dataset containing classes, each having annotated samples of the new examples in the training phase. This makes the problem an N-way-K-shot classification with a support set

S = \{(x_i, y_i)\}_{i=1}^{N \times K}

where x_i is the i-th training example and y_i is its corresponding label; the support set contains N × K samples, with N denoting the number of categories and K the number of examples per category. Few-shot learning has K > 1 samples per class. Among the relevant research works, [163] use the shared features among classes to compensate for the requirement of large data, and follow a learning procedure based on boosted decision stumps. HDP-DBM [141] develops a compound of a deep Boltzmann machine and a hierarchical Dirichlet process to learn abstract knowledge at different hierarchies of the concept categories. [156] proposes prototypical networks, which compute the Euclidean distance to prototype representations of each class. It was not until recently that few-shot learning was introduced in computer-aided diagnosis.
The idea of using additional information (attributes) in FSL was first introduced in [165]. [121] proposes a model to classify skin lesions. [68] use FSL for glaucoma diagnosis from fundus images. [127] study the problem of chest X-ray classification for five symptoms, including consolidation. In the case of one-shot learning, there is only K = 1 example per class in the support set, so it is more challenging than FSL. The Bayesian Program Learning (BPL) framework [77] represents each handwritten-character concept as a simple probabilistic program. [14] proposes a cross-generalisation algorithm that replaces features from previously learned classes with similar features of the novel classes in order to adapt to the target task. In Bayesian learning, [41] represents prior knowledge as a probability density function over the model parameters and updates it to compute the posterior model. Matching Nets (MN) [172] use non-parametric attentional memory mechanisms and "episodes" during training. [25] capture salient features of general lung datasets using an encoder, augment multiple views of the images, and then use a prototypical network for a 2-way, 1-shot classification.

Zero-shot learning is the extreme case of FSL where K = 0. In other words, the difference between the two is the absence of any visual examples of the target classes in the training phase of ZSL, whereas in few-shot learning the support set contains a few labelled samples of the novel categories. Also, auxiliary information in the form of class embeddings is one of the main components of Zero-shot learning. ZSL approaches may extend their solutions to one-shot or few-shot learning either by updating the training data with one or a few samples generated through augmentation techniques, or by having access to a few of the unseen images during training [145], [199], [147], [5], [59], [17], [171], [23], [164], [189]; [189] and [145] both use auxiliary text-based information.

ZSL models can be seen from two points of view in terms of the training and test phases: the classic ZSL and the Generalised ZSL (GZSL) settings. In the classic ZSL setting, the model only detects the presence of new classes at the test phase, while in the GZSL setting, the model predicts both unseen and seen classes at test time; hence, GZSL is more applicable to real-world scenarios [94], [75], [210], [86], [145]. The same idea can be applied to FSL, yielding generalised few-shot learning (GFSL), which detects both known and novel classes at test time. In the next paragraphs, we discuss two types of training approaches: inductive vs. transductive training.

Inductive Training: This training setting only uses the seen-class information to learn a new concept. The training data for the inductive setting is

D_{tr} = \{(x, y, c(y)) \mid x \in X^S,\, y \in Y^S\}

where x represents the image features, y the class labels, and c(y) the class embeddings; X^S and Y^S indicate the seen-class images and seen-class labels, respectively. Inductive learning accounts for the majority of the settings used in ZSL and Generalised Zero-Shot Learning (GZSL), e.g. in [7], [43], [113], [137], [52], [204], [22], [207], [90], [171], [189].
Transductive Training: Although the original idea of zero-shot learning is more related to the inductive setting, in many scenarios the transductive setting is used, where unlabelled visual or textual information (or both) for the unseen classes is used together with the seen-class data, e.g. in [134], [71], [44], [6], [205], [192], [52], [207], [90], [171], [159], [174], [189], [143]. The training data for transductive learning is

D_{tr} = \{(x, y, c(y)) \mid x \in X^{S \cup U},\, y \in Y^{S \cup U}\}

where X^{S∪U} denotes that the images come from the union of the seen and unseen classes; similarly, Y^{S∪U} and c(Y^{S∪U}) indicate that the training labels and class embeddings belong to both the seen and the novel categories. According to [197], any approach that relies on label propagation falls into the category of transductive learning. Feature-generating networks with labelled source data and unlabelled target data [189] are also considered transductive methods. The transductive setting is seen as one of the solutions to the domain-shift problem, since the unseen-class information provided during training reduces the discrepancy between the two domains. There is a slight nuance between transductive learning and semi-supervised learning: in the transductive setting, the unlabelled data belong solely to the unseen test classes, while in the semi-supervised setting, the unseen test classes might not be present in the unlabelled data. Furthermore, the difference between FSL and transductive ZSL is that few-shot learning has a few labelled examples of the unseen classes alongside the annotated seen-class examples, whereas in the transductive ZSL setting the examples of the unseen classes are all unlabelled.

ZSL models are developed based on two high-level strategies: a) defining the "Embedding Space" used to combine visual and non-visual auxiliary data, and b) choosing an appropriate "Auxiliary Data Collection" technique.

a) Embedding Spaces. Figure 3 demonstrates the overall structure of a ZSL system in terms of embedding spaces and auxiliary data collection techniques. Such systems either map the visual data to the semantic space (Figure 3.a), embed both visual and semantic data into a common latent space (Figure 3.b), or see the task as a missing-data problem and map the semantic information to the visual space (Figure 3.c). Two or all of these approaches can also be combined to boost the benefits of each individual category. From a different point of view, semantic spaces can also be sub-categorised into Euclidean and non-Euclidean spaces. The intrinsic relationship between data points is better preserved when the geometrical relation between them is considered; these spaces are commonly based on clusters or graph networks. Some researchers therefore prefer manifold learning for the ZSL challenge, e.g. in [134], [175], [207], [192], [193], [91], [181], [83], [63], [210]. The Euclidean spaces are more conventional and simpler, as the data has a flat representation in such spaces; however, loss of information is a common issue in these spaces. Examples of methods using Euclidean spaces are [80], [43], [137], [187], [106], and [145].

b) Auxiliary Data Collection. As mentioned before, zero-shot learning is the challenge of learning novel classes without seeing their exemplars during training. Instead, freely available auxiliary information is used to compensate for the lack of visually labelled data.
Such information can be categorised into two groups: Human-annotated attributes. The supervised way of annotating each image with its related attributes is an arduous process and requires time and expertise, but since the annotations are manual, they yield noiseless and important attributes needed for learning and inference. There are several datasets in which side information in the form of attributes is available for each image, e.g. aPY [40], AWA1 [80], AWA2 [188], CUB [173], and SUN [118]. Several ZSL methods leverage attributes as the side information [7], [137], [97], or visual attributes [79], [40]. Unsupervised auxiliary information. There are several forms of auxiliary information that require minimal supervision and are widely used in the ZSL setting, such as human gaze [66]; WordNet, a large-scale lexical database of around 117,000 English words [136], [135], [5], [7], [185], [100], [4], [123], [102], [181], [83]; or textual descriptions such as web search results [135], Wikipedia articles [43], [113], [37], [7], [84], [4], [123], [38], [112], [211], and sentence descriptions [129]. Textual side information needs to be transformed into class embeddings in order to be used at the training and testing stages; word embeddings and language embeddings are the two representation techniques used for textual side information. As we proceed, we also review the different embedding classes.

In this section, we first provide the task definition of ZSL and GZSL, and then review the four recent families of approaches to the problem. In the standard inductive setting, as mentioned earlier in Section 3, the training set is D_tr and the objective function to be minimised is

\frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i; W)\big) + \Omega(W)

where f(x; W) = argmax_{y \in Y} F(x, y; W) is the mapping function. Through the training phase, the classifier f_{ZSL} : X → Y^U is learned for ZSL, to predict only the novel classes at test time, or f_{GZSL} : X → Y^S ∪ Y^U for the GZSL challenge, to estimate both the novel classes and the previously learned seen classes. For instance, the classifier can be a COVID-19 diagnoser. We categorise the embedding methodologies into four categories based on the space in which they learn/infer the target classes (like the COVID-19 detection in Figure 3): 1. Semantic Embedding Models: the visual data is mapped to the semantic space. 2. Cross-modal Latent Embedding Models: both visual and semantic data are embedded into a common intermediate space. 3. Visual Embedding Models: classification is done with visual feature representations, similar to the traditional recognition problems. 4. Hybrid Embedding Models: a combination of spaces is used in some models to bring together the advantages that the different spaces have. The majority of methods focus on general tasks; however, they are scalable to disease classification. Semantic embedding itself can be sub-categorised into the two tasks of Attribute Classification and Label Embedding, which are discussed here.

Early approaches to Zero-Shot learning leverage manually annotated attributes in a two-stage learning schema: the attributes of an image are predicted in the first stage, and the labels of unseen classes are chosen using similarity measures in the second stage. [79] use a probabilistic classifier to learn the attributes and then estimate the posteriors for the test classes. [136] propose a method to avoid manual supervision by mining the attributes in an unsupervised manner. [135] adopt DAP together with hierarchy-based knowledge transfer for large-scale settings. The method of [65] is based on IAP and uses Self-Organising Incremental Neural Networks (SOINN) to learn and update attributes online; later, in IAP-SS [65], an online incremental learning approach is used for faster learning of the new attributes.
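The two-stage schema described above can be illustrated with the following minimal, DAP-flavoured sketch: per-attribute probabilities are first predicted from the image (stage one), and are then combined into class scores using known class-attribute signatures (stage two). The signatures and probabilities below are hypothetical, and the combination rule is a simplification of the probabilistic formulation discussed next:

```python
import numpy as np

# binary attribute signatures for unseen classes (e.g. from expert annotation)
unseen_signatures = {
    "zebra":      np.array([1, 1, 0, 1]),   # striped, four-legged, flies, hooved
    "polar_bear": np.array([0, 1, 0, 0]),
}

def two_stage_predict(attr_probs, signatures=unseen_signatures):
    """Stage 2: combine per-attribute probabilities p(a_m | x) (stage 1,
    from pre-trained attribute classifiers) into a score for each unseen
    class using its attribute signature."""
    def class_score(sig):
        # probability of observing exactly this attribute pattern
        return np.prod(np.where(sig == 1, attr_probs, 1.0 - attr_probs))
    return max(signatures, key=lambda c: class_score(signatures[c]))

attr_probs = np.array([0.9, 0.8, 0.1, 0.7])   # stage-1 outputs for one test image
print(two_stage_predict(attr_probs))           # -> 'zebra'
```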
The Direct Attribute Prediction (DAP) [80] model first learns the posteriors of the attributes and then estimates the posteriors of the seen classes. On the other hand, Indirect Attribute Prediction (IAP) [80] first learns the posteriors for the seen classes and then uses them to compute the posteriors for the attributes. [179] use a unified probabilistic model based on a Bayesian Network (BN) [110] that discovers and captures both object-dependent and object-independent relationships to overcome the problem of relating the attributes. CONSE [113] learns the probability of the training samples and then predicts an unseen class by a convex combination of the class label embedding vectors. [59] use a random-forest approach for learning more discriminative attributes. Hierarchy and Exclusion (HEX) [31] considers the relations between objects and attributes and maps the visual features [161], [130] of the images to a set of scores to estimate labels for unseen categories. [8] take an unsupervised approach in which they capture the relations between classes and attributes with a three-dimensional tensor while using a DAP-based scoring function to infer the labels. LAGO [12] also follows the DAP model and learns soft and-or logical relations between attributes: the attributes are divided into groups, combined with a soft-OR within each group, and the class label of an unseen sample is predicted via a soft-AND across these groups. If each attribute comes from a singleton group, the all-AND form is used.

Instead of using an intermediate step, more recent approaches learn to map images to a structured Euclidean semantic space automatically, which is an implicit way of representing knowledge. The compatibility function for a linear mapping applies a learned parameter vector w to the image embedding θ(x) of the training samples. In the more common case of a bilinear projection, the parameters take the form of a matrix W:

F(x, y) = \theta(x)^T W \varphi(y)

SOC [114] first maps the image features to the semantic embedding space and then estimates the correct class using a nearest-neighbour search. DeViSE [43] uses a linear corresponding function with a combination of dot-product similarity and the hinge rank loss of [183]. ALE [6] optimises the ranking loss of [167] alongside the bi-linear compatibility function. SJE [7] learns a bi-linear compatibility function using the structured SVM objective [166]. ESZSL [137] introduces a better regulariser and optimises a closed-form objective function in a linear manner. ZSLNS [123] proposes an l_{1,2}-norm based loss function. [17] take a metric-learning approach and linearly embed the visual features into the attribute space. LAGO [12] is a probabilistic model that captures soft and-or relations between groups of attributes; in the case where all attributes form one all-OR group, it becomes similar to ESZSL [137] and learns a bilinear compatibility function. AREN [190] uses attentive region embedding while learning the bilinear mapping to the semantic space in order to enhance the semantic transfer. ZSLPP [38] combines two networks: VPDE-net, which detects bird parts in the images, and PZSC-net, which trains a part-based Zero-Shot classifier from noisy Wikipedia text. DSRL [197] uses non-negative sparse matrix factorisation to align the vector representations with the attribute-based label representation vectors so that more relevant visual features are passed to the semantic space. Some approaches to ZSL use non-linear compatibility functions.
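Before turning to these non-linear variants, it is worth noting that the closed-form solution of ESZSL [137] mentioned above amounts to a few lines of linear algebra. The following minimal sketch follows the published closed form, with illustrative dimensions and hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, a, z = 2048, 500, 85, 40          # feature dim, #train images, #attributes, #seen classes
X = rng.normal(size=(d, m))             # visual features (columns are images)
S = rng.normal(size=(a, z))             # class-attribute signatures
Y = -np.ones((m, z))                    # ground-truth labels in {-1, +1}
Y[np.arange(m), rng.integers(z, size=m)] = 1

gamma, lam = 1.0, 1.0
# closed-form solution of the ESZSL objective:
# V = (X X^T + gamma I)^{-1} X Y S^T (S S^T + lambda I)^{-1}
V = np.linalg.solve(X @ X.T + gamma * np.eye(d),
                    X @ Y @ S.T) @ np.linalg.inv(S @ S.T + lam * np.eye(a))

def predict(x, S_unseen):
    """Score a test feature against unseen class-attribute signatures."""
    return int(np.argmax(x @ V @ S_unseen))

print(predict(rng.normal(size=d), rng.normal(size=(a, 10))))
```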
CMT [157] uses a two-layer neural network, similar to common MLP networks [131], alongside the compatibility function. In UDA [71], a non-linear projection from the feature space to the semantic space (word vectors and attributes) is proposed within an unsupervised domain adaptation problem based on regularised sparse coding. [84] use deep neural network [161] regression to generate pseudo attributes for each visual category from Wikipedia. LATEM [185] constructs a piece-wise non-linear compatibility function alongside a ranking loss. [23] regularise the model using the structural relations of the clusters, in which the cluster centres characterise the visual features. QFSL [159] solves the problem in a transductive setting and projects both source and target images onto several specified points to fight the bias problem. GFZSL [171] uses both linear and non-linear regression models and generates a probability distribution for each class; for the transductive setting, it uses Expectation-Maximisation (EM) to estimate a Gaussian Mixture Model (GMM) of the unlabelled data in an iterative manner.

Leveraging non-Euclidean spaces to capture the manifold structure of the data is another approach to the problem. Together with knowledge graphs, the explicit relations between the labels can be represented; in this setting, the side information mainly comes from a hierarchical ontology such as WordNet. The mapping function has the graph-convolutional form

f(X, A) = \sigma(A X W)

where X is the N × D feature matrix, A is the adjacency matrix of the graph, W is the learned weight matrix, and σ is a non-linearity. Propagated Semantic Transfer (PST) [134] first uses the DAP model to transfer knowledge to novel categories and then, following a graph-based learning schema, improves the local neighbourhood structure within them. DMaP [91] jointly optimises the projection of the visual features and the semantic space to improve the transferability of the visual features onto the semantic-space manifold. MFMR [193] decomposes the visual feature matrix into three matrices to further facilitate the mapping of visual features to the semantic spaces; manifold regularisation is used to improve the representation of the geometrical manifold structure of the visual and semantic features. In [83], a Graph Search Neural Network (GSNN) [102] is used in the semantic space, based on the WordNet knowledge graph, to predict multiple labels per image using the relations between them. [181] distils auxiliary information in the forms of word embeddings and a knowledge graph to learn novel categories. DGP [63] proposes dense graph propagation to propagate knowledge directly through dense connections. In [210], a graphical model with a low-dimensional visual-semantic space and a chain-like structure is utilised to close the gap between the high-dimensional features and the semantic domain.

One family of embedding methods measures the similarity between the visual and semantic features in a joint space. Considering unseen classes as a fusion of previously learned seen concepts is called hybrid learning, and hybrid models use a standard scoring function over such combinations of seen-class classifiers. SSE [204] considers the histogram similarity between the seen-class auxiliary information and the seen visual data. SYNC [22] uses two spaces, the semantic space of the external information and the model space, and the alignment between them is conducted with phantom classes; the final classifier is learned as a sparse linear combination of the classifiers of the phantom classes.
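A minimal sketch of the phantom-class combination just described: each real-class classifier is a weighted combination of a small set of phantom (base) classifiers, with weights derived from semantic similarity. The similarity form and all dimensions below are illustrative rather than the exact SYNC implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, a, n_real, n_phantom = 2048, 85, 50, 10
V = rng.normal(size=(n_phantom, d))          # phantom-class classifiers (learned)
B = rng.normal(size=(n_phantom, a))          # phantom-class semantic anchors (learned)
A = rng.normal(size=(n_real, a))             # real-class attribute vectors

def combine(A, B, V, sigma=1.0):
    """w_c = sum_r s_cr * v_r, where s_cr is a normalised similarity between
    real class c and phantom class r (bipartite graph weights)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)     # (n_real, n_phantom)
    s = np.exp(-d2 / (2 * sigma ** 2))
    s /= s.sum(axis=1, keepdims=True)
    return s @ V                                             # (n_real, d) classifiers

W = combine(A, B, V)
print(W.shape)        # (50, 2048): one linear classifier per (seen or unseen) class
```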
TVSE [192] learns a latent space using collective matrix factorisation with graph regularisation to incorporate the manifold structure between source and target instances; moreover, it represents each sample as a mixture of seen-class scores. LDF [93] combines the prototypes of seen classes and jointly learns embeddings for both user-defined attributes and latent attributes. Inferring unseen labels by measuring the similarity between cross-modal data in a shared latent space is another workaround for the ZSL challenge. The first term in the objective function of standard cross-modal alignment approaches is a reconstruction loss in which Y is the one-hot encoding of the corresponding class labels and ‖·‖²_F denotes the squared Frobenius norm. Approaches to joint-space learning are grouped into two categories: parametric methods, which follow slow learning by optimising a problem, and non-parametric methods, which leverage data points extracted from neural networks in a shared space. Among the parametric methods, [44] proposes a multi-view alignment space for embedding low-level visual features; the learning procedure is based on multi-view Canonical Correlation Analysis (CCA) [47]. [100] applies PCA and ICA embeddings to reveal the visual similarity across the classes, obtains the semantic similarity from the WordNet graph, and then embeds the two outputs into a common space. MCZSL [4] uses a visual part embedding and a multi-cue language embedding in a joint space. In [108], both images and words are represented by Gaussian distribution embeddings. JLSE [205] adopts a dictionary-learning approach to learn the parameters of the source and target domains across two separate latent spaces, where the similarity is computed as a likelihood independent of the class label. CDL [61] uses a coupled dictionary to align the structure of the visual-semantic space using the discriminative information of the visual space. In [73] and [138], a coupled sparse dictionary is leveraged to relate the visual and attribute features, and entropy regularisation is used to alleviate the domain-shift problem. There are also several non-parametric methods. ReViSE [164] combines auto-encoders with a Maximum Mean Discrepancy (MMD) loss [49] in order to align the visual and textual features. DMAE [109] introduces a latent alignment matrix with representations from auto-encoders, optimised by kernel target alignment (KTA) [29] and squared-loss mutual information (SMI) [195]. DCN [94] proposes a novel Deep Calibration Network in which an entropy minimisation principle is used to calibrate the uncertainty of the unseen classes as well as the seen classes. To narrow the semantic gap, BiDiLEL [176] introduces a sequential bidirectional learning strategy: it creates a latent space using the visual data, and the semantic representations of the unseen classes are then embedded in the previously created latent space. This method comprises both parametric and non-parametric models.

Visual embedding is the other type of ZSL method; it performs classification in the original feature space and is orthogonal to semantic-space projection. This is done by learning a linear or non-linear projection function. For linear corresponding functions, WAC-Linear [37] uses the textual descriptions of the seen and unseen categories and projects them to the visual feature space with a linear classifier. [207] follows a transductive setting in which it refines the unseen data distributions using unseen image data.
[207] follows a transductive setting in which it refines the unseen data distributions using unseen image data; to approximate the manifold structure of the data, it uses a global linear mapping to synthesise virtual cluster centres. [52] assigns pseudo labels to samples using reliability (with a robust SVM) and diversity (via diversity regularisation). For learning a non-linear corresponding function, WAC-Kernel [36] proposes a kernel method, based on the representer theorem [144], in order to leverage any kind of side information. DEM [202] uses a least-squares embedding loss to minimise the discrepancy between the visual features and their class-representation embedding vectors in the visual feature space. OSVE [96] reversely maps from the attribute space to the visual space and then trains the classifier using an SVM [11]. In [60], the authors introduce a stacked attention network that incorporates both global and local visual features, weighted by relevance, along with the semantic features. In [174], a visual constraint on the class centres in the visual space is used to avoid the domain-shift problem. There are a variety of generative networks that augment unseen data. Taking the GAN [48] as an example, the adversarial term of the objective function is $\mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(\tilde{x}))]$, where $\tilde{x} = G(z, c(y))$ is the data synthesised by the generator conditioned on the class embedding $c(y)$, and $z$ is random Gaussian noise. The roles of the discriminator and the generator are adversarial in this loss: the former attempts to maximise it while the latter tries to minimise it. Another widely used generative neural network is the Variational AutoEncoder (VAE) [69], whose objective is $\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x)\,\|\,p(z))$: the first term is the reconstruction loss, and the latter is the Kullback-Leibler divergence that works as a regulariser. RKT [175] leverages relational knowledge of the manifold structure in the semantic space and generates virtually labelled data for unseen classes from Gaussian distributions obtained by sparse coding; it then projects them, alongside the seen data, into the semantic space via a linear mapping. GLaP [90] generates virtual instances of an unseen class under the assumption that each representation obeys a prior distribution from which one can draw samples. To ease the embedding into the semantic space, GANZrl [162] proposes to increase the visual diversity by generating samples with specified semantics using GAN models. SE-GZSL [75] uses a feedback-driven mechanism for its discriminator, which learns to map the produced images to the corresponding class attribute vectors; to enforce the similarity between the distributions of the real and generated samples, a loss component is added to the VAE objective [69]. Synthesised images often look unrealistic since they lack intricate details. A way around this issue is to generate features instead. [18] uses a GMMN model [89] to generate visual features for unseen classes. In [42], a multi-modal cycle-consistency loss is used when training the generator, for better reconstruction of the original semantic features. CVAE-ZSL [106] takes attributes and generates features for the unseen categories via a Conditional Variational AutoEncoder (CVAE) [158]; the $\ell_2$ norm is used as the reconstruction loss. GAZSL [211] utilises noisy textual descriptions from Wikipedia to generate visual features; a visual pivot regulariser is introduced to help generate features of better quality. f-CLSWGAN [187] combines three conditional GAN variants for better data generation. f-VAEGAN-D2 [189] combines the architectures of a conditional VAE [158], a GAN [48], and a non-conditional discriminator for the transductive setting.
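The feature-generating line of work above can be illustrated with a deliberately simplified sketch: an attribute-conditioned generator and discriminator trained with the standard (non-Wasserstein) GAN loss to synthesise CNN features for a class, after which a conventional classifier can be trained on real seen-class and synthetic unseen-class features. This is a toy PyTorch illustration of the general recipe, not a faithful re-implementation of f-CLSWGAN [187] or its relatives; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

d_attr, d_noise, d_feat = 85, 64, 2048   # assumed attribute, noise, and feature dimensions

class Generator(nn.Module):
    """Maps (noise, class attributes) to a synthetic visual feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_attr + d_noise, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, d_feat), nn.ReLU())   # ReLU: CNN pooling features are non-negative

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

class Discriminator(nn.Module):
    """Scores how realistic a (feature, attributes) pair looks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_feat + d_attr, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1))

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# One illustrative step on a dummy batch of seen-class features and attributes
x_real = torch.randn(32, d_feat).abs()
attrs = torch.randn(32, d_attr)
z = torch.randn(32, d_noise)

# Discriminator step: real pairs -> 1, generated pairs -> 0
d_loss = bce(D(x_real, attrs), torch.ones(32, 1)) + \
         bce(D(G(z, attrs).detach(), attrs), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator
g_loss = bce(D(G(z, attrs), attrs), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After training, G(z, unseen_attrs) yields features on which a standard classifier is trained.
```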
LisGAN [87] generates unseen features from random noise using conditional Wasserstein GANs [9]; for regularisation, it introduces semantically meaningful soul samples for each class and forces the generated features to be close to at least one of the soul samples. Gradient Matching Network (GMN) [143] trains an improved version of the conditional WGAN [51] to produce image features for the novel classes; it also introduces a Gradient Matching (GM) loss to improve the quality of the synthesised features. In order to synthesise unseen features, SPF-GZSL [86] selects similar instances and combines them to form pseudo features using a centre loss function [182]. In Don't Even Look Once (DELO) by [209], a detection algorithm is trained to synthesise unseen visual features so as to obtain high-confidence predictions for unseen concepts while maintaining low confidence for backgrounds with vanilla detectors. Instead of augmenting data with synthesising methods, data can also be acquired by gathering web images. [112] jointly uses web data, treated as weakly-supervised categories, alongside the fully-supervised auxiliary labelled categories, and then learns a dictionary for the two sets of categories. Several works make use of both visual and semantic projections to reconstruct better semantics and confront the domain-shift issue by alleviating the contradiction between the two domains. Semantic AutoEncoder (SAE) [72] adds a visual feature reconstruction constraint; it combines a linear visual-to-semantic mapping (encoder) and a linear semantic-to-visual mapping (decoder). SP-AEN [24] is a supervised Adversarial AutoEncoder [101] which better preserves the semantics by reconstructing the images from the raw 256 × 256 × 3 RGB colour space. BSR [153] uses two different semantic reconstructing regressors to reconstruct the generated samples into semantic descriptions. CANZSL [26] combines feature synthesis with semantic embedding by using a GAN for generating visual features and an inverse GAN to project them into the semantic space; in this way, the produced features are consistent with their corresponding semantics. Some of the synthesising approaches utilise a common latent space to align the generated feature space with the semantic space and facilitate capturing the relations between the two spaces. [97] introduce a latent-structure-preserving space in which features synthesised from given attributes suffer less from bias and variance decay, with the help of Diffusion Regularisation. CADA-VAE [145] generates a visual-feature latent space in which both visual and semantic features are embedded by a VAE [69]; it uses a Distribution Alignment (DA) loss and a Cross-Alignment (CA) loss to align the cross-modal latent distributions. GDAN [58] combines all three approaches and designs a dual adversarial loss, so that the regressor and the discriminator learn from each other. A summary of the different approaches is reported in Table 1. The number of methods is growing with time, and we can observe that some areas, such as direct learning, common space learning, and visual data synthesis, are more popular for solving the task, while models combining different approaches are fairly new techniques and thus have fewer works reported here.

Table 1. Common ZSL and GZSL methods categorised based on their embedding space model, with further divisions in a top-down manner.

Semantic Embedding
  - Two-Step Learning (attribute classifiers): DAP-based [79], [136], [135], [80], [8], [12]; IAP-based [79], [65], [80], [113]; Bayesian network (BN) [179]; Random Forest model [59]; HEX graph [31]
  - Direct Learning (compatibility functions): Linear [114], [43], [6], [7], [137], [123], [17], [12], [190], [38], [197], [171] or Non-linear [71], [84], [185], [23], [159], [171]
  - Explicit knowledge representation: Graph Convolutional Networks (GCN) [181]; Knowledge graphs [83], [134], [91], [63]; 3-node chains [210]; Matrix tri-factorisation with manifold regularisation [193]
Cross-Modal Latent Embedding
  - Fusion-based models (fusion of seen-class data): combination of seen-class properties [204], [22], [93]; combination of seen-class scores [192]
  - Common representation space models (mapping of the visual and semantic spaces into a joint intermediate space): Parametric [44], [100], [4], [108], [205], [61], [73], [138]; Non-parametric [164], [109], [94]; or Both [176]
Visual Embedding
  - Projection functions (learning the semantic-to-visual projection): Linear [37], [207], [52] or Non-linear [36], [202], [96], [60], [174]
  - Data augmentation: Gaussian distribution [175], [90]; GAN [162]; VAE [75]
  - Visual feature generation: GAN [42], [211], [87]; WGAN [187], [143]; CVAE [106], [209]; VAE+GAN [189]; GMMN [18]; similar feature combination [86]
  - Leveraging web data (web image crawling): dictionary learning [112]
Hybrid
  - Visual + Semantic embedding: AutoEncoder [72]; Adversarial AutoEncoder [24]; GAN with two reconstructing regressors [153]; GAN and an inverse GAN [26]
  - Visual + Cross-modal embedding (feature generation with aligned semantic features): semantic-to-visual mapping [97]; VAE [145]
  - All (generator and discriminator together with a regressor): GAN + dual learning [58]

In this section, we review some of the standard evaluation techniques used to analyse the performance of ZSL methods on the common benchmark datasets in the field, in terms of dataset splits, class embeddings, image embeddings, and various evaluation metrics. First, the benchmark datasets. [168] (NAB) is a fine-grained dataset of birds consisting of 1,011 classes and 48,562 images, categorised based on their visual appearance. A new version of this dataset is proposed by [38], in which identical leaf nodes whose only difference was gender are merged into their parent nodes, resulting in a final 404 classes. Attribute datasets. SUN Attribute [118] is a medium-scale, fine-grained attribute dataset consisting of 102 attributes, 717 categories, and a total of 14,340 images of different scenes. CUB-200-2011 Birds (CUB) [173] is a 200-category fine-grained attribute dataset with 11,788 images of bird species and 312 attributes.
Animals with Attributes (AWA1) [80] is another attribute dataset of 30,475 images with 50 categories and 85 attributes; the images in this dataset are licensed and not publicly available. Later, Animals with Attributes 2 (AWA2) was presented by [188], a free version of AWA1 with more images than its predecessor (37,322 images), the same number of classes and attributes, but different images. aPascal and Yahoo (aPY) [40] is a dataset with a combination of 32 classes, including 20 Pascal and 12 Yahoo attribute classes, with 15,339 images and 64 attributes in total. A summary of the statistics of the attribute datasets is gathered in Table 2. ImageNet. ImageNet [32] is a large-scale dataset that contains 14 million images shared between 21k categories, with each image carrying a single label, which makes it a popular benchmark for evaluating models in real-world scenarios. Its organisation is based on the WordNet hierarchy [105]. ImageNet is imbalanced, as the number of samples per class varies greatly, and it is partially fine-grained. A more balanced version has 1k classes with 1,000 images in each category. There are several approaches in the FSL setting for COVID-19 diagnosis; however, ZSL is still new in the field of disease recognition, so we introduce a dataset suited to the ZSL/GZSL task that contains the required images and textual descriptions in one place. [27] is a small, public dataset of CXR and CT scans suitable for ZSL and few-shot learning experiments. At the time of this research, it had 444 unique clinical notes for a total of 16 categories, from no finding (normal cases) to other pneumonic cases such as COVID-19, MERS, and SARS.

Here we discuss the original splits of the datasets as well as other splits proposed for the zero-shot problem. In ZSL problems, unseen classes should be disjoint from seen classes and test-time samples should be limited to unseen classes; the original splits aim to follow this setting. For SUN, [118] proposed to use 645 classes for training, among which 580 classes are used for training proper and 65 for validation, while the remaining 72 classes are used for testing. For CUB, [6] introduces the split of 150 training classes (including 50 validation classes) and 50 test classes. As for AWA1, [80] introduced the standard split of 40 classes for training (including 13 validation classes) and 10 classes for testing; the same splits are used for AWA2. In aPY, the 20 Pascal classes are used for training (15 classes for training and 5 for validation), while the 12 Yahoo classes are used for testing. Proposed Splits (PS). The standard-split images from SUN, CUB, AWA1, and aPY overlap with some of the ImageNet images used to pre-train the ResNet-101 model. To solve this problem, the proposed splits (PS) are introduced by [186], in which none of the test images are contained in the ImageNet-1K dataset. ImageNet. [186] proposes 9 ZSL splits for the ImageNet dataset; two of them evaluate the semantic hierarchy at distance-wise scales of 2-hops (1,509 classes) and 3-hops (7,678 classes) from the 1k training classes. The remaining seven splits consider the imbalanced sizes of the classes, with increasing granularity, from the 500, 1K, and 5K least-populated classes to the 500, 1K, and 5K most-populated classes, or "All", which denotes a subset of 20k other classes for testing.
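The two constraints stated above, that unseen classes be disjoint from the seen classes and that no test image leak into the data used to pre-train the feature extractor, can be checked mechanically when preparing a split. The snippet below builds a seen/unseen class partition and asserts both constraints; the class names, counts, and image identifiers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

all_classes = np.array([f"class_{i:03d}" for i in range(50)])    # e.g. an AWA-style 50-class dataset
unseen = set(rng.choice(all_classes, size=10, replace=False))    # 10 unseen (test) classes
seen = set(all_classes) - unseen                                 # 40 seen classes (train + validation)

# Constraint 1: seen and unseen label sets are disjoint
assert seen.isdisjoint(unseen)

# Constraint 2 (spirit of the Proposed Splits): no test image overlaps the images
# used to pre-train the feature extractor (hypothetical ID sets shown here).
pretraining_image_ids = {"img_00001", "img_00002"}
test_image_ids = {"img_10001", "img_10002"}
assert pretraining_image_ids.isdisjoint(test_image_ids)

# In the GZSL evaluation, the test pool additionally mixes held-out seen-class images
# with the unseen-class images, while training images come only from seen classes.
```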
To measure the relatedness of seen samples to unseen classes, [38] introduces two splits, Super-Category-Shared (SCS) and Super-Category-Exclusive (SCE). SCS is the easier split, since it considers the relatedness to the parent category, while SCE is harder and measures the closeness of an unseen sample to a particular child node. There exist several kinds of class embeddings, each suitable for a specific scenario. Class embeddings are vectors of real numbers which can be used to make predictions based on the similarity between them, and they can be obtained in three ways: attributes, word embeddings, and hierarchical ontologies. The last two are produced in an unsupervised manner and thus do not require human labour, whereas human-annotated attributes are produced under the supervision of experts with a great amount of effort. Binary, relative, and real-valued attributes are the three types of attribute embeddings. Binary attributes depict the presence of an attribute in an image, so the value is either 0 or 1; they are the easiest type and are provided in the benchmark attribute datasets AWA1, AWA2, CUB, SUN, and aPY. Relative attributes [115], on the other hand, show the strength of an attribute in a given image compared to other images. Real-valued attributes are continuous and therefore have the best quality [7]; in the SUN attribute dataset [118], they are obtained with confidence values by averaging the binary labels from multiple annotators. Word embeddings, also known as textual corpora embeddings, are the second category. Bag of Words (BOW) [54] is a one-hot encoding approach: it simply counts the occurrences of the words in a representation called a bag and ignores word order and grammar. One-hot encoding approaches have the drawback of giving stop words (like "a", "the", and "of") high relevancy counts. Later, Term Frequency-Inverse Document Frequency (TF-IDF) [142] used term weighting to alleviate this problem by filtering out the stop words and keeping meaningful words. Word2Vec [103] is a widely used two-layer neural embedding model with two variants, CBOW and skip-gram. CBOW predicts a target word in the centre of a context using its surrounding words, while the skip-gram model predicts the surrounding words from a target word. CBOW is faster to train and usually gives better accuracy for frequent words, while skip-gram is preferred for rare words and works well with sparse training data. Global Vectors (GloVe) [119] is trained on Wikipedia; it combines local context-window methods and global matrix factorisation, learning word embeddings from global word-word co-occurrence statistics. WordNet [105] is a large-scale public lexical database of 117,000 synsets. Synsets are groups of semantically related words, i.e. synonyms, homonyms, and meronyms of English words, organised with hierarchical distances in a graph structure; approaches based on knowledge graphs therefore often rely on WordNet to measure the similarity between word meanings [136], [135], [5], [7], [185], [100], [4], [123], [102], [181], [83].
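As an example of building unsupervised class embeddings of the kind described above, the snippet below turns short textual descriptions of classes into TF-IDF vectors with scikit-learn. The two-line "articles" are placeholders standing in for Wikipedia pages or clinical notes; any real ZSL pipeline would use much longer documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder class documents (in practice: Wikipedia articles, clinical notes, etc.)
class_docs = {
    "pneumonia": "patchy consolidation of the lung with inflammation and cough",
    "covid-19":  "bilateral ground glass opacities of the lung with fever and dry cough",
    "asthma":    "airway inflammation with wheezing and reversible obstruction",
}

vectorizer = TfidfVectorizer(stop_words="english")            # filters high-count stop words
embeddings = vectorizer.fit_transform(class_docs.values())    # one TF-IDF vector per class

# Cosine similarity between class embeddings, usable as side information in a ZSL model
sim = cosine_similarity(embeddings)
for name, row in zip(class_docs, sim):
    print(name, row.round(2))
```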
In general ZSL scenarios, word-by-word representations are considered; however, with the advent of transfer learning in natural language processing (NLP) and the introduction of contextual word embeddings, the capabilities of such embeddings have been pushed further. Unlike traditional word embeddings, language models can capture the meaning of words based on the context in which they appear. Several contextual representations have been introduced recently and have shown great results. These pre-trained models can be fine-tuned on various ZSL tasks. ELMo [120] is a contextual embedding model: following morphological clues together with a deep bidirectional language model (biLM), ELMo learns the representations. Bidirectional Encoder Representations from Transformers (BERT) [33] is a multi-layer bidirectional Transformer encoder [170] trained on the BooksCorpus [212] dataset and English Wikipedia. It outperforms ELMo by having more parameters and layers. The pre-trained BERT model can be fine-tuned with just one additional output layer; however, BERT suffers from a pretrain-finetune discrepancy because it ignores the dependencies between masked positions. XLNet [196] uses an autoregressive model to overcome this shortcoming of BERT. In addition to the datasets used by BERT, XLNet pre-trains the model on Giga5 [116], ClueWeb 2012-B extended by [20], and Common Crawl (http://commoncrawl.org). ALBERT [81] allows the model size to be increased while lowering memory usage with two parameter-reduction techniques: the first is a factorised embedding parameterisation, and the second is cross-layer parameter sharing. These two techniques result in lower memory usage and higher training speed than BERT; the data used for pre-training is the same as for XLNet. In this article, we report the results of ZSL and GZSL using the same class embeddings as [186], that is, Word2Vec trained on Wikipedia for ImageNet and per-class attributes for the attribute datasets; for the seen-unseen relatedness task, we follow [38] and consider TF-IDF for the CUB and NAB datasets. Existing models use either shallow or deep feature representations. Examples of shallow features are SIFT [99], PHOG [16], SURF [15], and local self-similarity histograms [148]. Among these, SIFT is the most commonly used feature in ZSL models such as [6], [22], and [44]. Deep features are obtained from deep CNN architectures [161] and contain higher-level information. The extracted features are typically one of the following: the 4,096-dim top-layer hidden unit activations (fc7) of AlexNet [74]; the 1,000-dim last fully connected layer (fc8) of VGG-16 [155]; the 4,096-dim 6th-layer (fc6) and 4,096-dim last-layer (fc7) features of VGG-19 [155]; the 1,024-dim top-layer pooling units of GoogleNet [160]; and the 2,048-dim last-layer pooling units of ResNet-101 [55]. In this paper, we consider the ResNet-101 network pre-trained on ImageNet-1K without any fine-tuning, which is the same image embedding used in [186]. Features are extracted from the whole images of SUN, CUB, AWA1, AWA2, and ImageNet, and from the cropped bounding boxes of aPY. For the seen-unseen relatedness task, VGG-16 is used for CUB and NAB, as proposed in [38]. Common evaluation criteria used for the ZSL challenge are as follows. Classification accuracy. One of the simplest metrics is classification accuracy, in which the ratio of the number of correct predictions to the number of samples is measured; however, it results in a bias towards the populated classes. Average per-class accuracy. To reduce this bias towards populated classes, the average per-class accuracy is computed by averaging the per-class ratios of correct predictions [188]:

$acc_{\mathcal{Y}} = \frac{1}{\|\mathcal{Y}\|} \sum_{y=1}^{\|\mathcal{Y}\|} \frac{\#\text{correct predictions in class } y}{\#\text{samples in class } y}$ (13)
Harmonic mean. For performance evaluation on both seen and unseen classes (i.e. the GZSL setting), the Top-1 accuracies on the seen classes ($acc_S$) and the unseen classes ($acc_U$) are used to compute the harmonic mean:

$H = \frac{2 \cdot acc_S \cdot acc_U}{acc_S + acc_U}$

In this paper, we designate the Top-1 accuracies and the harmonic mean as the evaluation protocols. As the main contribution of this research, and for the first time, we provide a comprehensive set of experiments on 21 state-of-the-art models in the ZSL/GZSL domain, including evaluations and comparisons of data-synthesising methods. In this section, we first provide the results for ZSL, GZSL, and seen-unseen relatedness on the attribute datasets, and then we present the experimental results on the ImageNet dataset. A minor part of the results is reported from [188] for a more comprehensive comparison. For the original ZSL task, where only unseen classes are estimated at test time, we compare 21 state-of-the-art models in Table 3, among which DAP [80], IAP [80], and CONSE [113] belong to attribute classifiers. CMT [157], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], GFZSL [171], and DSRL [197] are compatibility learning approaches; SSE [204] and SYNC [22] are representative models of cross-modal embedding; DEM [202], GAZSL [211], f-CLSWGAN [187], CVAE-ZSL [106], and SE-ZSL [75] are visual embedding models. From the hybrid or combination category, we compare the results of SAE [72]. Three transductive approaches, ALE-tran [6], GFZSL-tran [171], and DSRL [197], are also presented among the selected models. Due to the intrinsic nature of the transductive setting, their results are competitive and in some cases better than the inductive methods; for example, GFZSL-tran [171] is 9.9% more accurate than CVAE-ZSL [106]. GFZSL [171], a compatibility-based approach, has the best scores compared to other models of the same category on every dataset except CUB, where SJE [7] tops the results in both splits. This superiority could be due to the generative nature of the model. GFZSL [171] performs best on AWA1 in both the inductive and transductive settings. Among cross-modal methods, SYNC [22] performs better than SSE [204] on the SUN and CUB datasets, while for AWA1, AWA2, and aPY it has lower performance than SSE [204] in the SS split as well as in the proposed split. Visual generative methods have proved to perform better, as they turn the problem into the traditional supervised form; among them, SE-ZSL [75] has the most outstanding performance. For the proposed split, in one case on the CUB dataset, SE-ZSL [75] performs better than ALE-tran [6], its transductive counterpart, with accuracies of 59.6% vs 54.5%. In the PS split of AWA1, CVAE-ZSL [106] stays at the top, with 1.9% higher accuracy than the second-best performing model. The accuracies for the SS splits are higher than for PS in most cases, and the reason could be that some test images are included in the training samples, especially for AWA1 and AWA2, as reported in [186]. A more realistic scenario, in which previously learned concepts are estimated alongside new ones, also needs to be examined. The same 21 state-of-the-art models as in the ZSL challenge are compared: DAP [80], IAP [80], CONSE [113], CMT [157], SSE [204], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], SYNC [22], SAE [72], GFZSL [171], DEM [202], GAZSL [211], f-CLSWGAN [187], CVAE-ZSL [106], SE-GZSL [75], ALE-tran [6], GFZSL-tran [171], and DSRL [197]. CADA-VAE [145] is added to the comparison as a model combining the visual feature augmentation approach with cross-modal alignment.
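A minimal NumPy version of the evaluation protocol defined above (average per-class Top-1 accuracy, Eq. (13), and the harmonic mean used for GZSL) might look as follows; the prediction and label arrays are dummy placeholders.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Average of per-class Top-1 accuracies (Eq. 13), robust to class imbalance."""
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen- and unseen-class accuracies used in GZSL evaluation."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Dummy example
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 1, 0])
acc_u = per_class_accuracy(y_true, y_pred)   # pretend these are unseen classes
acc_s = 0.71                                 # pretend seen-class accuracy
print(acc_u, harmonic_mean(acc_s, acc_u))
```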
CMT* [157] includes a novelty detection mechanism and is reported as an alternative version of CMT [157]. The results in Table 4 use the PS splits. As shown in the table, the results on the seen classes ($acc_S$) are dramatically higher than on the unseen classes ($acc_U$), since in GZSL the test search space includes seen as well as unseen classes. This gap is most conspicuous in attribute classifiers such as DAP [80], which performs poorly on AWA1 and AWA2, in hybrid approaches, and in GFZSL [171], which yields 0% unseen-class accuracy on SUN and CUB when the training classes are also estimated at test time. However, for three models, f-CLSWGAN [187], SE-GZSL [75], and CADA-VAE [145], the unseen-class accuracy on the SUN dataset is higher than the seen-class accuracy; for SE-GZSL [75] it is 10.4% higher. For a fair comparison, the harmonic mean of the training- and test-class accuracies is also reported. According to the harmonic means, the best model on all evaluated datasets is SE-GZSL [75], although its results have not been reported for aPY. In some cases, the attribute classifiers achieve the best results on the seen classes. Transductive models have fluctuating results in comparison with their inductive counterparts. CADA-VAE [145] achieves the best performance in all of the harmonic-mean cases (results for aPY are not reported) and shows the best results, higher than all of the transductive methods. For fine-grained problems, it is sometimes important to measure the closeness of previously known concepts to novel unknown ones. For this purpose, a total of eleven models are compared in Table 5: MCZSL [4], WAC-Linear [37], WAC-Kernel [36], ESZSL [137], SJE [7], ZSLNS [123], two variants of SynC [22], ZSLPP [38], GAZSL [211], and CANZSL [26]. SCE is the hard split and thus has lower results compared to the SCS split. Of the two variations reported for the SYNC [22] model, one uses the standard Crammer-Singer loss and the other uses one-versus-other classifiers; the first setting has better accuracies on CUB. CANZSL [26] outperforms all other models on both datasets and splits, improving the accuracy by 4%, from 10.3% to 14.3%, on the SCE split of the CUB dataset, and from 35.6% to 38.1% on the SCS split of NAB, compared to the next best performing model, GAZSL [211]. Similar to the previous experiments, in the seen-unseen relatedness challenge the models that contain feature-generating steps have the highest results. ImageNet is a large-scale, single-labelled dataset with an imbalanced number of images per class that uses the WordNet hierarchy instead of human-annotated attributes, and it is therefore a useful means of measuring the performance of various methods in recognition-in-the-wild scenarios. The performances of 12 state-of-the-art models are reported here: CONSE [113], CMT [157], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], SYNC [22], SAE [72], f-CLSWGAN [187], CADA-VAE [145], and f-VAEGAN-D2 [189]. All of the Top-1 accuracies, except for the data-generating models, are reported from the experiments in [186]. As can be seen from Figure 4b, feature-generating methods have outstanding performance compared to other approaches. Although the results of f-VAEGAN-D2 [189] are available only for the 2H, 3H, and All splits, it still has the highest accuracies among the models. SYNC [22] and f-CLSWGAN [187] are the next best performing models, with approximately the same accuracies. CONSE [113], a representative of the attribute-classifier-based models, is also superior to the direct compatibility approaches.
ESZSL [137], a model with a linear compatibility function, outperforms the other models within its category; however, in one case, SJE [7] has slightly better accuracy, in the L500 split setting. It can be seen from the figures that the results are conspicuously better on coarse-grained classes, while fine-grained classes with few images per class are more challenging. However, if the test search space becomes too big, the accuracies decrease; for example, M5K has lower accuracies than the L500 split, and the 20K split is the lowest. The GZSL results are important in that they depict the models' ability to recognise both seen and unseen classes at test time. The results for the SYNC [22] model are only reported in the L5K setting. As shown in Figure 4b, the trend is similar to ZSL: populated classes have better results than the least-populated classes, yet the results are poor if the search space becomes too big, as seen in the decreasing trends for both the most- and least-populated classes. Moreover, data-generating approaches dominate other strategies. CADA-VAE [145], which has the advantages of both cross-modal alignment and data feature synthesis, evidently outperforms the other models; in one case, M500, it has nearly double the accuracy of f-CLSWGAN [187]. In the semantic embedding category, although ESZSL [137] had better results on ZSL, it falls behind approaches such as ALE [6], DeViSE [43], and SJE [7]. In recent years, zero-shot learning has proved to be a necessary challenge to solve for many scenarios and applications, and the demand for learning without access to the unseen target concepts is increasing each year. Zero-shot learning is widely discussed in the computer vision field, for example in object recognition in general, as in [133] and [140], which aim to locate objects as well as recognise them. Several other variations of ZSL models have been proposed for the same purpose, such as [13], [126], and [30]. Zero-shot emotion recognition [200] addresses the task of recognising unseen emotions, while zero-shot semantic segmentation aims to segment unseen object categories [19], [177]. Moreover, on the task of retrieving images from large-scale data, zero-shot learning has a growing body of research [98], [194], along with sketch-based image retrieval systems [35], [34], [150]. Zero-shot learning also has applications in visual imitation learning, reducing human supervision by automating the exploration of the agent [117], [82]. Action recognition is the task of recognising a sequence of actions from the frames of a video; when the new actions are not available during training, zero-shot learning can be a solution, as in [45], [124], [107], and [149]. Zero-shot style transfer in images is the problem of transferring the texture of a source image to a target image when the style is not pre-determined but arbitrary [151]. The zero-shot resolution enhancement problem aims at enhancing the resolution of an image without pre-defined high-resolution training examples [154]. Zero-shot scene classification for HSR images [85] and scene-sketch classification [191] have been studied as further applications of ZSL in computer vision. Zero-shot learning has also left its footprint in the area of NLP: zero-shot entity linking links entity mentions in text using a knowledge base [95].
Many research works focus on the task of translating from one language to another without pre-determined translation pairs [50], [62], [53], [78]. Zero-shot approaches also appear in sentence embedding [10] and in text style transfer, where a common technique is to convert the source into another, arbitrary style, such as the artistic technique discussed in [21]. In the audio processing field, zero-shot voice conversion to another speaker's voice [122] is another applicable scenario of ZSL. In the era of the COVID-19 pandemic, many researchers have worked on artificial intelligence and machine learning based methodologies to recognise positive COVID-19 cases from CT scan images or chest X-rays. Two prominent features in chest CT used for diagnosis are ground-glass opacities (GGO) and consolidation, which have been considered by researchers such as [39], [198], [92], and [146]. [111] uses three CNN models to detect COVID-19, among which ResNet50 shows a very high classification performance. [146] introduces a deep-learning based system that automatically segments the infected regions and the entire lung. [184] shows that increased Procalcitonin and unilateral or bilateral consolidation with a surrounding halo are prominent findings in the chest CT of paediatric patients. [88] introduces COVNet to extract 2D local and 3D global features from 3D chest CT slices; the method claims the ability to distinguish COVID-19 from community-acquired pneumonia (CAP). [152] shows different imaging patterns of COVID-19 cases depending on the time since infection. [208] identifies four stages of respiratory CT changes and shows that the most dramatic changes occur in the first 10 days from the onset of initial symptoms. [201] introduces a deep learning based anomaly detection model which extracts high-level features from the input chest X-ray image. [56] introduces COVIDX-Net to classify positive COVID-19 cases in X-ray images; it evaluates 7 different architectures, among which VGG19 outperforms the others. [3] proposes COVID-CAPS, based on Capsule Networks [57], to avoid the drawbacks of CNN-based architectures, as it captures spatial information better; it operates on a small dataset of X-ray images. [1] employs a class decomposition mechanism in DeTraC [2], a deep convolutional network that can handle irregularities in X-ray image datasets. Zhang et al. [203] propose a method for X-ray medical image segmentation using task-driven generative adversarial networks. [128] proposes a 121-layer CNN called CheXNet, trained on the ChestX-ray14 dataset [180], to detect pneumonia and localise the most infected areas in X-ray images. [139] shows that a possible diagnostic criterion could be the existence of bilateral pulmonary areas of consolidation in chest X-rays, and [169] uses DenseNet-169 for feature extraction, followed by an SVM classifier, to detect pneumonia from chest X-ray images. A common weakness among the majority of the above-mentioned research works is that they either conduct their evaluations on a very limited number of cases, due to the lack of comprehensive datasets (which puts the validity of the reported results in question), or they suffer from underlying uncertainties due to the unknown nature and characteristics of the novel COVID-19, not only for the medical community, but also for machine learning and data analytics experts.
In such an uncertain atmosphere with limited training datasets, we strongly recommend the adoption of Zero-shot learning and its variants (as discussed in Figure 4) as an efficient deep learning based solution for COVID-19 diagnosis. Diagnosis and recognition of the very recent and global challenge of the COVID-19 disease, caused by the severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2), is a natural real-world application of Zero-shot learning: we do not have millions of annotated samples available, and the symptoms of the disease and the chest X-rays of infected people may vary significantly from person to person. Such a scenario can truly be considered a novel unseen target or classification challenge. We only know some of the symptoms of people infected with COVID-19, in the form of advice, text notes, and chest X-ray interpretations, all serving as auxiliary data with partial similarities to other lung inflammatory diseases, such as asthma or SARS. So, we have to seek a semantic relationship between the training classes and the new unseen classes. Therefore, ZSL can help us significantly in coping with this new challenge, for example by inferring SARS-CoV-2 from the previously learned diagnoses of asthma and pneumonia, using written medical documents about the respiratory tract and chest X-ray images. In the case of few-shot learning, a handful of chest CT scans or X-rays of positive COVID-19 cases can also be beneficial as a further support set, alongside the chest X-ray images of SARS, asthma, and pneumonia, to infer the novel COVID-19 cases. As a general rule, and based on the recent successful applications, we can infer that in any scenario where the goal is to reduce supervision and the target of the problem can be learned through side information and its relation to the seen data, Zero-shot learning can be adopted as one of the best learning techniques and practices.
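To make the recommendation above more tangible, the following is a purely illustrative sketch of the inference step of an embedding-based ZSL diagnoser: a chest X-ray is encoded by a pre-trained CNN, projected into the semantic space by a matrix learned on seen lung conditions, and matched against class embeddings built from auxiliary clinical text (for example, TF-IDF vectors of clinical notes). Every array, dimension, class name, and the projection itself are hypothetical placeholders; this is not a validated diagnostic model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_feat, d_sem = 2048, 300                      # assumed CNN-feature and semantic dimensions
W = rng.normal(size=(d_feat, d_sem))           # projection learned on seen classes (placeholder)

# Class embeddings derived from auxiliary text (placeholders for real clinical descriptions)
class_names = ["normal", "asthma", "viral pneumonia", "covid-19"]   # last one unseen during training
class_embeddings = rng.normal(size=(len(class_names), d_sem))

def diagnose(cxr_feature):
    """Project a chest X-ray feature into the semantic space and rank the candidate conditions."""
    projected = cxr_feature @ W                                    # (d_sem,)
    sims = class_embeddings @ projected
    sims /= (np.linalg.norm(class_embeddings, axis=1) * np.linalg.norm(projected) + 1e-12)
    order = np.argsort(-sims)                                      # cosine similarity ranking
    return [(class_names[i], float(sims[i])) for i in order]

cxr_feature = rng.normal(size=d_feat)          # stand-in for a ResNet-101 feature of an X-ray
print(diagnose(cxr_feature))
```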
A typical zero-shot learning problem usually faces three well-known issues that need to be addressed in order to enhance the performance of the model: bias, hubness, and domain shift. Every model revolves around solving one or more of these issues. In this section, we discuss the efforts made by different approaches to alleviate bias, hubness, and domain shift, and the logic each approach uses to learn its model. Bias. The problem with ZSL and GZSL tasks is that the imbalance between training and test classes causes a bias towards seen classes at prediction time. Other causes of bias can be high dimensionality and the lack of manifold structure in the features. Several data-generating approaches have worked on alleviating bias by synthesising visual data for the unseen classes. [187] generates semantically rich CNN features of the unseen classes to make the unseen embedding space better known. [106] generates pseudo seen- and unseen-class features and then trains an SVM classifier to mitigate bias. [143] improves the quality of the synthesised examples by using a gradient matching loss. Models combining data generation or reconstruction with other techniques have proved effective in alleviating bias. [97] uses an intermediate space to help discover the geometric structure of features that regression-based projections previously missed. [24] uses a calibrated stacking rule. [145] generates latent features of size 64, with the idea that low-dimensional representations tend to mitigate bias. [153] uses two regressors to compute reconstructions that diminish the bias. Transductive approaches such as [143] are also used to address the bias issue. [159] forces the unseen classes to be projected onto fixed pre-defined points to avoid biased results. Hubness [125]. In high-dimensional mapping spaces, some samples (hubs) might end up falsely as the nearest neighbours of several other points in the semantic space and lead to incorrect predictions. To avoid hubness, [176] proposes a stage-wise bidirectional latent embedding framework. When a mapping is performed from a high-dimensional feature space to a low-dimensional semantic space using regressors, the distinctive features partially fade, whereas in the visual feature space the structures are better preserved; hence, the visual embedding space is well known for mitigating the hubness problem. [174] and [202] use the output of the visual space of the CNN as the embedding space. Domain shift. The zero-shot learning challenge can be considered a domain adaptation problem, because the labelled source data is disjoint from the unlabelled target-domain data; this is called the projection domain shift. Domain adaptation techniques are used to learn the intrinsic relationships among these domains and to transfer knowledge between the two. A considerable amount of work has been done in the transductive setting, which has been successful in overcoming the domain-shift issue. [44], a multi-view embedding framework, performs label propagation on a graph with a heuristic one-stage self-learning approach that assigns points to their nearest data points. [71] introduces a regularised sparse-coding based unsupervised domain adaptation framework that addresses the domain-shift problem. [206] uses a structured prediction method, visually clustering the unseen data. [174] uses a visual constraint on the centre of each class when the mapping is being learned. Since the pure definition of the ZSL challenge forbids access to unseen data during training, several inductive approaches have also tried to solve the problem. [72] proposes reconstructing the visual features to alleviate this issue. [197] performs sparse non-negative matrix factorisation for both domains in a common semantic dictionary. MFMR [193] exploits the manifold structure of the test data with a joint prediction scheme to avoid domain shift. [138] uses entropy minimisation in its optimisation. [86] preserves the semantic similarity structure across seen and unseen classes to avoid domain shift. [87] mitigates the projection domain shift by generating soul samples that are related to the semantic descriptions. These three common issues, together with the weaknesses of each method, are a motivation for choosing a particular approach when solving the ZSL problem. Attribute classifiers are considered customised, since human annotations are used; however, this makes the problem a laborious task with strong supervision. Compatibility learning approaches have the ability to learn directly, eliminating the intermediate step, but often face the bias and hubness problems. Manifold learning addresses this weakness of the semantic learning approaches by preserving the geometrical structure of the features. Cross-modal latent embedding approaches take a different point of view and leverage both visual and semantic features and the similarities and differences between them; they often propose methods for aligning the structures between the two modes of features.
This category of methods also suffers from the hubness problem when dealing with high-dimensional data. Visual space embedding approaches have the advantage of turning the problem into a supervised one by generating or aggregating visual instances for the unseen classes. They are also a favourable approach for the hubness problem, because the high dimensionality of the visual space preserves the information structure better, and for the bias problem, because generating unseen-class samples alleviates the data imbalance. A challenge here is generating more realistic-looking data. Another different setting is transductive learning, which offers solutions to the bias problem by balancing the data with collected unseen samples, yet it is not applicable to many real-world problems, since the original definition of ZSL forbids the use of unseen data during the training phase. Depending on the real-world scenario, each way of solving the problem might be the most appropriate choice; some approaches improve the solution by combining two or more methods to benefit from the strengths of each. In this article, we performed a comprehensive and multifaceted review of the Zero-Shot/Generalised Zero-Shot Learning challenge, its fundamentals, and its variants for different scenarios and applications, such as COVID-19 diagnosis, autonomous vehicles, and similar complex real-world applications which involve fully or partially new concepts that have never or rarely been seen before, besides the barrier of limited annotated datasets. We divided the recent state-of-the-art methods into four space-wise embedding categories. We also reviewed different types of side and auxiliary information. We went through the popular datasets and their corresponding splits for the ZSL problem. The paper also contributed experimental results for some of the common baselines and elaborated on the advantages and disadvantages of each group, as well as the ideas behind the different families of solutions for improving each group. Our evaluation reveals that data synthesis methods and combinational approaches yield the best performance: by synthesising data, the problem shifts to the classic recognition/diagnosis problem, and by combining methods, the model utilises the advantages of each embedding technique. These models even outperform compatibility learning models in the transductive setting. This means that models containing a visual data generation step lead to better results than other approaches and settings. Furthermore, the accuracies improve when the unseen classes are closer to the seen classes in terms of semantic hierarchy and relatedness. Finally, we reviewed the current and potential real-world applications of ZSL and GZSL in the near future. To the best of our knowledge, such a comprehensive and detailed technical review and categorisation of ZSL methodologies, together with an efficient solution for the recent challenge of the COVID-19 pandemic, has not been done before; hence, we expect it to be helpful in developing new research directions in the AI and health-related research communities. The NMF is computed as $\min_{D, H \ge 0} \|X - DH\|_F^2$, where $D$ and $H$ are the dictionary and the latent representation of the matrix $X$, respectively, and the regularisation weights are hyperparameters. SAE [72], or Semantic AutoEncoder, uses an autoencoder that combines two linear mapping functions, one for the visual space and the other for the semantic space. In this way, the decoded visual feature produces semantically meaningful features after the mapping to the semantic space.
The objective to be minimised is $\min \|X - W^{*} S\|_F^2 + \lambda \|W X - S\|_F^2$, where $W^{*}$ and $W$ are the decoder and encoder projection matrices, respectively, and $\lambda$ is a hyperparameter. [37] combines a regression function that maps semantic features to the visual space with a knowledge transfer function that maps the textual descriptions to the visual space. Some approaches use non-linear compatibility functions to solve the ZSL challenge. CMT [157] uses a two-layer neural network, similar to common MLP networks [131], that minimises the objective function $L = \sum_{y \in \mathcal{Y}_{tr}} \sum_{x \in \mathcal{X}_y} \| \phi(y) - W^{(2)} \tanh(W^{(1)} \theta(x)) \|^2$ (A.10), with $\Theta = (W^{(1)}, W^{(2)})$. GFZSL [171] introduces both linear and non-linear regression models in a generative approach, producing a probability distribution for each class. It uses MLE to estimate the seen-class parameters and two regression functions for the unseen categories, $\mu_y = f_{\mu}(\phi(y))$ (A.11) and $\sigma_y^2 = f_{\sigma}(\phi(y))$ (A.12), where $\mu_y$ is the Gaussian mean vector and $\sigma_y^2$ gives the diagonal covariance matrix, both predicted from the attribute vector. In its transductive setting, it uses Expectation-Maximisation (EM), which works like estimating a Gaussian Mixture Model (GMM) of the unlabelled data in an iterative manner; the inferred labels are then included in the next iterations. LATEM [185] learns several mappings and selects one to be the latent variable for a given pair of image and class; the selected latent embedding learns a piece-wise non-linear compatibility function alongside a ranking loss. Its compatibility function is $F(x, y) = \max_{1 \le i \le K} \theta(x)^{\top} W_i \phi(y)$ (A.13). DEM [202] and WAC-Kernel [36] learn a non-linear mapping in the inverse direction from different types of class embeddings, e.g. textual data; WAC-Kernel uses a kernel method for the integration of side information. The objective function for DEM is $L = \|\theta(x) - f(\phi(y))\|^2 + \lambda \|W\|^2$ (A.14), which looks like a ridge regression. Some of the methods consider cross-modal feature similarity in a mutual space. SSE [204] learns two embedding functions, one learned from the seen-class auxiliary information and the other from the seen data as the target class embedding, and predicts unseen labels by maximising the similarity between histograms: $\hat{y} = \arg\max_{y \in \mathcal{Y}} \, \pi(\theta(x))^{\top} \psi(\phi(y))$ (A.15). SYNC [22] considers the mapping between the semantic space of the external information and the model space, and introduces phantom classes to align the two spaces. The classifier is trained as a sparse linear combination of the classifiers for the phantom classes, $w_c = \sum_{r} s_{cr} v_r$ (A.16), where the real classes and the phantom classes form two weighted graphs and the combination coefficients come from the bipartite graph connecting them. MCZSL [4] combines compatibility learning with Deep Fragment embeddings [67] in a joint space; its visual-part embedding and multi-cue language embedding are produced by two encoders whose parameters are learned jointly, with a hyperparameter weighting the terms. Several methods generate images or image features using different visual data synthesis techniques; some of them are VAE-based [69]. [145] learns latent-space features and class embeddings by training a VAE [69] for both the visual and the semantic modalities, and uses a Cross-Alignment (CA) loss to align the latent distributions through cross-modal reconstruction, where the latent parameters are the predictions of the encoders. [75] adds an extra loss term, named the "feedback loss", to the VAE objective [69]; it works as a discriminator that enforces the similarity of a generated sample to the original distribution. The regressor feedback term penalises the discrepancy between the attributes recovered from a sample generated from random noise and the corresponding class attributes.
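Returning to the SAE objective above, it admits a closed-form solution: setting the gradient to zero yields the Sylvester equation $S S^{\top} W + \lambda W X X^{\top} = (1 + \lambda) S X^{\top}$, which standard solvers handle directly. The sketch below follows that formulation with random placeholder data; the matrix shapes, $\lambda$, and the toy prediction step are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(0)

d, k, n, lam = 512, 85, 500, 0.2           # feature dim, semantic dim, samples, trade-off (assumed)
X = rng.normal(size=(d, n))                # visual features, one column per image
S = rng.normal(size=(k, n))                # class attribute vector of each image

# Solve  S S^T W + lam * W X X^T = (1 + lam) S X^T  for the encoder W (k x d)
A = S @ S.T
B = lam * (X @ X.T)
C = (1.0 + lam) * (S @ X.T)
W = solve_sylvester(A, B, C)

# Zero-shot prediction: encode a test feature and pick the most similar class attribute vector
unseen_attrs = rng.normal(size=(10, k))    # 10 unseen-class prototypes
x_test = rng.normal(size=(d,))
s_hat = W @ x_test
pred = np.argmax(unseen_attrs @ s_hat /
                 (np.linalg.norm(unseen_attrs, axis=1) * np.linalg.norm(s_hat) + 1e-12))
print(pred)
```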
[106] uses a Conditional Variational AutoEncoder (CVAE) [158], conditioned on attributes, alongside the $\ell_2$ norm as the reconstruction loss. The objective function of a CVAE is $\mathcal{L} = \mathbb{E}_{q_{\phi}(z|x,c)}[\log p_{\theta}(x|z,c)] - D_{KL}(q_{\phi}(z|x,c)\,\|\,p(z|c))$, where $c$ is the condition. It then trains an SVM classifier [28] for the unseen categories. Other approaches introduce GAN-based [48] methods. GAZSL [211] adds a visual pivot regulariser to the GAN's objective function; this regulariser aims to reduce the noise of the Wikipedia articles. f-CLSWGAN [187] combines three conditional GAN [48] variants, a GAN, a conditional WGAN [51], and a classification loss, naming the method f-CLSWGAN; the classification loss acts like a regulariser that enhances the generated features, weighted by a hyperparameter. CANZSL [26] uses a GAN for generating visual features and an inverse GAN to project them back to the semantic space; in this way, the produced features are consistent with their corresponding semantic features. f-VAEGAN-D2 [189] introduces a generative model that integrates a VAE and a WGAN. In this model, the decoder of the VAE and the generator of the WGAN are the same component, and there are two discriminators ($D_1$, $D_2$). The full objective function to be optimised combines the VAE objective with the two WGAN objectives associated with the discriminators $D_1$ and $D_2$.
semantic relatedness for knowledge transfer An embarrassingly simple approach to zero-shot learning Zero-shot image classification using coupled dictionary embedding Chronic eosinophilic pneumonia: A pediatric case Object detection and localization system based on neural networks for robo-pong International Symposium on Mechatronics and its Applications, ISMA 2008 Learning with hierarchical-deep models Term-weighting approaches in automatic text retrieval. Information processing & management 24 Gradient matching generative networks for zero-shot learning A generalized representer theorem Generalized zero-and few-shot learning via aligned variational autoencoders Lung infection quantification of covid-19 in ct images with deep learning Augmented attribute representations Matching local self-similarities across images and videos Scaling human-object interaction recognition through zero-shot learning Zero-shot sketch-image hashing Avatar-net: Multi-scale zero-shot style transfer by feature decoration Radiological findings from 81 patients with covid-19 pneumonia in wuhan, china: a descriptive study. The Lancet Infectious Diseases Bi-semantic reconstructing generative network for zero-shot learning zero-shot" super-resolution using deep internal learning Very deep convolutional networks for large-scale image recognition Prototypical networks for few-shot learning Zero-shot learning through cross-modal transfer Learning structured output representation using deep conditional generative models, in: Advances in neural information processing systems Transductive unbiased embedding for zero-shot learning Going deeper with convolutions A real-time ball detection approach using convolutional neural networks Adversarial zero-shot learning with semantic augmentation Shared features for multiclass object detection Learning robust visual-semantic embeddings Improving one-shot learning through fusing side information Large margin methods for structured and interdependent output variables Ranking with ordered weighted pairwise classification Building a bird recognition app and large scale dataset with citizen scientists: The fine print in finegrained dataset collection Pneumonia detection using cnn based feature extraction Attention is all you need A simple exponential family framework for zero-shot learning Matching networks for one shot learning The caltech-ucsd birds-200-2011 dataset Transductive zero-shot learning with visual structure constraint Relational knowledge transfer for zero-shot learning Zero-shot visual recognition via bidirectional latent embedding Zeroshot video object segmentation via attentive graph neural networks A survey of zero-shot learning: Settings, methods, and applications A unified probabilistic approach modeling relationships between attributes and objects Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases Zero-shot recognition via semantic embeddings and knowledge graphs A discriminative feature learning approach for deep face recognition Large scale image annotation: learning to rank with joint word-image embeddings Clinical and ct features in pediatric patients with covid-19 infection: Different points from adults Latent embeddings for zero-shot classification Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly Feature generating networks for zero-shot learning Zero-shot learning-the good, the bad and the ugly f-vaegan-d2: A feature generating framework 
for any-shot learning Attentive region embedding network for zero-shot learning Deep zero-shot learning for scene sketch Transductive visual-semantic embedding for zero-shot learning Matrix tri-factorization with manifold regularizations for zero-shot learning Attribute hashing for zero-shot image retrieval Cross-domain matching with squared-loss mutual information Xlnet: Generalized autoregressive pretraining for language understanding Zero-shot classification with discriminative semantic representation learning Chest ct manifestations of new coronavirus disease 2019 (covid-19): a pictorial review Attribute-based transfer learning for object categorization with zero/one training example Zero-shot emotion recognition via affective structural embedding Covid-19 screening on chest x-ray images using deep learning based anomaly detection Learning a deep embedding model for zero-shot learning Unsupervised x-ray image segmentation with task driven generative adversarial networks Zero-shot learning via semantic similarity embedding Zero-shot learning via joint latent similarity embedding Zero-shot recognition via structured prediction Zero-shot learning posed as a missing data problem Time course of lung changes at chest ct during recovery from coronavirus disease 2019 (covid-19) Dont even look once: Synthesizing features for zero-shot detection Generalized zero-shot recognition based on visually semantic embedding A generative adversarial approach for zero-shot learning from noisy texts Aligning books and movies: Towards storylike visual explanations by watching movies and reading books

In this appendix, we provide a concise overview of the main specifications, mathematical formulas, and notations of the 26 state-of-the-art methods that we discussed and compared during this research, in a top-down manner.

DAP [80], the acronym of Direct Attribute Prediction, first learns the posteriors of the attributes and then estimates

$$f(x) = \operatorname*{argmax}_{l=1,\dots,L} \prod_{m=1}^{M} \frac{p(a_m^{z_l} \mid x)}{p(a_m^{z_l})},$$

where $L$ and $M$ are the number of the unseen classes $z_l$ and of the attributes $a$, respectively, $a_m^{z_l}$ is the $m$-th attribute of the class $z_l$, $p(a_m \mid x)$ is the attribute estimated by the attribute classifier for the image $x$, and $p(a_m)$ is the attribute prior computed over the training classes and used in the MAP prediction.

IAP [80] is an indirect approach, as it first learns the posteriors of the seen classes and then uses them to compute the posteriors of the attributes:

$$p(a_m \mid x) = \sum_{k=1}^{K} p(a_m \mid y_k)\, p(y_k \mid x),$$

where $K$ is the number of training classes, $p(a_m \mid y_k)$ is the pre-defined attribute association of the seen class $y_k$, and $p(y_k \mid x)$ is the probabilistic multi-class classifier to be learned.

CONSE [113] takes a probabilistic approach and predicts an unseen class by a convex combination of the class label embedding vectors. It first learns a classifier $p(y \mid x)$ on the training samples, in which $\hat{y}(x,t)$ denotes the $t$-th most probable training label for the sample $x$. It then computes a combination of the semantic embeddings $s(\cdot)$, weighted by these probabilities, to embed a given unseen image:

$$f(x) = \frac{1}{Z} \sum_{t=1}^{T} p\big(\hat{y}(x,t) \mid x\big)\, s\big(\hat{y}(x,t)\big).$$

In this function, $Z = \sum_{t=1}^{T} p(\hat{y}(x,t) \mid x)$ is the normalisation factor, and $f(x)$ combines the semantic vectors so that the unseen label can be inferred by a nearest-neighbour search in the semantic space.
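To make the attribute-posterior prediction rule above concrete, the following minimal Python/NumPy sketch computes DAP-style scores for unseen classes from the outputs of independent attribute classifiers. It is not the original DAP implementation; the variable names (attr_prob, class_attr, attr_prior) and the toy data are assumptions introduced purely for illustration.

import numpy as np

def dap_predict(attr_prob, class_attr, attr_prior):
    """DAP-style inference (illustrative sketch).

    attr_prob  : (M,)   p(a_m = 1 | x), outputs of M independent attribute classifiers
    class_attr : (L, M) binary attribute signatures a_m^{z_l} of the L unseen classes
    attr_prior : (M,)   empirical priors p(a_m = 1) computed over the training classes
    Returns the index of the highest-scoring unseen class.
    """
    # p(a_m^{z_l} | x): use p(a|x) if the class has the attribute, 1 - p(a|x) otherwise
    p_ax = np.where(class_attr == 1, attr_prob, 1.0 - attr_prob)    # (L, M)
    p_a = np.where(class_attr == 1, attr_prior, 1.0 - attr_prior)   # (L, M)
    # log-domain product over attributes of p(a_m^{z_l} | x) / p(a_m^{z_l})
    eps = 1e-12
    scores = np.sum(np.log(p_ax + eps) - np.log(p_a + eps), axis=1)  # (L,)
    return int(np.argmax(scores))

# Toy usage: 5 attributes, 3 unseen classes
attr_prob = np.array([0.9, 0.2, 0.7, 0.1, 0.6])
class_attr = np.array([[1, 0, 1, 0, 1],
                       [0, 1, 0, 1, 0],
                       [1, 1, 1, 0, 0]])
attr_prior = np.full(5, 0.5)
print(dap_predict(attr_prob, class_attr, attr_prior))

IAP differs only in how attr_prob is obtained: the attribute posteriors are first computed from a seen-class classifier through the summation given above, after which the same scoring rule applies.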
Linear corresponding functions are the simplest mapping functions and are typically used to map visual features to semantic spaces in vector form. If the mapping parameters are in the form of a matrix, the function is called a bi-linear corresponding (compatibility) function. These approaches often include other losses alongside the main mapping function.

ESZSL [137] introduces a better regulariser and optimises a bi-linear objective function that admits a closed-form solution:

$$\min_{W}\; \| X^{\top} W S - Y \|_{\mathrm{Fro}}^{2} \;+\; \gamma \| W S \|_{\mathrm{Fro}}^{2} \;+\; \lambda \| X^{\top} W \|_{\mathrm{Fro}}^{2} \;+\; \beta \| W \|_{\mathrm{Fro}}^{2},$$

where $\gamma$, $\lambda$, and $\beta$ are the hyper-parameters. The first two regularisation terms are the Frobenius norms of the projected attribute features and visual features, respectively, and the third term is a weight-decay penalty on the matrix $W$. A related approach proposes an $\ell_{1,2}$-norm-based loss function and an optimiser based on [137] to help suppress the noise in textual data.

ALE optimises the ranking loss of [167] with a bi-linear mapping (compatibility) function. The objective function used in ALE is similar to the unregularised structured SVM (SSVM) [166]:

$$\frac{1}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \big[\, \Delta(y_n, y) + F(x_n, y; W) - F(x_n, y_n; W) \,\big],$$

where $F(\cdot)$ is the compatibility function, $W$ is the matrix whose dimensions are those of the image and label embeddings, and $\Delta$ is the loss of the mapping function. In spite of having different losses, the inspiration comes from the WSABIE algorithm [183]. In ALE, a rank-1 loss with a multi-class objective is used instead of weighting all of the ranks.

SJE [7], similar to ALE, learns a bi-linear compatibility function using the unregularised structural SVM objective function [166], and trains the model on different supervised and unsupervised class embeddings.

DeViSE [43] uses the combination of dot-product similarity and the hinge rank loss of [183] as its objective function:

$$\sum_{j \neq \text{label}} \max\!\big(0,\; \text{margin} - t_{\text{label}}^{\top} M\, v(x) + t_{j}^{\top} M\, v(x)\big).$$

Here, the margin is a hyperparameter and the $t_j$ are randomly selected word embeddings. Another method also uses a bi-linear corresponding function for a part-based cross-modal framework: the visual part detectors detect bird parts in the images, while the zero-shot classifier performs prediction on the previously detected visual bird parts based on textual side information.

DSRL [197] uses a non-negative sparse matrix factorisation for better feature alignment while learning the compatibility function, and uses label propagation to predict the unseen classes.
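As a concrete illustration of the bi-linear ESZSL objective discussed above, the sketch below uses the closed-form minimiser reported in [137]. It is a minimal Python/NumPy example under assumed matrix shapes (X, Y, and S as described in the comments), not code released with this review.

import numpy as np

def eszsl_train(X, Y, S, gamma=1.0, lam=1.0):
    """Closed-form ESZSL solution (illustrative sketch).

    X : (d, N) visual features of N training images
    Y : (N, z) ground-truth label matrix over the z seen classes
    S : (a, z) attribute signatures of the seen classes
    Returns W : (d, a), the bi-linear compatibility matrix.
    """
    d = X.shape[0]
    a = S.shape[0]
    left = np.linalg.inv(X @ X.T + gamma * np.eye(d))   # (d, d)
    right = np.linalg.inv(S @ S.T + lam * np.eye(a))     # (a, a)
    return left @ X @ Y @ S.T @ right                    # (d, a)

def eszsl_predict(x, W, S_unseen):
    """Score the unseen classes for one image x (d,), given signatures S_unseen (a, L)."""
    return int(np.argmax(x @ W @ S_unseen))

# Toy usage with random data: 10-dim features, 4 seen classes, 6 attributes, 3 unseen classes
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
Y = np.eye(4)[rng.integers(0, 4, 50)]
S = rng.integers(0, 2, size=(6, 4)).astype(float)
W = eszsl_train(X, Y, S)
S_unseen = rng.integers(0, 2, size=(6, 3)).astype(float)
print(eszsl_predict(X[:, 0], W, S_unseen))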
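Similarly, the DeViSE-style hinge rank loss used by the bi-linear compatibility methods above can be written in a few lines. The sketch below is a simplified, illustrative Python/NumPy version, where M is the learned linear map and t_negs are randomly sampled word embeddings of incorrect labels (the sampling step itself is omitted and all shapes are assumptions for this example).

import numpy as np

def hinge_rank_loss(v, M, t_true, t_negs, margin=0.1):
    """DeViSE-style hinge rank loss (illustrative sketch).

    v      : (d,)   visual feature of the image
    M      : (k, d) learned linear map from visual space to the word-embedding space
    t_true : (k,)   word embedding of the correct label
    t_negs : (J, k) randomly sampled word embeddings of incorrect labels
    """
    proj = M @ v                     # (k,) projected image
    pos = t_true @ proj              # dot-product similarity with the correct label
    negs = t_negs @ proj             # (J,) similarities with the negative labels
    return float(np.sum(np.maximum(0.0, margin - pos + negs)))

# Toy usage: 300-dim visual features, 50-dim word embeddings, 8 negative labels
rng = np.random.default_rng(1)
v = rng.normal(size=300)
M = rng.normal(size=(50, 300)) * 0.01
t_true = rng.normal(size=50)
t_negs = rng.normal(size=(8, 50))
print(hinge_rank_loss(v, M, t_true, t_negs))

At test time, an unseen image is classified by projecting it with M and retrieving the nearest label embedding among the unseen classes.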