Combining Visual Recognition and Computational Linguistics
Learning from Limited Labeled Data: Zero-Shot and Few-Shot Learning

A dissertation submitted towards the degree Doctor of Engineering of the Faculty of Mathematics and Computer Science of Saarland University

by Yongqin Xian
Saarbrücken, 2020

Day of Colloquium: 7th of July, 2020
Dean of the Faculty: Prof. Dr. Thomas Schuster, Saarland University, Germany

Examination Committee
Chair: Prof. Dr. Antonio Krüger
Reviewer, Advisor: Prof. Dr. Zeynep Akata
Reviewer, Advisor: Prof. Dr. Bernt Schiele
Reviewer: Prof. Trevor Darrell, Ph.D.
Reviewer: Prof. Barbara Caputo, Ph.D.
Academic Assistant: Dr. Paul Swoboda

ABSTRACT

Human beings have the remarkable ability to recognize novel visual concepts after observing only a few or even zero examples of them. Deep learning, however, often requires a large amount of labeled data to achieve good performance. Labeled instances are expensive, difficult, and sometimes even infeasible to obtain because the distribution of training instances among labels naturally exhibits a long tail. Therefore, it is of great interest to investigate how to learn efficiently from limited labeled data. This thesis concerns an important subfield of learning from limited labeled data, namely low-shot learning. The setting assumes the availability of many labeled examples from known classes, and the goal is to learn novel classes from only a few (few-shot learning) or zero (zero-shot learning) training examples. To this end, we have developed a series of multi-modal learning approaches to facilitate the knowledge transfer from known classes to novel classes for a wide range of visual recognition tasks, including image classification, semantic image segmentation and video action recognition. More specifically, this thesis makes the following contributions. First, as there is no agreed-upon zero-shot image classification benchmark, we define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets. Second, in order to tackle the labeled data scarcity, we propose feature generation frameworks that synthesize data in the visual feature space for novel classes. Third, we extend zero-shot learning and few-shot learning to the semantic segmentation task and propose a challenging benchmark for it. We show that incorporating semantic information into a semantic segmentation network is effective in segmenting novel classes. Finally, we develop better video representations for the few-shot video classification task and leverage weakly-labeled videos through an efficient retrieval method.

ZUSAMMENFASSUNG

Human beings have the remarkable ability to recognize novel visual concepts after observing only a few or no examples of them. Deep learning, however, often requires a large amount of labeled data to achieve good performance. Labeled instances are expensive, difficult, and sometimes even infeasible to obtain because the distribution of training instances among labels naturally exhibits a long tail. It is therefore of great interest to investigate how to learn efficiently from limited labeled data. This thesis concerns an important subfield of learning from limited labeled data, namely low-shot learning. The setting assumes the availability of many labeled examples from known classes, and the goal is to learn novel classes from only a few (few-shot learning) or zero (zero-shot learning) training examples.
To this end, we have developed a series of multi-modal learning approaches to facilitate the knowledge transfer from known classes to novel classes for a wide range of visual recognition tasks, including image classification, semantic image segmentation and video action recognition. More specifically, this thesis mainly makes the following contributions. First, as there is no agreed-upon benchmark for zero-shot image classification, we define a new benchmark by unifying both the evaluation protocols and the data splits of publicly available datasets. Second, to tackle the labeled data scarcity, we propose a feature generation framework that synthesizes data in the visual feature space for novel classes. Third, we extend zero-shot learning and few-shot learning to the semantic segmentation task and propose a challenging benchmark for it. We show that incorporating semantic information into a semantic segmentation network is effective in segmenting novel classes. Finally, we develop better video representations for the few-shot video classification task and leverage weakly-labeled videos through an efficient retrieval method.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my sincere gratitude to Prof. Bernt Schiele and Prof. Zeynep Akata for supervising my PhD thesis. Both of them have been great advisors. I am grateful to Bernt for his constant support and inspiration throughout this time. He has not only provided me with invaluable advice in computer vision research, but also taught me how to be a good scientist as well as a good father by setting a role model himself. Likewise, I would like to thank Zeynep for guiding me into the wonderful journey of computer vision research. She has been extremely helpful, giving me a lot of critical hands-on supervision and encouragement. None of the research presented in this thesis would have been possible without her. I am fortunate and thankful to have both of them as my advisors. I am also truly thankful to the other members of my dissertation committee. Thanks to Prof. Trevor Darrell and Prof. Barbara Caputo for serving as external reviewers and attending my defense during the difficult times caused by COVID-19. Thanks to Prof. Antonio Krüger for his quick responses and for agreeing to chair the defense. Thanks to Dr. Paul Swoboda for being the academic assistant. Their invaluable feedback and discussion on my thesis have helped and inspired me a lot. I would also like to thank my lovely colleagues at MPII, not only for the inspiring discussions and collaborations concerning research but also for sharing a lot of happy moments outside of work: Connie Balzert, Apratim Bhattacharyya, Rakshith Shetty, Philipp Müller, Eldar Insafutdinov, Anna Kukleva, Dr. Mykhaylo Andriluka, Dr. Gerard Pons-Moll, Prof. Wei-Chen Chiu, Dr. Anna Khoreva, Dr. Jan-Hendrik Lange, Dr. Wenbin Li, Prof. Siyu Tang, Prof. Shanshan Zhang, and Dr. Xucong Zhang. I owe particular thanks to Connie for helping me handle many difficult matters regarding my life in Germany. Thanks to Jan-Hendrik and Philipp for helping me translate many German letters into English. I also shared an office with Rakshith, and we had many fruitful discussions about research. Thank you, Rakshith! It was a great pleasure to work with these talented people.
Furthermore, I would like to thank my collaborators, without whom I would have had no chance to complete this thesis: Saurabh Sharma, Dr. Gaurav Sharma, Dr. Yang He, Prof. Matthias Hein, Dr. Quynh Nguyen Ngoc, Prof. Christoph H. Lampert, Prof. Lorenzo Torresani, Bruno Korbar and Dr. Matthijs Douze. I am particularly grateful to Lorenzo for supervising my internship at Facebook AI in Boston. Similarly, my thanks go to the students that I had the chance to supervise or work with: Yue Fan, Subhabrata Choudhury, Tobias Lorenz, Wenjia Xu and Miaoran Zhang. Last but not least, I am deeply thankful to my family and friends who have been constantly loving and supporting me. I would like to especially thank my wife Dr. Yijuan Qiao for her encouragement and sacrifice. My deepest gratitude also goes to my parents and brother, who love me unconditionally. This thesis is dedicated to my beloved daughter Odelia, who was born at the end of my PhD.

CONTENTS

1 Introduction
  1.1 Challenges of learning from limited labeled data
    1.1.1 Zero-shot image classification
    1.1.2 Few-shot image classification
    1.1.3 Zero-shot and few-shot learning tasks beyond image classification
  1.2 Contributions of the thesis
    1.2.1 Contributions to zero-shot image classification
    1.2.2 Contributions to few-shot image classification
    1.2.3 Contributions to zero-shot and few-shot tasks beyond image classification
  1.3 Outline of the thesis
2 Related work
  2.1 Zero-shot image classification
    2.1.1 Problem definition
    2.1.2 Evaluation protocol
    2.1.3 A literature review of zero-shot approaches
    2.1.4 Relations to our work
  2.2 Few-shot image classification
    2.2.1 Problem definition
    2.2.2 Evaluation protocols
    2.2.3 A literature review of few-shot approaches
    2.2.4 Relations to our work
  2.3 Zero-shot and few-shot tasks beyond image classification
    2.3.1 Semantic image segmentation
    2.3.2 Video action recognition
    2.3.3 Relations to our work
3 Latent Embedding for Zero-Shot Image Classification
  3.1 Introduction
  3.2 Background: Bilinear Joint Embeddings
  3.3 Latent Embeddings Model (LatEm)
    3.3.1 Objective
    3.3.2 Optimization
    3.3.3 Model selection
    3.3.4 Discussion
  3.4 Experiments
    3.4.1 Zero-shot Learning Experiments
    3.4.2 Generalized Zero-shot Learning Setting
  3.5 Conclusions
4 Zero-Shot Learning: the Good, the Bad and the Ugly
  4.1 Introduction
  4.2 Related Work
  4.3 Evaluated Methods
    4.3.1 Learning Linear Compatibility
    4.3.2 Learning Nonlinear Compatibility
    4.3.3 Learning Intermediate Attribute Classifiers
    4.3.4 Hybrid Models
    4.3.5 Transductive Zero-Shot Learning Setting
  4.4 Datasets
    4.4.1 Attribute Datasets
    4.4.2 Large-Scale ImageNet
  4.5 Evaluation Protocol
    4.5.1 Image and Class Embedding
    4.5.2 Dataset Splits
    4.5.3 Evaluation Criteria
  4.6 Experiments
    4.6.1 Zero-Shot Learning Experiments
    4.6.2 Generalized Zero-Shot Learning Results
    4.6.3 Transductive (Generalized) Zero-Shot Learning
  4.7 Conclusion
5 Feature Generating Networks for Zero-Shot Image Classification
  5.1 Introduction
  5.2 Related work
  5.3 Feature Generation & Classification in ZSL
    5.3.1 Feature Generation
    5.3.2 Classification
  5.4 Experiments
    5.4.1 Comparing with State-of-the-Art
    5.4.2 Analyzing f-xGAN Under Different Conditions
    5.4.3 Large-Scale Experiments
    5.4.4 Feature vs Image Generation
  5.5 Conclusion
6 Enhanced Feature Generation Frameworks for Low-Shot Learning
  6.1 Introduction
  6.2 Related Work
  6.3 f-VAEGAN-D2 Model
    6.3.1 Baseline Feature Generating Models
    6.3.2 Our f-VAEGAN-D2 Model
  6.4 Experiments
    6.4.1 (Generalized) Zero-shot Learning
    6.4.2 (Generalized) Few-shot Learning
    6.4.3 Interpreting Synthesized Features
  6.5 Conclusion
7 Zero-Label and Few-Label Semantic Segmentation
  7.1 Introduction
  7.2 Related Works
  7.3 Approach
    7.3.1 Semantic Projection Network (SPNet)
    7.3.2 Baseline: Hinge Visual-Semantic Loss (HVSL)
  7.4 Experiment
    7.4.1 Zero-Label Semantic Segmentation Task
    7.4.2 Few-Label Semantic Segmentation Task
    7.4.3 Qualitative Results
  7.5 Conclusions
8 Generalized Many-Way Few-Shot Video Classification
  8.1 Introduction
  8.2 Related work
  8.3 R-3DFSV Approach
    8.3.1 3D CNN for FSV (3DFSV)
    8.3.2 Retrieval-enhanced 3DFSV (R-3DFSV)
  8.4 Experiments
    8.4.1 Experimental settings
    8.4.2 Comparing with the state-of-the-art
    8.4.3 Increasing the number of classes in FSV
    8.4.4 Evaluating base and novel classes in GFSV
    8.4.5 Ablation study and retrieved clips
    8.4.6 Qualitative results
  8.5 Conclusion
9 Conclusions and future perspectives
  9.1 Discussion of contributions
  9.2 Future Perspectives
    9.2.1 Zero-shot image classification
    9.2.2 Few-shot image classification
    9.2.3 Zero-shot and few-shot learning beyond image classification
    9.2.4 A broader view on the topic
List of Figures
List of Tables
Bibliography

1 INTRODUCTION

Contents
  1.1 Challenges of learning from limited labeled data
    1.1.1 Zero-shot image classification
    1.1.2 Few-shot image classification
    1.1.3 Zero-shot and few-shot learning tasks beyond image classification
  1.2 Contributions of the thesis
    1.2.1 Contributions to zero-shot image classification
    1.2.2 Contributions to few-shot image classification
    1.2.3 Contributions to zero-shot and few-shot tasks beyond image classification
  1.3 Outline of the thesis

The demand for automated understanding of visual data (videos and images) has become more urgent than ever. Billions of images and videos uploaded to the internet demand autonomous analysis and understanding. Self-driving vehicles need a visual perception system to detect pedestrians, traffic signs and other obstacles. Hospitals need automated analysis of medical imaging data to improve clinical efficiency. Robots need to understand complex visual scenes to interact with the environment.

In general, solving a computer vision task consists of two necessary steps: encoding and decoding. Given an image or video as input, the encoding step extracts features from the input and represents them as a compact vector. A lot of previous computer vision studies focus on designing hand-crafted features to encode an image or video. The decoding step extracts "patterns" from the feature vector and produces a decision depending on what the end task is. Machine learning is often applied in this step to learn the patterns in a principled way. Recent advances in computer vision are mainly due to the success of deep learning, which learns encoding and decoding simultaneously with a deep neural network optimized with task-specific losses.

Despite the substantial progress, current computer vision algorithms still fail to generalize to the variety of visual environments in real-world applications. A limitation of deep learning is that it requires massive amounts of labeled data to achieve high performance. However, labeled instances are expensive, difficult and sometimes even infeasible to obtain. As shown in Figure 1.1, in almost all scenarios, there is an exponential decay in the number of samples per class, i.e., only a few classes contain a large number of samples whereas most classes are sparsely populated.

Figure 1.1: In almost all real-world settings, the number of samples per category follows a skewed distribution, i.e., a few categories have a large number of samples while most categories have only a small number of samples (as shown in the left figure). The scarcity of samples results in poor generalization performance of powerful deep learning methods, which often require a huge amount of labeled data to train. In this thesis, we address the challenges of learning with limited labeled data in the scenarios of image classification (e.g. He et al., 2016), semantic segmentation (e.g. Long et al., 2015) and video classification (e.g. Tran et al., 2018).

It becomes almost impossible to collect enough training examples for every class, leading to inferior performance of deep neural networks. Consider a real-world example from the autonomous driving field. In order to train a reliable visual perception system for self-driving cars, current algorithms need a vast amount of labeled examples that cover all road conditions, weather conditions, times of driving, and obstacles. This is obviously infeasible because there are many circumstances that rarely occur, e.g., big rocks on snowy roads.
As a consequence, the self-driving car is very likely to make wrong decisions when it encounters these rare circumstances. In contrast, humans naturally possess the ability to learn novel concepts from a small number of examples. This is not only attributed to the computational power of the human brain, but also to its ability to re-use previously learned knowledge. Attaining such an ability of rapid learning is particularly appealing for artificial intelligence (AI) and will push AI one step further towards human-level intelligence.

The goal of this thesis is thus to address the labeled data scarcity by developing machine learning methods that can be trained with limited labeled data. Our key idea is to re-use information from related tasks, transfer knowledge across different modalities, and leverage unlabeled data to minimize the human supervision on novel tasks. More specifically, we aim to enable deep neural networks to generalize to novel concepts with as few labeled examples as possible. In order to mimic the way humans learn new concepts, i.e., by re-using previously gained knowledge, we divide the classes of interest into disjoint base and novel classes. Each of the base classes has enough training examples and serves as previously learned knowledge. In contrast, the novel classes have only limited training examples, and the task is to develop methods that generalize well to unseen examples from those novel classes. This thesis concerns both few-shot learning, where each novel class possesses a few examples (up to 10 examples per class), and zero-shot learning, where novel classes have no labeled examples at all. In this section, we discuss the challenges in zero-shot image classification, few-shot image classification, and their applications in other computer vision tasks, e.g., semantic segmentation and video action recognition. Finally, we summarize how this thesis contributes to the fields of zero-shot and few-shot learning.

1.1 challenges of learning from limited labeled data

Machine learning methods, typically deep neural networks, rely on a large labeled dataset to achieve good performance, which makes it difficult to apply AI to real-world settings because collecting labeled data is not always possible (e.g., the skewed distribution of the number of available samples in Figure 1.1). It is thus of great importance to develop machine learning methods that can learn from limited labeled data. A fundamental problem of learning from a small dataset is the risk of overfitting, i.e., a model fits too closely to the limited training examples and fails to generalize to unseen test samples. When the training data is limited, smart sampling of training data, regularization and data augmentation are three classical ways to improve the generalization performance according to statistical learning theory (Bishop, 2006). While conventional machine learning methods draw training examples uniformly, smart sampling aims to select the "best" instances to reduce the amount of required training data. An example of smart sampling is active learning, where the learning algorithm selects the most uncertain samples to annotate given a fixed budget of labeling cost. Recent advances in active learning show that deep learning models can be built with limited labeled data if training examples are smartly selected. However, active learning still requires a huge pool of data from which to select training examples.
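To make the selection step concrete, the following is a minimal sketch of uncertainty-based sample selection under a fixed labeling budget, assuming the current model already produces class probabilities for an unlabeled pool; the function and variable names are illustrative and not tied to a specific active learning method from the literature.

```python
import numpy as np

def select_most_uncertain(pool_probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` pool samples the current model is least certain about.

    pool_probs: (num_pool_samples, num_classes) class probabilities predicted
                by the current model on the unlabeled pool.
    Returns the indices of the samples with the highest predictive entropy,
    which would then be sent to annotators.
    """
    eps = 1e-12  # avoid log(0) for very confident predictions
    entropy = -np.sum(pool_probs * np.log(pool_probs + eps), axis=1)
    return np.argsort(-entropy)[:budget]

# Example round with a labeling budget of 100 images:
# to_annotate = select_most_uncertain(model_softmax_outputs, budget=100)
```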
Another principled way to reduce overfitting is regularization, which refers to techniques that prevent learning algorithms from fitting too closely to the training examples. Typical regularization techniques achieve this by reducing the model complexity, e.g., an L2 regularizer. For deep neural networks, popular regularizers include dropout, which averages multiple models, pretraining on ImageNet, which provides a good initialization, and early stopping of the optimization, which avoids fitting the noise in the dataset. In addition, data augmentation addresses the labeled data scarcity by automatically generating more training data without manually collecting them. For visual understanding tasks on images, it has been shown that simple horizontal flipping and cropping of images can successfully increase the diversity of the training data and significantly improve the performance. Unfortunately, those simple techniques are still insufficient to obtain good performance in the extreme case of lacking labeled data, e.g., when there is only one example per class.

In addition to those classical approaches, emerging directions for learning with limited labeled data include weakly supervised learning and self-supervised learning. Those directions do not directly tackle the overfitting issue on the small training set like the classical approaches. Instead, they aim to learn from a big dataset that is weakly annotated or not annotated at all such that human supervision is reduced. For example, Oquab et al. (2015) propose an object detection approach with image-level labels, avoiding the expensive bounding box annotation. Self-supervised learning (Chen et al., 2020) completely eliminates human supervision by learning from an unlabeled dataset. In this thesis, we mainly focus on data augmentation and regularization approaches. Weakly supervised and self-supervised learning are promising directions to explore in the future but are not within the scope of this thesis. In the following subsections we identify the specific challenges of the tasks we want to solve and also discuss how we tackle those challenges in this thesis.

1.1.1 Zero-shot image classification

Zero-shot learning refers to the ability to predict novel classes without accessing any of their training examples. In the context of image classification, the task is to predict the class label of a given image as one of the novel classes. For simplicity, this thesis only studies the case where each image contains a single object class. This capability can be highly valuable in fine-grained classification, where annotating labeled data requires expert knowledge. Here are a few challenges we aim to address in the thesis.

Multi-modal learning. In order to associate novel classes with base classes, we assume every class has some semantic information available, e.g., attributes or a textual description. Therefore, zero-shot learning is naturally a multi-modal learning problem. How to learn the correlation between two or even more modalities becomes a challenging research topic. Previous works (Akata et al., 2015b, 2013) often learn a bilinear compatibility function, which is limited in capturing the complex correlation between the vision and language modalities. The zero-shot learning performance relies on the efficiency of knowledge transfer via multi-modal learning methods.

Limitation of current zero-shot benchmarks.
Although the number of publications in zero-shot learning is steadily increasing, there is no agreed evaluation protocol, leading to incomparable results. In addition, novel classes in existing benchmarks are present in ImageNet, which is used for feature pretraining, violating the principle of zero-shot learning. Finally, current benchmarks only evaluate on novel classes and ignore base classes at test time, which is unrealistic. Real-world applications require models to perform well on both base and novel classes. There is an urgent demand for a better zero-shot learning benchmark.

Domain shift. Zero-shot learning models are trained on the examples of base classes and evaluated on novel classes without any training examples. Therefore, there is no generalization guarantee on novel classes because their distribution is totally unknown. Zero-shot learning can be particularly challenging if there is a domain gap between the distributions of novel and base classes. How to solve the domain shift issue becomes an important challenge in zero-shot learning.

Extreme class imbalance. Zero-shot learning suffers from the extreme case of data imbalance, i.e., base classes have a lot of training examples while novel classes have no training data at all. Existing zero-shot methods essentially fail when evaluated on both base and novel classes because the classifiers have a strong tendency to predict seen classes. One way to address this class imbalance problem is to employ a cost-sensitive loss (Chawla et al., 2004) or to over-sample the minority classes (Chawla et al., 2002). However, these prior solutions are fundamentally not in line with deep learning and zero-shot learning methods.

1.1.2 Few-shot image classification

In zero-shot learning, there is no training example for novel classes, which might be too extreme. In real-world scenarios, it is often more realistic to consider few-shot learning, where a few labeled examples are available for novel classes. Despite the additional training data, few-shot learning remains a difficult task because the number of training examples is still far from enough to learn a deep neural network. In addition to the classical regularization techniques, how can we encourage the models to share knowledge across related tasks?

Risk of overfitting. Due to the small number of training examples from novel classes, directly fine-tuning a deep neural network will result in overfitting, i.e., the model fits exactly to the small training set of novel classes and fails to generalize to unseen examples of the novel classes. Techniques that work well in supervised learning will probably fail in the few-shot learning setting because of overfitting. How to regularize the networks to avoid overfitting when fine-tuning deep neural networks remains an open problem.

Imbalanced classes. In few-shot learning, the number of training examples from base classes is much larger than that from novel classes, resulting in an imbalanced learning problem. Many few-shot learning papers avoid this issue by ignoring the base classes at evaluation time. However, we argue that such an evaluation setting is unrealistic and consider the imbalance issue as one of the challenges we would like to tackle.

Representation learning for few-shot learning. In the supervised learning setting, the goal is to learn a model that generalizes well to unseen examples from the same training task.
The underlying assumption is that the distribution of the test data follows that of the training data, and the generalization error is then guaranteed theoretically. However, few-shot learning aims for a model that generalizes well to novel tasks with a few training examples. Although the conventional representation learning framework works well for the known tasks, it might not generalize well to novel tasks. How to develop efficient representations for few-shot learning remains unknown. What principles make the representation generalize better to novel tasks?

1.1.3 Zero-shot and few-shot learning tasks beyond image classification

The long-tail issue does not only occur in image classification tasks but also in other computer vision tasks. In this thesis, we additionally study the semantic segmentation and video classification tasks in the context of zero-shot and few-shot learning.

Semantic segmentation. The image semantic segmentation task aims to predict a class label for every pixel in the image. This is a challenging structured output learning problem and requires expensive pixel-level labeling. Ordinary semantic segmentation methods fail to handle images which contain novel classes. In order to tackle the long-tail issue, this thesis is interested in a semantic segmentation framework that can make zero-shot predictions on novel classes and perform few-shot learning on novel classes with limited labeled data. Since this is a new task, we face the challenge of how to formally define the problem. In addition, how to transfer knowledge from known classes to novel classes is another challenge as well.

Video classification. The task of video classification is to assign an action class label to a trimmed video. The few-shot learning setting becomes practical in the video domain because annotating videos is more time-consuming and the class distribution is also skewed. In addition to learning the spatial information, we have to model temporal information, which is particularly critical for some video applications. A common challenge in few-shot video learning as well as in ordinary video learning is how to learn representations that encode both temporal and spatial information. In addition, the overfitting risk becomes higher compared to the few-shot image classification task because video models often have a larger capacity than image models.

1.2 contributions of the thesis

In this section, we summarize the contributions of this thesis in three different fields.

1.2.1 Contributions to zero-shot image classification

To tackle the multi-modal learning challenges of zero-shot learning, we propose a novel compatibility learning framework by incorporating latent variables in the compatibility function. Instead of learning a single bilinear function like previous works, we propose to learn a collection of bilinear models while allowing each image-class pair to choose among them. This effectively makes our model non-linear, as the decision boundary, while locally linear, differs across different local regions of the space. In addition, we propose a fast and effective method for model selection by successive pruning of an over-complete initialization. We show that such a strategy is competitive with standard cross-validation based model selection, while being much faster to train.
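For reference, the idea can be sketched with the following compatibility functions, writing φ(x) for the image embedding (as in Chapter 2) and ψ(y) as a generic placeholder for the class embedding; the precise formulation and notation are given in Chapter 3.

```latex
% Single bilinear compatibility between an image x and a class y, as in prior work:
F(x, y) = \phi(x)^\top W \, \psi(y)

% Piece-wise linear compatibility with K latent bilinear maps W_1, \dots, W_K,
% where each image-class pair selects the map that fits it best:
F(x, y) = \max_{1 \le i \le K} \phi(x)^\top W_i \, \psi(y)
```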
We extensively evaluate our novel piece-wise linear model in the zero-shot and generalized zero-shot learning settings on various aspects such as stability, interpretability, and generalizability to seen and unseen classes.

We define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g., pre-training on zero-shot test classes. Our evaluation protocol emphasizes the necessity of tuning hyperparameters of the methods on a validation class split that is disjoint from the training classes, as improving zero-shot learning performance via tuning parameters on test classes violates the zero-shot assumption. We point out that extracting image features via a deep neural network (DNN) pre-trained on a large dataset that contains zero-shot test classes also violates the zero-shot learning idea, as image feature extraction is part of the training procedure. We recommend abstracting away from the restricted nature of zero-shot evaluation and making the task more practical by including training classes in the search space, i.e., the generalized zero-shot learning setting. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset, which we make publicly available both in terms of image features and the images themselves.

We systematically evaluate zero-shot learning across a significant number of datasets and methods. The crux of the matter for all zero-shot learning methods is to associate observed and non-observed classes through some form of auxiliary information which encodes visually distinguishing properties of objects. We thoroughly evaluate zero-shot learning approaches by using multiple splits of several small, medium and large-scale datasets (Patterson and Hays, 2012; Welinder et al., 2010; Lampert et al., 2013; Farhadi et al., 2009; Deng et al., 2009). Therefore, we argue that our work plays an important role in advancing the zero-shot learning field by analyzing the good and bad aspects of the zero-shot learning task as well as proposing ways to eliminate the ugly ones.

Our benchmark paper demonstrates that almost all zero-shot methods fail in the generalized zero-shot learning setting, where the model has to predict both base and novel classes. In order to tackle the imbalance challenge in this setting, we propose a novel conditional generative model, f-CLSWGAN, that synthesizes CNN features of novel classes from their semantic embeddings. Once trained, the feature generator is able to synthesize arbitrarily many features for any class which lacks training examples. We show that data generation in the feature space works much better than in the image space because generating realistic images from semantic embeddings is a much harder task. Across five datasets with varying granularity and sizes, we consistently improve upon the state of the art in both the ZSL and GZSL settings. We demonstrate a practical application for adversarial training and propose GZSL as a proxy task to evaluate the performance of generative models. Our model is generalizable to different deep CNN features, e.g., extracted from GoogleNet or ResNet, and may use different class-level auxiliary information, e.g., sentence, attribute, and word2vec embeddings.
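As an illustration of the feature generation idea, the following is a minimal sketch of a conditional generator that maps a noise vector and a class embedding to a CNN feature vector; layer sizes and names are illustrative placeholders and do not reproduce the exact f-CLSWGAN architecture or its adversarial training loop, which are described in Chapter 5. After training on seen-class features, the synthesized features of a novel class would simply serve as ordinary training data for a softmax classifier.

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Maps a noise vector z and a class embedding c(y) to a CNN feature vector."""

    def __init__(self, noise_dim=128, embed_dim=312, feat_dim=2048, hidden_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, feat_dim),
            nn.ReLU(),  # pooled CNN features (e.g. ResNet) are non-negative
        )

    def forward(self, z, class_embedding):
        return self.net(torch.cat([z, class_embedding], dim=1))

# After (adversarial) training on seen-class features, synthesize as many features
# as needed for a novel class from its class embedding:
generator = ConditionalFeatureGenerator()
class_embedding = torch.rand(1, 312)        # placeholder class embedding c(y)
z = torch.randn(500, 128)                   # one noise vector per synthetic feature
synthetic_features = generator(z, class_embedding.expand(500, -1))
```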
1.2.2 Contributions to few-shot image classification

The success of our feature generation approach encourages us to extend it to the few-shot learning setting, which also suffers from the imbalance issue. To this end, we propose the f-VAEGAN-D2 model, which consists of a conditional encoder, a shared conditional decoder/generator, a conditional discriminator and a non-conditional discriminator. The first three networks learn the conditional distribution of CNN image features given class embeddings by optimizing VAE and WGAN losses on labeled data of seen classes. The last network learns the marginal distribution of CNN image features on the unlabeled features of novel classes. Once trained, our model synthesizes discriminative image features that can be used to augment softmax classifier training. Our empirical analysis on CUB, AWA2, SUN, FLO, and large-scale ImageNet shows that our generated features improve the state of the art in low-shot regimes, i.e., (generalized) zero- and few-shot learning in both the inductive and transductive settings. We demonstrate that our generated features are interpretable by inverting them back to the raw pixel space and by generating visual explanations.

1.2.3 Contributions to zero-shot and few-shot tasks beyond image classification

We introduce novel (generalized) zero-label and few-label semantic image segmentation tasks in realistic settings inspired by zero-shot learning for image classification. In zero-label semantic segmentation (ZLSS), our aim is to segment previously unseen, i.e., novel, classes; in few-label semantic segmentation (FLSS), these novel classes have a small number of labeled training examples. In this work, we also aim for learning without forgetting the previously seen classes, i.e., generalized ZLSS and FLSS. To this end, we propose the semantic projection network (SPNet), an end-to-end semantic segmentation model which maps each image pixel to a semantic word embedding space, where it is projected onto fixed word embeddings to obtain class probabilities, optimizing the cross-entropy loss. We create a benchmark for (generalized) zero- and few-label semantic image segmentation with two challenging datasets, i.e., COCO-Stuff and PASCAL-VOC. Our analysis shows that the SPNet model achieves impressive results both quantitatively and qualitatively in the (generalized) zero-label and few-label tasks. Furthermore, as a side product, our model improves the state of the art in zero-shot image classification, demonstrating that it successfully generalizes to other tasks.

We push the progress of few-shot video classification in three aspects: 1) To learn the temporal information, we revisit spatiotemporal CNNs in the few-shot video classification regime. We develop a 3D CNN baseline that maintains significant temporal information within short clips; 2) We propose to retrieve relevant tag-labeled videos from a large video dataset, i.e., YFCC100M, to circumvent the need for class-labeled videos of novel classes; 3) We extend the current few-shot video classification evaluation by introducing two challenging experimental settings. In the generalized few-shot video classification task, the search space has no restriction in terms of classes. In few-shot video classification with more ways, the search space goes beyond five classes towards all classes.
Our extensive experimental results demonstrate that, on existing settings, spatiotemporal CNNs outperform the state of the art by a large margin, and that, on our proposed settings, weakly-labeled videos retrieved using tags successfully tackle both of our new few-shot video classification tasks.

1.3 outline of the thesis

In this section, we provide an overview of the thesis by briefly summarizing each chapter and drawing connections between them. We also note the respective publications and collaborations with other researchers.

Chapter 2: Related work. This chapter surveys related work that tackles the challenges of learning with limited labeled data, with a focus on the three directions of the thesis, i.e., zero-shot image classification, few-shot image classification, and zero- and few-shot tasks beyond image classification. We discuss how these works relate to the approaches and contributions presented in this thesis. A discussion of related work specific to the following chapters is provided within each chapter.

Chapter 3: Latent Embedding for Zero-Shot Image Classification. In this chapter, we tackle the zero-shot image classification problem by developing a novel compatibility function that learns a non-linear relationship between the image and semantic class embedding spaces. The content of this chapter is an extension of Yongqin Xian's Master Thesis, which was published in CVPR 2016 with the title Latent Embedding for Zero-Shot Image Classification (Xian et al., 2016). The following significant changes have been made in our extension: a comparison with four other state-of-the-art methods, an evaluation in the generalized zero-shot and few-shot settings, and a combination of multiple class embeddings for better performance. Yongqin Xian was the lead author of this paper. It is a collaboration with Gaurav Sharma and the Machine Learning Group of Saarland University.

Chapter 4: Zero-Shot Learning: the Good, the Bad and the Ugly. In this chapter, we show that the existing zero-shot learning evaluation protocols adopted in Chapter 3 and other works are limited. Therefore, we introduce a new zero-shot learning benchmark which resolves the issues of previous protocols. Our new benchmark involves 5 datasets and includes both the zero-shot learning setting, which only predicts novel classes, and the generalized zero-shot learning setting, which predicts both base and novel classes. We provide a better summarization of existing approaches by classifying them into groups and evaluating them under the unified evaluation protocol. The content of this chapter was published in TPAMI 2019 with the title Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly (Xian et al., 2019b), which is an extension of our CVPR 2017 publication Zero-Shot Learning - the Good, the Bad and the Ugly (Xian et al., 2017). Yongqin Xian was the lead author of both papers. It is also a collaboration with Christoph Lampert from IST Austria.

Chapter 5: Feature Generating Networks for Zero-Shot Image Classification. In this chapter, we tackle the issues we observe in Chapter 4. More specifically, we find that almost all zero-shot learning approaches fail to achieve good performance on novel classes in the generalized zero-shot learning setting due to the extremely imbalanced dataset. To this end, we propose a novel generative model that synthesizes visual features for novel classes from their semantic class embeddings.
The generative model is learned on base class data and can be used to synthesize arbitrarily many visual features for novel classes, alleviating the data imbalance issue. The content of this chapter corresponds to the CVPR 2018 publication Feature Generating Networks for Zero-Shot Learning (Xian et al., 2018). Yongqin Xian was the lead author of this paper, while Tobias Lorenz contributed the image generation part. Tobias Lorenz's bachelor thesis at MPI Informatics was co-supervised by Yongqin Xian and Bernt Schiele.

Chapter 6: Enhanced Feature Generation Frameworks for Low-Shot Learning. Based on the success of the feature generation technique described in Chapter 5 on zero-shot learning tasks, we improve the generative model of Chapter 5 in two aspects. First, we combine GANs and VAEs to obtain a stronger generative model that attains the strengths of adversarial and non-adversarial learning. Second, we additionally add a discriminator that learns the marginal distribution of novel classes when their unlabeled data is available. We also propose to interpret generated features by inverting them back into the image pixel space. The content of this chapter corresponds to the CVPR 2019 publication f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning (Xian et al., 2019c). Yongqin Xian was the lead author of this paper, while Saurabh Sharma contributed the feature explanation part.

Chapter 7: Zero-Label and Few-Label Semantic Segmentation. The previous chapters are all about image classification. In this chapter, we introduce a novel image semantic segmentation task that aims to segment novel classes that have zero or very few training examples. We propose an approach called SPNet that projects each pixel into a semantic embedding space such that knowledge can be transferred from base classes to novel classes. We show that our method can tackle both the zero-label and few-label semantic segmentation tasks. The content of this chapter corresponds to the CVPR 2019 publication Semantic Projection Network for Zero-Label and Few-Label Semantic Segmentation (Xian et al., 2019a). Yongqin Xian and Subhabrata Choudhury were the first co-authors of this paper. Yongqin Xian contributed to the main ideas, the zero-shot image classification experiments, and the writing of the paper. Subhabrata Choudhury implemented the approach and conducted most of the experiments. It is also a collaboration with Yang He.

Chapter 8: Generalized Many-Way Few-Shot Video Classification. In this chapter, we shift from image classification tasks to the video classification task, which predicts the action label of a video, in the context of few-shot learning. We show that a simple linear classifier baseline with 3D CNNs as the backbone surpasses the existing few-shot video classification benchmark. Therefore, we propose a more realistic and challenging evaluation setting, called generalized few-shot video classification, involving more classes. We develop an efficient retrieval-based few-shot learning approach that leverages weakly-labeled videos from a large-scale video dataset. The content of this chapter was still under review for a conference at the time of submitting this thesis. The lead author of this project was Yongqin Xian. This is his internship project, done at Facebook AI together with Lorenzo Torresani, Bruno Korbar and Matthijs Douze.

Chapter 9: Conclusions and future perspectives.
This chapter concludes the thesis by summarizing the contributions and highlighting their current limitations and possible directions to overcome them. We provide an outlook on our ongoing and future work and discuss future directions for the field.

2 RELATED WORK

Contents
  2.1 Zero-shot image classification
    2.1.1 Problem definition
    2.1.2 Evaluation protocol
    2.1.3 A literature review of zero-shot approaches
    2.1.4 Relations to our work
  2.2 Few-shot image classification
    2.2.1 Problem definition
    2.2.2 Evaluation protocols
    2.2.3 A literature review of few-shot approaches
    2.2.4 Relations to our work
  2.3 Zero-shot and few-shot tasks beyond image classification
    2.3.1 Semantic image segmentation
    2.3.2 Video action recognition
    2.3.3 Relations to our work

The field of learning with limited labeled data covers a wide range of topics including semi-supervised learning, unsupervised learning, self-supervised learning, weakly-supervised learning, few-shot learning and zero-shot learning. This thesis mainly focuses on few-shot and zero-shot learning tasks. In this chapter, we formally define the research problems chosen in this thesis. We present the most relevant and recent developments in these fields and relate them to the contributions of this thesis in the conclusion of each section. The following chapters also discuss related work, but targeted to the respective topic of each chapter.

2.1 zero-shot image classification

The ability to predict previously unseen classes, called zero-shot learning, is an extreme case of learning with limited labeled data. In object recognition or image classification, the task of zero-shot learning is to predict the label of an image belonging to one of the novel object classes that do not appear at training time. The only available information on novel classes is the semantic information that describes those classes. Humans are able to recognize unseen objects by combining their prior knowledge with textual descriptions of novel classes. For instance, given an image of a Scarlet Tanager (which we have probably never seen before), we have a good chance of making a correct prediction after reading a textual description of the Scarlet Tanager. Inspired by the human brain, zero-shot object recognition can be addressed by performing multi-modal learning from both image and semantic information. In the following, we first formally define zero-shot learning. Different modalities of data and evaluation protocols are discussed next. We then give an overview of existing zero-shot learning approaches by grouping them. Finally, the relationship between this thesis and existing works is discussed.

2.1.1 Problem definition

Let T = {(x, y) | x ∈ X, y ∈ S} be the training set, where x denotes an image instance from the image space X and y is its class label belonging to one of the seen classes S.
We are interested in predicting a disjoint set of classes U (S ∩ U = ∅), called unseen classes, without any observed examples. Clearly, this task cannot be solved without any information about the unseen classes. Therefore, we additionally assume that some auxiliary information, e.g., a textual description, is provided for each class, i.e., both seen and unseen classes, to allow knowledge transfer from the seen classes to the unseen classes.

2.1.1.1 Image embedding

For a visual recognition task, one of the most important components is to extract features from images. The image feature is a vector in some feature space and should ideally capture discriminative characteristics of an image, e.g., shape, color and texture. The features are then fed into machine learning algorithms to learn classifiers that distinguish between different objects. In this thesis, we refer to the image features as the image embedding. Formally, we define the image embedding of a given image x as φ(x), where φ(·) is a function that maps an image x to a d_x-dimensional feature space. Before the success of deep learning, image features were often manually designed by computer vision researchers, and there have been a lot of studies on how to build robust image features or descriptors manually. Deep learning takes a different perspective and learns the image representation together with the end task from a large amount of training data. Deep image representations quickly revolutionized the field and have become the standard way to extract image features. Next, we briefly review these two groups of image features.

Hand-crafted image representation. Typical hand-crafted image features aggregate image descriptors extracted from local image regions, which are obtained by interest region detection algorithms, e.g., the Harris-affine detector (Mikolajczyk and Schmid, 2004). A simple image descriptor is the histogram of pixel intensities. In order to achieve illumination invariance, Zabih and Woodfill (1994) propose to use histograms of ordering and reciprocal relations between pixel intensities. A more widely used image descriptor is the scale invariant feature transform (SIFT) (Lowe, 1999), which computes a gradient histogram over local regions obtained by a scale invariant region detector. Bay et al. (2008) further propose the speeded up robust features (SURF), which are stronger and faster than SIFT. A comprehensive review of image descriptors can be found in (Mikolajczyk and Schmid, 2005). A popular way to aggregate image descriptors extracted from local image regions is Bag-of-visual-words (BOV), which assigns each descriptor to the closest visual word of a vocabulary obtained by k-means clustering (Arandjelovic and Zisserman, 2013). Sánchez et al. (2013) propose the Fisher Vector, which extends BOV to use a Gaussian mixture model. BOV ignores the spatial relationship between image patches; therefore, Spatial Pyramid Matching (Yang et al., 2009) was proposed to address this issue.

Deep image representation. In contrast to the aforementioned hand-crafted image features that adopt a manually designed extraction pipeline, deep image representations directly learn the image embedding function φ(·) via a deep convolutional neural network (CNN or ConvNet) (LeCun et al., 2015). A simple example of a neural network is the multi-layer perceptron (MLP), which stacks multiple fully connected (FC) layers with a non-linear operation, e.g., ReLU, after each layer.
The FC layers connect each neuron in the current layer to all the neurons in the next layer with different learnable weights. This is obviously prone to overfitting because of the huge number of model parameters. Therefore, a ConvNet regularizes the neural network by considering only local connections of neurons and sharing weight parameters across different local neighborhoods. Such a regularizer can be efficiently implemented by the convolution operation. The first convolutional neural network architecture, called LeNet (LeCun et al., 1989), was introduced by Yann LeCun. A 5-layer LeNet architecture follows CONV-POOL-CONV-POOL-FC-FC, where CONV represents a convolutional layer followed by a non-linear function, POOL is the max pooling that subsamples the feature maps, and FC is a fully connected layer. AlexNet (Krizhevsky et al., 2012) improves LeNet by stacking more CONV layers without pooling and won the ImageNet ILSVRC challenge in 2012. GoogLeNet (Szegedy et al., 2015) introduces the inception module and replaces FC layers with global average pooling, dramatically reducing the number of parameters compared to AlexNet. VGG (Simonyan and Zisserman, 2014b) shows that the depth of the network plays an important role for good performance. A currently popular CNN architecture is ResNet, which introduces skip connections and makes the network as deep as 152 layers. A few extensions of ResNet have also been proposed, such as DenseNet (Huang et al., 2019) and ResNeXt. Recently, Neural Architecture Search (Zoph and Le, 2016), which aims to learn the network architecture automatically, has received increasing attention. CNNs are typically learned with the backpropagation algorithm and a task-specific loss, such as the cross-entropy loss for multi-class image classification. The objective function of learning a CNN is non-convex because of its highly non-linear structure, but empirically, SGD-based algorithms are sufficient for good performance. Theoretical studies about the optimization of CNNs can be found in (Nguyen et al., 2019).

2.1.1.2 Class embedding

Zero-shot image classification is a multi-modal learning problem where image examples of unseen classes are not available and learning of unseen classes relies on another modality of data. This modality often comes from some high-level semantic information such as human-annotated attributes or text descriptions. The semantic information is usually assumed to be at the class level. Therefore, we call it the class embedding. One can consider the class embedding as a prototype that represents the abstraction of a class. The class embedding plays an important role in zero-shot image classification. Good class embeddings should capture visual similarities between classes. One can refer to (Akata et al., 2015b) for a comprehensive evaluation of different class embeddings in zero-shot learning. In this section, we discuss four different class embeddings that are widely used in zero-shot learning.

Attribute. Attributes describe the visual properties of an object, such as "red", "spotted" or "striped". The appearance of an object class can often be represented by combinations of different colors, shapes, and patterns. Therefore, they are useful cues to recognize objects. Most importantly, attributes are shared among objects such that knowledge learned from seen classes can be transferred to unseen classes.
In order to annotate attributes, we first have to define attribute vocabularies that are discriminative enough to distinguish the object classes of interest. For instance, on the Caltech-UCSD Birds-200-2011 dataset (CUB), a vocabulary of 312 binary attributes, e.g., eye color yellow or beak shape sharp, was selected based on an online tool for bird species identification (https://www.whatbird.com/). Each bird image is then annotated with those 312 binary attributes, i.e., annotators on Mechanical Turk check whether each attribute appears in the image or not. Such annotation provides image-level attributes, while the class embedding is defined for each class. Class embeddings are therefore often produced by averaging the image-level attributes of the images belonging to each class.

Word embedding. Attributes provide accurate visual properties of objects, but they require expensive manual annotation. An alternative that avoids annotation is the word embedding, a technique that maps each word of a vocabulary to a vector of real numbers. This mapping can be learned with a neural network in an unsupervised way on a large text corpus, e.g., Wikipedia. Popular word embeddings include word2vec (Mikolov et al., 2013a), glove (Pennington et al., 2014), fasttext (Joulin et al., 2016a), etc. Word2vec is a language model parameterized with a neural network; in its continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. By learning word co-occurrences, the resulting word embedding captures semantic similarities between different words, i.e., word embeddings of semantically related words are close in the embedding space. For zero-shot learning, we employ the word embeddings of class names as their class embeddings. Such a strategy is inexpensive, but word embeddings often lead to poor zero-shot learning results because they often do not reflect visual similarities between classes. Therefore, some works try to inject visual information into word embeddings. Moreover, one word can have multiple meanings, making its word embedding ambiguous; BERT provides a solution by incorporating context into word embeddings.

Class hierarchy. Object categories are naturally organized in a hierarchical structure. For instance, “albatross” and “crow” are subordinates of “bird”, which is in turn a subordinate of “animal”. Such a class hierarchy provides relatedness between object classes as well. WordNet is a database of English words that defines such a hierarchy, in which words are linked by their semantic relationships in a tree structure. Standard neural networks cannot be directly applied to the class hierarchy because the tree structure is not Euclidean data. In order to use the class hierarchy for zero-shot learning, we can either derive a word embedding for each node or directly apply graph convolutions on top of the class hierarchy.

Text description. The word embeddings of class names are often insufficient to describe a class category because they are trained on noisy text corpora. As discussed before, we prefer class embeddings that capture visual similarities between classes. This motivates annotating text descriptions for images: for each image, we can write several sentences describing its visual content, and the class embedding can then be learned from them via a language model, e.g., an LSTM.
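To make the two most common class embeddings above concrete, the following sketch builds a class-attribute embedding by averaging image-level binary annotations, and an annotation-free alternative from pre-trained word vectors of the class names. It is a simplified illustration, not the exact annotation pipeline of CUB; `word_vectors` is a hypothetical word-to-vector lookup (e.g., loaded from word2vec or glove).

```python
import numpy as np

def attribute_class_embeddings(image_attrs: np.ndarray, image_labels: np.ndarray,
                               num_classes: int) -> np.ndarray:
    """Average image-level binary attribute annotations into per-class embeddings.

    image_attrs  : (num_images, num_attributes) 0/1 matrix from annotators
    image_labels : (num_images,) integer class index of each image
    returns      : (num_classes, num_attributes) continuous class-attribute matrix
    """
    dim = image_attrs.shape[1]
    class_emb = np.zeros((num_classes, dim))
    for c in range(num_classes):
        mask = image_labels == c
        if mask.any():
            class_emb[c] = image_attrs[mask].mean(axis=0)
    return class_emb

def word_class_embeddings(class_names, word_vectors: dict, dim: int) -> np.ndarray:
    """Annotation-free alternative: use the word embedding of each class name.

    word_vectors maps a lower-cased word to a pre-trained vector; multi-word
    class names are averaged.
    """
    class_emb = np.zeros((len(class_names), dim))
    for i, name in enumerate(class_names):
        words = [w for w in name.lower().split() if w in word_vectors]
        if words:
            class_emb[i] = np.mean([word_vectors[w] for w in words], axis=0)
    return class_emb
```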
2.1.2 Evaluation protocol

In contrast to supervised image classification, where the model is trained and evaluated on the same label space, zero-shot learning methods are trained and evaluated on disjoint label spaces. Therefore, we first have to define disjoint class sets for training and testing. The data split is usually generated within one dataset, i.e., the classes of a dataset are divided into two disjoint sets: seen classes for training and unseen classes for testing. Next, we produce a training set including images of all the seen classes and a test set including hold-out images of the unseen classes. If we are also interested in seen classes at test time, the test set should additionally include hold-out images of the seen classes. In this section, we only discuss several existing zero-shot learning evaluation protocols at a high level; details of the protocols will be introduced in Chapter 4.

Lampert et al. (2013) introduce the first evaluation protocol for zero-shot image classification. The authors propose a dataset called AWA consisting of 50 classes in total. Those classes are randomly split into 40 seen and 10 unseen classes. A model is trained on images of the seen classes and evaluated on the unseen classes with the top-1 classification accuracy. Rohrbach et al. (2012) define another zero-shot data split on ImageNet, where they split 1000 classes into 800 seen and 200 unseen classes. Elhoseiny et al. (2013) introduce zero-shot splits on the CUB (Welinder et al., 2010) and Oxford Flowers (Nilsback and Zisserman, 2008) datasets: the classes of CUB are randomly split into 160 seen and 40 unseen classes, while Oxford Flowers is divided into 82 seen and 10 unseen classes. Akata et al. (2013) introduce another data split on CUB with 150 seen and 50 unseen classes. Besides, Socher et al. (2013) generate a zero-shot split on CIFAR10. Finally, in the journal extension of their work, Lampert et al. (2013) additionally evaluate on SUN (Xiao et al., 2010) and aPY (Farhadi et al., 2009).

2.1.3 A literature review of zero-shot approaches

Zero-shot learning has attracted increasing attention since the first paper published by Lampert et al. (2013). Given the large number of zero-shot learning publications, it is difficult to discuss all of them. Instead, we summarize popular zero-shot learning approaches published in top conferences and journals by grouping them into five categories: attribute-based methods, compatibility learning, generative models, direct classifier prediction, and transductive zero-shot learning. Chapter 4 of this thesis describes our survey paper on zero-shot learning, where we discuss many zero-shot learning works; this section complements it by providing additional references and more recent papers.

Attribute-based methods. Early works tackle zero-shot learning by first solving an attribute prediction problem; the attribute predictions are then aggregated to make a prediction for the unseen classes. To this end, Lampert et al. (2013) propose the direct attribute prediction and indirect attribute prediction methods. Jayaraman and Grauman (2014) argue that attribute predictions are not always reliable and adopt a random forest to address this issue. Al-Halah et al. (2016) propose to predict the attribute class embedding of unseen classes without manual annotation.
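To make the attribute-based strategy concrete, the sketch below illustrates the general idea in the spirit of direct attribute prediction: one probabilistic classifier is trained per attribute on seen-class images, and an unseen class is scored by how well the predicted attributes match its class-attribute signature. This is a simplified illustration, not the exact formulation of Lampert et al. (2013), and it assumes every attribute has both positive and negative training images.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_classifiers(feats, img_attrs):
    """One binary classifier per attribute, trained on seen-class images.
    feats: (n_images, d_x) image embeddings; img_attrs: (n_images, n_attr) 0/1."""
    clfs = []
    for a in range(img_attrs.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(feats, img_attrs[:, a])   # assumes both labels 0 and 1 occur
        clfs.append(clf)
    return clfs

def predict_unseen(feat, clfs, unseen_class_attrs):
    """Score each unseen class by the probability of its attribute signature.
    unseen_class_attrs: (n_unseen, n_attr) binary class-attribute matrix."""
    p_attr = np.array([clf.predict_proba(feat[None])[0, 1] for clf in clfs])
    # independence assumption over attributes, as in the sketch only
    scores = [np.prod(np.where(sig == 1, p_attr, 1 - p_attr))
              for sig in unseen_class_attrs]
    return int(np.argmax(scores))
```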
Compatibility learning. Instead of learning attribute classifiers, compatibility learning frameworks directly learn a compatibility function that measures the similarity between the two modalities, i.e., the image embedding and the class embedding. Because of its efficiency and flexibility, many recent works follow this direction. ALE (Akata et al., 2013) and CONSE (Norouzi et al., 2014) learn a linear compatibility function with a ranking loss. Similarly, SJE (Akata et al., 2015b) adopts a multi-class max-margin loss. ESZSL (Romera-Paredes et al., 2015) proposes a loss that has a closed-form solution. The semantic autoencoder for zero-shot learning (Kodirov et al., 2017) regularizes the model with an autoencoder loss. Zhang et al. (2017b) argue that the semantic embedding space suffers from the hubness problem and propose to learn a non-linear embedding function that maps the semantic embedding into the image embedding space. Recently, Ji et al. (2018b) propose to learn feature representations with attention conditioned on the semantic embedding. Similarly, Xie et al. (2019) propose to learn attention over local regions for a more generalized representation.

Generative models. The aforementioned methods are discriminative approaches that directly model the posterior probability distribution of labels given the input, i.e., p(y|x). Generative approaches instead model the joint distribution of input and output, i.e., p(x, y). An advantage of generative models is that arbitrarily many samples can be synthesized for unseen classes, addressing the lack of data. Verma and Rai (2017) assume p(x|y) to be a Gaussian distribution. Kumar Verma et al. (2018a) learn to synthesize features of unseen classes via a VAE. Similarly, Zhu et al. (2018a) propose a GAN framework to generate features from noisy text descriptions. Both Schonfeld et al. (2019) and Mishra et al. (2018) learn a VAE to generate features. Felix et al. (2018b) use a cycle-consistency loss to regularize the GANs.

Direct classifier prediction. Instead of synthesizing samples, SYNC (Changpinyo et al., 2016) proposes to directly synthesize the classifier weights of unseen classes. Elhoseiny et al. (2013) take a similar approach with textual descriptions as the class embedding. Changpinyo et al. (2017) apply kernel methods to synthesize the visual prototype of unseen classes. Lei Ba et al. (2015) apply a neural network to predict the classifier weights of unseen classes. Wang et al. (2018a) leverage the class hierarchy and learn to regress the classifier weights of unseen classes with a graph convolutional neural network. Kampffmeyer et al. (2019) extend Wang et al. (2018a) by constructing a better graph.

Transductive zero-shot learning. The conventional zero-shot learning setting is inductive, i.e., images of unseen classes are not available during training. In real-world scenarios, it is possible that unlabeled images from unseen classes are available and we aim to label them. This motivates the transductive learning setting, where labeled images from seen classes and unlabeled images from unseen classes are available. Fu et al. (2014) construct a graph with both labeled and unlabeled images and perform label propagation. Kodirov et al. (2015) leverage the unlabeled data to reduce the domain gap between seen and unseen classes. In order to address the biased prediction towards seen classes, Song et al. (2018) propose to minimize the probability of predicting unseen class images as seen classes. Liu et al.
(2018) introduce a neural network that calibrates the predicted probabilities with unlabeled images from unseen classes.

2.1.4 Relations to our work

In Chapter 3, we introduce a novel compatibility learning framework for zero-shot learning. In contrast to previous works that learn a linear compatibility function, we propose to learn a non-linear function by learning multiple linear transformations, with the selection of which transformation to use being a latent variable. In Chapter 4, we take a step back and analyze the status quo of the area. We find that there exist inconsistent evaluation protocols for zero-shot learning and that some of them are even flawed, leading to incomparable or incorrect results. Therefore, the main purpose of that work is to define a unified evaluation protocol for zero-shot learning and to re-evaluate existing approaches under the same protocol to show the true progress of the field. Our benchmark builds on (Lampert et al., 2013), but we extend its evaluation protocol to cover more datasets and the more realistic generalized zero-shot learning setting, where the model has to predict both seen and unseen classes. Our work is also inspired by prior work that empirically shows the challenges of generalized zero-shot learning. However, the main contribution of our work is not only to advocate generalized zero-shot learning, but also to introduce a unified zero-shot learning benchmark for future research.

In Chapter 5, in order to tackle generalized zero-shot learning, we propose to generate visual features of unseen classes conditioned on class embeddings. Two concurrent works share similar ideas with ours: Bucher et al. (2017) adopt a GMMN (Li et al., 2015) to generate features and Mishra et al. (2018) apply a VAE (Kingma and Welling, 2014). Our work builds on the powerful GAN framework (e.g. Goodfellow et al., 2014; Arjovsky and Bottou, 2017; Arjovsky et al., 2017) and improves it by including a classification loss that encourages the generated features to be better suited for the classification task. In addition, we show that our generated features can be used to improve many popular zero-shot methods, which makes the approach more broadly applicable. A group of follow-up papers improves the feature generation process by regularizing the generators, proposing more sophisticated generative networks, or using different class embeddings.

In Chapter 6, we extend the feature generating networks of Chapter 5 to the any-shot and transductive learning settings. We improve our f-CLSWGAN by combining VAEs (e.g. Kingma and Welling, 2014) and GANs (e.g. Goodfellow et al., 2014; Arjovsky and Bottou, 2017; Arjovsky et al., 2017), leveraging the strengths of adversarial and non-adversarial generative models. In order to learn from unlabeled data, we propose to add an additional discriminator that learns the marginal probability distribution of unseen classes. Previous transductive zero-shot learning approaches (e.g. Fu et al., 2014; Kodirov et al., 2015) often rely on label propagation; our approach instead improves the feature generator by modeling the marginal distribution of unlabeled images. In addition, compared to other feature generating works (e.g.
Kumar Verma et al., 2018a; Zhu et al., 2018a; Schonfeld et al., 2019; Felix et al., 2018b), our proposed framework is more flexible and can be applied to inductive zero-shot learning, where there are no images from unseen classes, transductive zero-shot learning, where unlabeled images from unseen classes are available, and few-shot learning, where there are a few images per unseen class.

2.2 few-shot image classification

In general, few-shot learning aims to learn a model, e.g., a deep neural network, from limited labeled data. Learning a deep neural network from scratch with a small amount of data is not feasible because of the massive number of model parameters. Therefore, the few-shot learning setting assumes the availability of base classes that have enough labeled data. The task then becomes how to learn a model from those base classes such that it generalizes well to novel classes with only a few labeled examples. This is an important problem because the number of labeled examples per category follows a long-tail distribution, i.e., a small number of classes have a lot of data while most classes have limited training data. In this section, we first formally define the few-shot image classification problem and introduce the existing evaluation protocols. We then discuss popular few-shot approaches in Section 2.2.3 and their relations to our proposed approaches.

2.2.1 Problem definition

Let Tb = {(x, y) | x ∈ X, y ∈ Cb} be a labeled training set for base classes, where x denotes an image instance in the RGB image space X and y is its class label belonging to one of the base classes Cb. Each base class has enough training data (typically more than 30 images). We are interested in a disjoint set of classes Cn (Cn ∩ Cb = ∅), called novel classes. Similarly, we define their training set as Tn = {(x, y) | x ∈ X, y ∈ Cn}. In contrast to the base classes, we assume each novel class has only a few training examples (usually fewer than 10 images). Therefore, the training set of the base classes is much larger than that of the novel classes, i.e., |Tb| ≫ |Tn|. Given the training sets Tb and Tn, the task of few-shot learning is to learn a model that generalizes well to the hold-out test set of the novel classes Cn.

2.2.2 Evaluation protocols

In order to evaluate few-shot learning approaches, the first step is to produce a data split that consists of a training set Tb of base classes and a training set Tn of novel classes. However, there exist multiple protocols that define how to evaluate few-shot learning approaches on the novel classes. Most papers focus on the constrained meta-learning setting, while some also follow the low-shot setting, which is more realistic. Here we mainly discuss the three most popular protocols, i.e., the low-shot learning setting, the meta-learning setting and the improved meta-learning setting.

Low-shot learning setting. In this setting, all the novel classes and base classes are evaluated simultaneously. Qi et al. (2018) introduce a data split on CUB where 100 classes are base classes and the remaining 100 classes are novel. For a k-shot learning problem, they randomly draw k samples per novel class to form the training set Tn, where k ∈ {1, 2, 5, 10, 20}. The performance is then evaluated on the hold-out test set of the novel classes. To make it more realistic, they also evaluate on all classes, including both base and novel classes.
In this case, there is a hold-out test set for the base and for the novel classes, and the top-1 image classification accuracy is reported. CUB is a relatively small-scale, fine-grained dataset with only 10K images. To evaluate few-shot approaches in a large-scale setting, Hariharan and Girshick (2017) propose a low-shot data split on ImageNet. The 1000 ImageNet classes are divided into 389 base categories and 611 novel categories. For the purpose of cross-validation, they further construct two disjoint sets of classes by dividing the base categories into two subsets C_b^1 (193 classes) and C_b^2 (196 classes) and the novel categories into C_n^1 (300 classes) and C_n^2 (311 classes). While C_b^1 and C_n^1 are used for tuning hyperparameters, the final results are reported on C_b^2 and C_n^2 for k-shot problems, where k ∈ {1, 2, 5, 10, 20}. Finally, our f-VAEGAN-D2 extends the zero-shot splits into few-shot splits by randomly drawing k examples from each unseen class to form the training set Tn.

Meta-learning setting. The meta-learning setting (e.g. Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017) has gained increasing attention recently. Instead of treating all the novel classes as one big task, this setting generates many small tasks by randomly sampling subsets of the novel classes. More specifically, the evaluation is conducted in an episodic manner, where each episode constructs a k-shot, n-way classification task with a training set Tn and a test set. The final results are obtained by averaging the test accuracy over multiple episodes. Existing papers mainly consider the following four tasks: 1-shot 5-way, 5-shot 5-way, 1-shot 20-way, and 5-shot 20-way. Matching Networks (Vinyals et al., 2016) introduce the meta-learning setting and propose data splits on the Omniglot and miniImageNet datasets.

Improved meta-learning setting. Triantafillou et al. (2019) argue that current meta-learning benchmarks (e.g. Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017) do not have sufficient complexity to assess the few-shot learning process. Therefore, they propose Meta-Dataset, a new large-scale benchmark that is more realistic. Meta-Dataset improves the current meta-learning setting in three aspects: 1) it evaluates cross-dataset generalization with 10 different datasets, 2) it varies the number of classes and examples per class, and 3) it considers the relationships between classes when forming episodes.

2.2.3 A literature review of few-shot approaches

Few-shot learning is challenging because novel classes have limited labeled data. Directly fine-tuning a deep CNN on the novel classes will inevitably lead to overfitting. On the other hand, due to the domain gap between base and novel classes, directly applying a pretrained model suffers from domain shift issues. One group of papers investigates ways to efficiently adapt a model pretrained on the base classes to the novel classes with only a few training examples; in this case, the few-shot learning problem is treated as a transfer learning problem, and this direction is usually evaluated in the low-shot learning setting. In addition, a significant number of papers propose novel training strategies that learn quickly from few labeled examples; in this scenario, the meta-learning setting is adopted to evaluate the performance.

2.2.3.1 Low-shot learning.
Low-shot learning approaches mainly focus on how to adapt a pretrained model to novel classes without fine-tuning the whole deep neural network. Qi et al. (2018) propose to normalize the classifier weights and directly produce the weights of novel classes by averaging their image embeddings. Qiao et al. (2018) learn an MLP that regresses classifier weights from training samples. Wang et al. (2019a) rely on class embeddings to generate task-aware feature embeddings. Chen et al. (2019) aim to reduce intra-class variations by adopting a cosine distance on learned classifier weights. On the other hand, synthesizing data has been a classical way to address the small-data problem, and in the few-shot learning scenario it is natural to investigate how to generate synthetic data for novel classes. Hariharan and Girshick (2017) propose to generate features from a data point and predefined transformations. Wang et al. (2018c) extend this idea by meta-learning the feature generator.

2.2.3.2 Meta-learning approaches.

This field is also called learning to learn. The main idea is to learn a “learning algorithm” that can learn from few examples; one can think of the “learning algorithm” as a function that takes a training set as input and outputs a classifier. These works (e.g. Vinyals et al., 2016; Snell et al., 2017) argue that it is beneficial to mimic the few-shot learning scenario on the base classes. Therefore, the episodic learning scheme is applied to training on the base classes as well. More specifically, in every training episode, a support set defining a k-shot, n-way classification problem and a query set containing test samples of the same n classes are sampled. Multi-class classifiers are constructed from the support set (by the “learning algorithm”) and then evaluated on the query set to compute the loss. Matching networks (Vinyals et al., 2016) meta-learn weighted nearest neighbor classifiers. Prototypical networks (Snell et al., 2017) meta-learn class prototypes and adopt a nearest neighbor classifier as well. Ravi and Larochelle (2016) parameterize the optimization algorithm (SGD) as an LSTM and meta-learn how to optimize the objective function. MAML (Finn et al., 2017) proposes to learn how to initialize the network such that the optimization only takes a few steps. Sung et al. (2018) meta-learn a siamese network that predicts the similarity of two images. Triantafillou et al. (2017) define a training objective that optimizes over all relative orderings of the batch points simultaneously.

2.2.4 Relations to our work

In Chapter 6, we propose a unified feature generation framework that works for both zero-shot and few-shot learning. Although our method shares a similar idea with other feature generation papers (e.g. Hariharan and Girshick, 2017; Wang et al., 2018c), our feature generator is quite different. While the hallucination approaches (e.g. Hariharan and Girshick, 2017; Wang et al., 2018c) only generate features from image data, our approach learns a multi-modal feature generator that synthesizes features from semantic embeddings, which allows better knowledge transfer. In addition, our framework can be applied to the transductive learning setting when unlabeled examples from novel classes are available. Therefore, our method is more versatile.

2.3 zero-shot and few-shot tasks beyond image classification

Most zero-shot and few-shot learning papers focus on the image classification problem.
However, the limitation of labeled data arises in almost all computer vision tasks, e.g. semantic segmentation (e.g. Long et al., 2015; Zhang et al., 2018a; Caesar et al., 2016), object detection (e.g. Girshick, 2015; He et al., 2017; Redmon et al., 2016), video action recognition (e.g. Karpathy et al., 2014; Feichtenhofer et al., 2016b) and 3D vision (e.g. Riegler et al., 2017; Qi et al., 2017). Although those tasks are as important as image classification, they are relatively unexplored in the low-shot regime. While the image classification task is a good starting point for studying zero-shot and few-shot learning, it is not always true that few-shot or zero-shot techniques for image classification can be directly applied to other vision tasks, for instance, semantic segmentation and video classification. 3D reconstruction is naturally a few-shot problem because it is difficult to acquire 3D data; Wallace and Hariharan (2019) propose a method that leverages category-specific priors for the few-shot single-image 3D reconstruction problem. For object detection, Bansal et al. (2018) introduce an approach that can localize novel categories in an image, and Kang et al. (2019) propose a feature reweighting technique to address the few-shot object detection task. This section mainly discusses the applications of zero-shot and few-shot learning in the context of semantic image segmentation and video action recognition.

2.3.1 Semantic image segmentation

In contrast to the image classification task, which predicts a single label for an entire image, the goal of semantic image segmentation is to assign a class label to each pixel of an image. Popular semantic segmentation methods include FCN (Long et al., 2015), DeepLab (Chen et al., 2018) and U-Net (Ronneberger et al., 2015). Learning those models often requires pixel-wise annotations, which are expensive and hard to obtain. In order to reduce the annotation effort, weakly supervised learning with bounding box annotations (Khoreva et al., 2017) has been proposed. We are interested in an orthogonal direction that learns from only a few examples, which avoids collecting and annotating large amounts of data. The main idea is few-shot learning, which aims to generalize to novel classes with only a few examples; the extreme case of few-shot learning is zero-shot learning, where novel classes have no examples at all. In this section, we introduce some papers that tackle few-shot and zero-shot semantic segmentation.

Rakelly et al. (2018) propose a conditional FCN (fully convolutional network) learned by end-to-end optimization. The network takes an annotated support set of images as the condition and performs inference on an unannotated query image. Dong and Xing (2018) propose to learn class prototypes via metric learning. Shaban et al. (2017) introduce a two-branch approach to address one-shot semantic image segmentation: the first branch generates parameters from an image, while the second branch takes both these parameters and a new image as input and produces a segmentation mask of the image for the new class as output. In the extreme zero-shot learning case, there are no training images for novel classes; instead, the models rely on semantic class embeddings to transfer knowledge from base to novel classes. Zhao et al. (2017a) propose to learn a joint embedding function between per-pixel visual features and per-class word2vec embeddings. Bucher et al.
(2019) extend the feature generation idea to semantic image segmentation.

2.3.2 Video action recognition

Video understanding is another important field in computer vision. It is challenging because the model has to learn temporal information in addition to the spatial context. Typical video understanding tasks and applications include video action recognition (e.g. Feichtenhofer et al., 2016b, 2017), video captioning (e.g. Gao et al., 2017), self-driving cars (e.g. Geiger et al., 2012) and robotics (e.g. Kemp et al., 2007). While ResNet (He et al., 2016) has become the widely used image representation network, there is no such “ResNet” in the video domain; representation learning for videos is still an open problem. Similarly, few-shot learning in the context of video understanding remains largely unexplored. In this thesis, we mainly focus on video action recognition, which predicts a single label for a trimmed video. Xu et al. (2015) propose a zero-shot action recognition approach that constructs a mapping from the video feature space to the semantic class embedding space. Zhu and Yang (2018) adopt a memory network that stores multiple prototypes for each class. Cao et al. (2019) propose to learn temporal information by solving a video frame alignment problem.

2.3.3 Relations to our work

In Chapter 7, we introduce a semantic projection network (SPNet) that handles both the zero-label and the few-label semantic segmentation tasks. While Zhao et al. (2017a) propose the open-vocabulary scene parsing task, which segments novel objects by performing hierarchical parsing, we leverage word embeddings to predict the exact unseen classes and address the few-label problem in a unified framework. For few-shot semantic segmentation, previous approaches (e.g. Shaban et al., 2017; Dong and Xing, 2018) follow the meta-learning setup (e.g. Vinyals et al., 2016; Snell et al., 2017), which uses a support set to predict a query image. However, those approaches are restricted to outputting a binary mask and fail to segment images containing multiple classes. In contrast, our approach operates in the more realistic (generalized) few-label semantic segmentation setting, i.e. pixel-level labeling of an image where labels come from both base and novel classes.

In Chapter 8, we propose a strong model based on 3D CNNs for few-shot video action recognition and introduce more challenging evaluation settings for future research. Compared to previous approaches (e.g. Zhu and Yang, 2018; Cao et al., 2019), which extract frame-level features, our model extracts clip-level features via 3D CNNs such that temporal information is better captured. In addition, our evaluation is more challenging and realistic than previous ones: we observe that our model saturates previous evaluation settings and therefore introduce the more challenging many-way few-shot learning and generalized few-shot learning settings for future research.

3 LATENT EMBEDDING FOR ZERO-SHOT IMAGE CLASSIFICATION

Contents: 3.1 Introduction · 3.2 Background: Bilinear Joint Embeddings · 3.3 Latent Embeddings Model (LatEm) (3.3.1 Objective; 3.3.2 Optimization; 3.3.3 Model selection; 3.3.4 Discussion) ·
3.4 Experiments (3.4.1 Zero-shot Learning Experiments; 3.4.2 Generalized Zero-shot Learning Setting) · 3.5 Conclusions

In this chapter, we present an approach for learning a compatibility function between image and class embedding spaces for image classification when labeled training data is scarce. The proposed method augments the state-of-the-art bilinear compatibility methods (e.g. Akata et al., 2015a,b; Frome et al., 2013) by incorporating latent variables. Instead of learning a single bilinear map, our novel latent embedding model learns a collection of bilinear maps, with the selection of which map to use being a latent variable for the current image-class pair. We empirically demonstrate the strength of our model with respect to six state-of-the-art models (e.g. Akata et al., 2015b; Romera-Paredes et al., 2015; Zhang and Saligrama, 2015; Socher et al., 2013; Zhang and Saligrama, 2016) on three challenging datasets, i.e. AWA (Lampert et al., 2013), CUB (Welinder et al., 2010) and Dogs (Khosla et al.), using four different class embeddings. In addition to zero-shot learning experiments, we provide an extensive analysis of our method in the few-shot and generalized zero-shot learning settings.

This chapter takes the first step towards few-shot learning and the more realistic generalized zero-shot learning setting. In Chapter 4, we evaluate the approaches introduced in this chapter as well as other state-of-the-art approaches under the same evaluation protocol. In Chapter 5, we show that feature generation is an effective way to address generalized zero-shot learning. In Chapter 6, we demonstrate that unlabeled data improves feature generation, leading to significantly better any-shot learning performance, i.e., zero-shot and few-shot learning.

3.1 introduction

Humans are highly capable of recognizing novel object categories using some form of external information, without seeing any actual visual example of that category. Enabling computers with this capability has recently been introduced as the zero-shot learning task at the intersection of computer vision and machine learning. Zero-shot learning (e.g. Bart and Ullman, 2005; Palatucci et al., 2009; Lampert et al., 2013; Larochelle et al., 2008; Yu and Aloimonos, 2010) has been formally posed as follows: labeled images are provided for certain visual classes during training, and the task is to learn a model that can make predictions for novel classes at test time. As the training and test class sets are disjoint, i.e., no visual examples are provided for some classes during training, standard supervised image classification frameworks that use class labels cannot be employed. Although no labeled images are available for these classes, a list of attributes (e.g. Ferrari and Zisserman, 2007; Farhadi et al., 2009; Lampert et al., 2013), i.e., a set of easily recognizable properties of objects such as furry or spotted, provides structured relationships between class labels that facilitate the required induction. Substantial progress has been made on the zero-shot learning task (e.g. Duan et al., 2012; Farhadi et al., 2010; Ferrari and Zisserman, 2007; Kankuekul et al., 2012; Lampert et al., 2013; Parikh and Grauman, 2011; Papadopoulos et al., 2014; Akata et al., 2015c).
This progress can be attributed to two recent advances. First, representation learning using deep neural networks (e.g. Krizhevsky et al., 2012; Szegedy et al., 2015) provides image embeddings which perform well across a range of visual classification tasks (e.g. Razavian et al., 2014). Second, multi-modal structured embedding frameworks (e.g. Akata et al., 2015a,c; Frome et al., 2013; Romera-Paredes et al., 2015) provide a means to measure the compatibility between image and class representations. While noting the parallel progress in image representations, i.e. via deep neural networks (He et al., 2016), in this work we focus on improving the compatibility learning framework.

Compatibility learning frameworks (e.g. Akata et al., 2015a,c; Frome et al., 2013; Hastie et al., 2008; Palatucci et al., 2009; Romera-Paredes et al., 2015; Socher et al., 2013; Xian et al., 2016; Fu and Sigal, 2016; Qiao et al., 2016; Akata et al., 2016; Bucher et al., 2016; Mensink et al., 2014; Fu et al., 2015b; Kodirov et al., 2015) are generally based on the idea of representing both the images and the classes in (respective) multi-dimensional vector spaces. Image embeddings are obtained from state-of-the-art image representations, e.g. those from deep convolutional neural networks (e.g. Krizhevsky et al., 2012; Szegedy et al., 2015). Class embeddings can be obtained using manually specified side information, e.g. attributes (Lampert et al., 2013), or extracted automatically from large but unlabeled text corpora (e.g. Mikolov et al., 2013b; Pennington et al., 2014). A compatibility function is then learned with a discriminative objective that decreases the distance, in the embedded space, between images from the same class while increasing that between images from different classes. Once learned, such a compatibility function can be used to predict the class (more precisely, the class embedding) of any given image. The predicted embedding vector might not correspond to a known class label; therefore, in practice, the nearest embedding corresponding to a class label is taken as the class prediction. Advantageously, this can be done for images belonging to both seen and unseen classes, hence enabling zero-shot classification.

Figure 3.1: Compatibility learning frameworks that use a linear projection, e.g. SJE (Akata et al., 2015c) (left), may lead to a large projection error, whereas learning a piecewise linear model (right) leads to more precise projections. Crosses represent image embeddings and their projections onto the class embedding space, W (respectively W1 and W2 for the piecewise linear model) are the parameters of the compatibility function, and solid circles represent the ground-truth class embeddings.

State-of-the-art compatibility learning frameworks for zero-shot learning (e.g.
Akata et al., 2015a,c; Frome et al., 2013; Romera-Paredes et al., 2015) use a linear compatibility function to learn the model. However, learning a linear compatibility function is not sufficient for the challenging fine-grained classification problem. A model that can automatically group objects with similar properties together and then learn different compatibility models, adapted to the different groups, is expected to perform better for fine-grained classification. For instance, separate linear functions can be learned to distinguish blue birds with brown wings from other blue birds with blue wings. With this motivation, we propose a novel model for zero-shot classification which incorporates latent variables to learn a piecewise linear compatibility function between image and class embeddings. The approach is inspired by many recent advances in visual recognition that utilize latent variable models, e.g. object detection (e.g. Felzenszwalb et al., 2010; Hussain and Triggs, 2010), human pose estimation (Yang and Ramanan, 2011) and face detection (Zhu and Ramanan, 2012).

Our contributions are as follows. First, we propose a novel method for zero-shot learning. By incorporating latent variables in the compatibility function, our method achieves a factorization over (possibly complex combinations of) variations in pose, appearance and other factors. Instead of learning a single linear function, we propose to learn a collection of linear models while allowing each image-class pair to choose among them. This effectively makes our model non-linear: in different local regions of the space the decision boundary, while being linear, is different. We use an efficient stochastic gradient descent (SGD) based learning method. Second, we propose a fast and effective method for model selection by successive pruning of an over-complete initialization. We show that such a strategy is competitive with standard cross-validation based model selection, while being much faster to train. Third, we evaluate our novel piecewise linear model in the zero-shot and generalized zero-shot learning settings with various class embeddings (e.g. Mikolov et al., 2013b; Pennington et al., 2014; Miller, 1995) on three challenging datasets, i.e. Caltech-UCSD Birds 200-2011 (CUB) (Welinder et al., 2010), Animals With Attributes (AWA) (Lampert et al., 2013) and Stanford Dogs (Dogs) (Khosla et al.), of which we use the 113-class subset as in (Akata et al., 2015c). We compare our method in all these configurations with several related existing embedding methods. We show that incorporating latent variables in the compatibility learning framework consistently improves the state of the art in the zero-shot learning setting. Fourth, we extensively evaluate our novel piecewise linear model in the zero-shot and generalized zero-shot learning settings with respect to various aspects such as stability, interpretability, and generalizability to seen and unseen classes. We raise awareness of the challenge of transferring information from the zero-shot setting to the full multi-class setting and aim to inspire further research in this direction.

An extensive discussion of related work has been presented in Chapter 2. In Section 3.2 we give details of the bilinear compatibility learning framework that our method is based on.
In Section 3.3 we present our novel Latent Embedding framework, which extends the bilinear compatibility learning framework to nonlinearity by learning several piecewise linear models that each capture a different latent aspect of the data. In Section 3.4 we evaluate our Latent Embedding framework with respect to several criteria, both in the zero-shot and in the generalized zero-shot learning setting. In Section 3.5 we conclude with a discussion and potential future directions.

3.2 background: bilinear joint embeddings

In this section, we describe the bilinear joint embedding framework (e.g. Akata et al., 2015c,a; Weston et al., 2011), on which we build our Latent Embedding Model that will be detailed in Section 3.3. We work in a supervised setting where we are given an annotated training set

T = {(x, y) | x ∈ X ⊂ R^dx, y ∈ Y ⊂ R^dy},   (3.1)

where x is the image embedding defined in an image feature space X, e.g. CNN features (Krizhevsky et al., 2012), and y is the class embedding defined in a label space Y that models the conceptual relationships between classes, e.g. attributes (e.g. Farhadi et al., 2009; Lampert et al., 2013). The goal is to learn a function f : X → Y to predict the correct class for query images. In previous work (e.g. Weston et al., 2011; Akata et al., 2015a,c), this is done by learning a function F : X × Y → R that measures the compatibility between a given input embedding x ∈ X and an output embedding y ∈ Y. The prediction function then chooses the class with the maximum compatibility, i.e.

f(x) = argmax_{y ∈ Y} F(x, y).   (3.2)

In general, the class embeddings reflect the common and distinguishing properties of different classes using side information that is extracted independently of images, e.g. attributes of classes. Using these embeddings, the compatibility can be computed even for unseen classes that have no corresponding images in the training set. Therefore, this framework can be applied to zero-shot learning (e.g. Akata et al., 2015a,c; Palatucci et al., 2009; Romera-Paredes et al., 2015; Socher et al., 2013). In previous work, the compatibility function takes a simple form,

F(x, y) = x^T W y,   (3.3)

with the matrix W ∈ R^{dx×dy} being the parameter to be learned from the training data. Due to the bilinearity of F in x and y, previous works (e.g. Akata et al., 2015a,c; Weston et al., 2011) refer to this model as a bilinear model; however, one can also view it as a linear one since F is linear in the parameter W. In the following, these two terminologies will be used interchangeably depending on the context.

3.3 latent embeddings model (latem)

In general, the linearity of the compatibility function in Equation 3.3 is a limitation, as image classification is usually a complex nonlinear decision problem. Linear decision functions can be extended to nonlinear ones through the use of piecewise linear decision functions. Achieving non-linearity through piecewise linearity has been used successfully in various models for solving computer vision tasks, such as mixtures of templates (Hussain and Triggs, 2010) and deformable parts-based models (Felzenszwalb et al., 2010) for object detection, and mixtures of parts for pose estimation (Yang and Ramanan, 2011) and face detection (Zhu and Ramanan, 2012). The main idea in most of such models, along with modeling parts, is that of incorporating latent variables, e.g.
the different templates in the mixture of templates (Hussain and Triggs, 2010) and the different ‘components’ in the deformable parts model (Felzenszwalb et al., 2010). The model thus becomes a collection of linear models, and each test image picks one of these linear models, with the selection being latent and image specific. Intuitively, this factorizes the decision function into components which focus on distinctive ‘clusters’ in the data, e.g. one component may focus on the profile view while another focuses on the frontal view of the object. Incorporating nonlinearity in this way has been shown (e.g. Felzenszwalb et al., 2010; Hussain and Triggs, 2010; Yang and Ramanan, 2011; Zhu and Ramanan, 2012) to improve performance. In the following subsections, we detail our novel LatEm model, which extends the bilinear joint embedding model to nonlinearity through a piecewise linear formulation. We then discuss our optimization algorithm and model selection, and close with a discussion.

3.3.1 Objective

We propose to construct a nonlinear, albeit piecewise linear, compatibility function. In parallel to the latent SVM formulation, we propose a non-linear compatibility function as follows,

F(x, y) = max_{1≤i≤K} w̃_i^T (x ⊗ y),   (3.4)

where i = 1, . . . , K, with K ≥ 2, indexes the latent choices and w̃_i ∈ R^{dx·dy} are the parameters of the individual linear components of the model. This equation can be reformulated as a mixture of bilinear compatibility functions (Equation 3.3),

F(x, y) = max_{1≤i≤K} x^T W_i y.   (3.5)

Our goal is to learn the set of parameters {W_i} of the above compatibility function that minimizes the empirical risk

(1/|T|) ∑_{n=1}^{|T|} L(x_n, y_n),   (3.6)

where L : X × Y → R is the loss function defined for a particular example (x_n, y_n) as

L(x_n, y_n) = ∑_{y∈Y} [∆(y_n, y) + F(x_n, y) − F(x_n, y_n)]_+ ,   (3.7)

with ∆(y_n, y) being the zero-one loss defined as

∆(y, y_n) = 1 if y ≠ y_n, and 0 otherwise,   (3.8)

and [a]_+ = max(0, a); with this loss, Equation 3.6 upper-bounds the empirical zero-one risk. This ranking-based loss function has been previously used in Akata et al. (2015a); Frome et al. (2013); Weston et al. (2011), such that the model is trained to produce a higher compatibility between a matching image and class embedding than between mismatching ones. Note that for K = 1 our LatEm framework reduces to the bilinear joint embedding framework, as each W_i yields a bilinear compatibility as defined in Equation 3.3, while for K ≥ 2 the full compatibility function becomes nonlinear owing to the max operator.

3.3.2 Optimization

Even though F is convex, we first observe that the ranking loss function L from Equation 3.7 is not jointly convex in all the W_i. Thus, finding a globally optimal solution, which was practical due to convexity in previous linear models (e.g. Akata et al., 2015a,c), is now difficult. To minimize the empirical risk in Equation 3.6, we propose a simple SGD-based method that works in the same fashion as in the convex setting.

Algorithm 1: SGD optimization for LatEm. Input: T = {(x, y) | x ∈ R^dx, y ∈ R^dy}
1: for t = 1 to T do
2:   for n = 1 to |T| do
3:     Draw (x_n, y_n) ∈ T and y ∈ Y \ {y_n}
4:     if F(x_n, y) + 1 > F(x_n, y_n) then
5:       i* ← argmax_{1≤k≤K} x_n^T W_k y
6:       j* ← argmax_{1≤k≤K} x_n^T W_k y_n
7:       if i* = j* then
8:         W_{i*}^{t+1} ← W_{i*}^t − η_t x_n (y − y_n)^T
9:       end if
10:      if i* ≠ j* then
11:        W_{i*}^{t+1} ← W_{i*}^t − η_t x_n y^T
12:        W_{j*}^{t+1} ← W_{j*}^t + η_t x_n y_n^T
13:      end if
14:    end if
15:  end for
16: end for
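For concreteness, the following NumPy sketch mirrors the compatibility function of Equation 3.5 and one update of Algorithm 1. It is a simplified illustration that omits the epoch loop, the learning-rate schedule and the model pruning of Section 3.3.3; the variable names are chosen for readability and are not taken from the original implementation.

```python
import numpy as np

def compatibility(x, y, Ws):
    """LatEm compatibility F(x, y) = max_i x^T W_i y (Equation 3.5)."""
    scores = [x @ W @ y for W in Ws]
    return max(scores), int(np.argmax(scores))

def latem_sgd_step(x_n, y_n, y_wrong, Ws, lr):
    """One update of Algorithm 1 for a sampled pair (x_n, y_n) and a randomly
    drawn wrong class embedding y_wrong; Ws is the list of K matrices W_1..W_K."""
    f_wrong, i_star = compatibility(x_n, y_wrong, Ws)   # step 5
    f_true, j_star = compatibility(x_n, y_n, Ws)        # step 6
    if f_wrong + 1 > f_true:                            # margin violated (step 4)
        if i_star == j_star:                            # same latent matrix (steps 7-8)
            Ws[i_star] -= lr * np.outer(x_n, y_wrong - y_n)
        else:                                           # different matrices (steps 10-12)
            Ws[i_star] -= lr * np.outer(x_n, y_wrong)
            Ws[j_star] += lr * np.outer(x_n, y_n)
    return Ws

# usage sketch: K randomly initialised matrices of size d_x x d_y,
# following the initialisation and CUB learning rate given in Section 3.4
d_x, d_y, K, lr = 1024, 312, 4, 0.1
Ws = [np.random.randn(d_x, d_y) / np.sqrt(d_x) for _ in range(K)]
```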
Our LatEm method, while possibly leading only to local minima, performs well in practice, as shown in section 3.4. The details of the SGD optimization of our LatEm method (Algorithm 1) are as follows. Given a training set T = {(x, y) | x ∈ R^dx, y ∈ R^dy} of image embeddings x and their associated class embeddings y, we loop over all samples for a certain number of epochs T. For each sample (x_n, y_n) in the training set, we randomly select a y that is different from y_n (step 3 of Algorithm 1). If the randomly selected y violates the margin condition (step 4 of Algorithm 1), then we update the W_i matrices following steps 5–13 of Algorithm 1. In particular, we find the W_{i*} that leads to the maximum score for y (step 5) and the W_{j*} that gives the maximum score for y_n (step 6). If the same matrix gives the maximum score, the condition on step 7 of Algorithm 1 is satisfied and we update that matrix (step 8). If two different matrices lead to the maximum scores, which corresponds to the condition formulated on step 10 of Algorithm 1, we update both matrices, i.e. W_{i*} and W_{j*}, using the sub-gradient based updates formulated on steps 11 and 12.

3.3.3 Model selection

The number of matrices K in the model is a free parameter. We use two strategies to select the number of matrices. As the first method, we use a standard cross-validation strategy, i.e. we split the dataset randomly into disjoint parts (in a zero-shot setup) and choose the K with the best cross-validation performance. We denote this strategy as CV in the following sections. While this is a well-established strategy that we find to work well in practice, we also propose a pruning-based strategy which is competitive while being faster to train. In the pruning-based strategy, we start with a relatively large number of matrices and prune them as follows. As the training proceeds, each sampled training example chooses one of the matrices for scoring; we keep track of this information and build a histogram over the matrices, counting how many times each matrix was chosen by a training example. In particular, this is done by increasing the counter for W_{j*} by 1 after step 6 of Algorithm 1. With this information, after every five passes over the training data, we prune out the matrices that were chosen by fewer than 5% of the training examples so far. This is based on the intuition that if a matrix is chosen by only a very small number of examples, it is probably not critical for performance. With this model pruning approach we have to train only one model which adapts itself, instead of training multiple models for cross-validating K and then training a final model (with the full training data) for the chosen K.

3.3.4 Discussion

In the zero-shot learning setting, during training we have a set of seen classes Y_{tr+val} = {y_1, . . . , y_{N1}} and a set of unseen classes Y_{ts} = {y_{N1+1}, . . . , y_{N1+N2}} with Y_{tr+val} ∩ Y_{ts} = ∅. In addition, all the classes are assumed to be embedded into a multidimensional real space that connects them via some form of semantics. For example, each class may be written as a binary vector indicating the presence or absence of predefined attributes, e.g. furry, has tail, can swim. During training we are provided with annotated training images belonging to the classes in Y_{tr+val}, while at test time we are required to make predictions for images belonging to the classes in Y_{ts}.
Zero-shot learning can be achieved with any compatibility learning model, such as the bilinear compatibility based model presented in section 3.2, as there is no class-specific parameter being learned (cf. multi-class SVM models) but only a global parameter W which maps the image embeddings to class embeddings (and vice versa). We build upon the SJE model presented in section 3.2 for the task of zero-shot learning and now discuss the differences between LatEm and SJE to emphasize our technical contributions.

LatEm learns a piecewise linear compatibility function through multiple W_i matrices, whereas SJE (Akata et al., 2015c) is linear. With multiple W_i, the compatibility function has the freedom to treat different types of images differently. Let us consider a fixed class ŷ and two substantially visually different types of images x_1, x_2, e.g. the same bird flying and swimming. In SJE (Akata et al., 2015c) these images will be mapped to the class embedding space with a single mapping, i.e. W^T x_1 and W^T x_2. LatEm, on the other hand, can learn two different matrices for the mapping, i.e. W_1^T x_1 and W_2^T x_2. In the former case, a single W has to map two visually, and hence numerically, very different vectors (close) to the same point. In the latter case, the two different mappings are factorized separately; therefore, the “flying” and the “swimming” bird will be mapped to two separate points. Such factorization is also expected to be advantageous when two classes that share partial visual similarity are to be discriminated. For instance, while blue birds can be relatively easily distinguished from red birds, doing so for different types of blue birds is harder. In such cases, one of the W_i could focus on color while another one could focus on the beak shape (in section 3.4 we show that this effect is visible). The task of discriminating between different bird species would then be handled only by the second one. This way of factorizing enables a more discriminative classification model.

LatEm uses the ranking-based loss (Weston et al., 2011) in Equation 3.7, whereas SJE (Akata et al., 2015c) uses the multiclass loss of Crammer and Singer (Crammer and Singer, 2002), which replaces the sum in Equation 3.7 with a max. The SGD algorithm for the multiclass loss of Crammer and Singer (Crammer and Singer, 2002) requires a full pass over all the classes at each iteration to search for the maximally violating class. Therefore, it can happen that some matrices are not updated frequently. On the other hand, the ranking-based loss in Equation 3.7 used by our LatEm model ensures that the different latent matrices are updated frequently. Thus, the ranking-based loss in Equation 3.7 is better suited to our piecewise linear LatEm model.

3.4 experiments

In this section, we first detail our experimental setup and evaluation procedure, and then report experimental results in the zero-shot and generalized zero-shot learning settings.

Datasets. Caltech-UCSD Birds (CUB) (Welinder et al., 2010) and Stanford Dogs (Dogs) (Khosla et al.) are fine-grained datasets (e.g. Duan et al., 2012; Deng et al., 2013), while Animals With Attributes (AWA) (Lampert et al., 2013) is a coarse-grained dataset. All three datasets have been used for zero-shot learning (e.g. Akata et al., 2015c; Rohrbach et al., 2011; Kankuekul et al., 2012; Yu and Aloimonos, 2010) in the literature. As shown in Table 3.1, the set of classes is divided into three disjoint sets of train (Y_tr), val (Y_v) and test (Y_ts) classes.
For a fair comparison with previous works, we follow the same train, val and test split used by (Akata et al., 2015c). In zero-shot learning, i.e. Y_{tr+v} ∩ Y_{ts} = ∅, to get a more stable estimate of our results, we make four more splits by randomly sampling the same number of classes as before. Unless indicated otherwise, e.g. in comparisons with previous methods, we average results over five splits. We account for the imbalance in the number of images in the AWA and Dogs datasets and measure the per-class averaged top-1 accuracy, unless stated otherwise.

In the generalized zero-shot learning setting, as shown in Table 3.2, the set of images that belong to Y_{tr+v} and Y_{ts} is first divided equally into tr+v and ts sets. Namely, following the same seen (Y_{tr+v}) and unseen (Y_{ts}) class split as in the zero-shot learning setting, we build tr+v and ts sets of images that belong to the seen and unseen classes. This way we can evaluate our model on images that belong to only the ts classes or to both the tr+v and ts classes.

Table 3.1: The statistics of the CUB, AWA and Dogs datasets in the zero-shot setting. CUB and Dogs are fine-grained datasets whereas AWA is a more general concept dataset. Y_{tr+v} and Y_{ts} are the seen and unseen classes, respectively.

| Dataset | Total: img | Total: Y | train+val: img | Y_tr | Y_v | test: img | Y_ts |
| CUB     | 11788      | 200      | 8855           | 100  | 50  | 2931      | 50   |
| AWA     | 30475      | 50       | 24293          | 30   | 10  | 6180      | 10   |
| Dogs    | 19499      | 113      | 14681          | 57   | 28  | 4818      | 28   |

Table 3.2: The statistics of the CUB, AWA and Dogs datasets in the generalized zero-shot learning setting.

| Dataset | seen: tr+v img | seen: ts img | Y_{tr+v} | unseen: tr+v img | unseen: ts img | Y_{ts} |
| CUB     | 4495           | 4360         | 150      | 1499             | 1434           | 50     |
| AWA     | 12176          | 12119        | 40       | 3062             | 3118           | 10     |
| Dogs    | 7317           | 7364         | 85       | 2433             | 2385           | 28     |

Image and class embeddings. For a direct comparison with the state of the art, we use the embeddings provided by (Akata et al., 2015c). Briefly, as image embeddings we use the 1024-dimensional top-layer pooling units of a pre-trained GoogLeNet (Szegedy et al., 2015) extracted from the whole image. We do not do any task-specific pre-processing on images, such as cropping foreground objects. As class embeddings we evaluate four different alternatives, i.e. attributes (att) (Lampert et al., 2013), word2vec (w2v) (Mikolov et al., 2013b), glove (glo) (Pennington et al., 2014) and hierarchies (hie) (Miller, 1995). Note that CUB contains 312 and AWA contains 85 attributes. Our att embedding for a class is a vector measuring the strength of each attribute for that class, based on human judgment. On the other hand, w2v and glo are 400-dimensional whereas hie is approximately 200-dimensional.

Implementation details. Our image features are z-score normalized such that each dimension has zero mean and unit variance. All class embeddings are ℓ2-normalized. The matrices W_i are initialized at random with zero mean and standard deviation 1/√dx (Akata et al., 2015a). The number of epochs is fixed to 150. The learning rates for the CUB, AWA and Dogs datasets are chosen as η_t = 0.1, 0.001 and 0.01, respectively, and kept constant over iterations. For each dataset, these parameters are tuned on the validation set of the default dataset split and kept constant for all other dataset splits and for all class embeddings. We use two strategies for selecting the number of latent matrices K, i.e. either cross-validation or pruning.
We use two strategies for selecting the number of latent matrices K, i.e. either cross-validation or pruning. For cross-validation, K is varied in {2, 4, 6, 8, 10} and the optimal K is chosen based on the accuracy on a validation set. For pruning, unless stated otherwise, K is initially set to 16 and then, at every fifth epoch during training, we prune all matrices that support less than 5% of the data points.

3.4.1 Zero-shot Learning Experiments

In this section, we provide results in the zero-shot learning setting where Ytr ∩ Yv ∩ Yts = ∅. In this setting, at training time, LatEm has access to labeled images of Ytr+v and the search space at test time is Yts. We either use the splits provided by (Akata et al., 2015c) or report the average performance over five splits to show stability. We specify the splits we used for each experiment in their respective sections.

Comparison with State-of-the-Art. We start our experimental evaluation with an analysis of (Lampert et al., 2013) and quantitative comparisons with ESZSL (Romera-Paredes et al., 2015), CMT (Socher et al., 2013), SSE (Zhang and Saligrama, 2015), JLSE (Zhang and Saligrama, 2016), and SJE (Akata et al., 2015c), which are among the most relevant related works to ours. Note that we fairly re-evaluate all seven state-of-the-art methods using the same four class embeddings, the same image embeddings and the same evaluation criteria on three challenging zero-shot learning datasets. Therefore, ours is one of the most comprehensive re-evaluations of the zero-shot state-of-the-art.

                 CUB                           AWA                           Dogs
                 att    w2v    glo    hie      att    w2v    glo    hie      w2v    glo    hie
ESZSL            30.5   23.7   7.1    2.1      65.3   29.3   38.4   52.2     10.0   6.5    21.3
ESZSL*           47.1   33.7   33.3   23.2     68.8   57.4   61.7   55.1     21.6   20.0   22.1
CMT              29.4   24.8   25.8   17.9     54.9   46.6   47.6   40.1     13.7   16.7   14.8
SSE              42.1   28.4   24.9   21.4     64.8   60.4   65.8   55.8     20.5   18.9   29.9
JLSE             37.6   28.4   29.9   20.3     67.5   49.7   56.4   39.3     26.2   16.4   23.7
SJE              50.1   28.4   24.2   20.6     66.7   51.2   58.8   51.2     19.6   17.8   24.3
LatEm (Ours)     45.5   31.8   32.5   24.2     71.9   61.1   62.9   57.5     22.6   20.9   25.2

Table 3.3: Average per-class top-1 accuracy in the zero-shot setting on AWA, CUB and Dogs datasets. We compare ESZSL (Romera-Paredes et al., 2015), ESZSL* (Romera-Paredes et al., 2015), CMT (Socher et al., 2013), SSE (Zhang and Saligrama, 2015), JLSE (Zhang and Saligrama, 2016), SJE (Akata et al., 2015c) and our Latent Embedding model (K is cross-validated) using the same splits, image and class embeddings as in (Akata et al., 2015c).

Among the competing state-of-the-art methods, (Lampert et al., 2013) proposes a two-step method that follows a different principle than ours: (1) learning attribute classifiers and (2) combining the scores of these attribute classifiers to make a class prediction. Typically, the positive/negative samples used to train the attribute classifiers are obtained by binarizing the class-attribute matrix w.r.t. a threshold, which leads to a loss of information. As it is not clear how to extend this idea to unsupervised class embeddings, we compare (Lampert et al., 2013) and LatEm using attributes on AWA, where (Lampert et al., 2013) obtains 56.2% whereas LatEm obtains 71.9% accuracy; the gap is mostly due to the binarization of attributes. On the other hand, we emphasize that we focus on unsupervised class embeddings that do not require human supervision.

        CUB          AWA          Dogs
        PR    CV     PR    CV     PR    CV
att     3     4      7     2      n/a
w2v     8     10     8     4      6     8
glo     6     10     7     6      9     4
hie     8     2      7     2      11    10

Table 3.4: Number of matrices selected using pruning (PR) and using cross-validation (CV). PR is obtained with K0 = 16.
Additionally, we re-implemented (Romera-Paredes et al., 2015) following the paper because their method is embarrassingly simple. (Romera-Paredes et al., 2015) define a binary matrix Y of size m × z to denote the ground-truth labels of m training instances belonging to any of the z classes. The scale of this matrix is given as Y ∈ {−1, 1}^{m×z} in (Romera-Paredes et al., 2015), but it is effectively a parameter to tune. Therefore, we also validate our results with Y ∈ {0, 1}^{m×z}. We denote the experiment that uses Y ∈ {0, 1}^{m×z} as (Romera-Paredes et al., 2015)*. For our experiments, we obtained the code from the authors of (Socher et al., 2013), (Zhang and Saligrama, 2015) and (Zhang and Saligrama, 2016), and we use the publicly available implementation of SJE (Akata et al., 2015c). We ran the experiments using our image and class embeddings by carefully validating all the parameters of all the methods on the validation set.

We present results in Table 3.3. Our LatEm consistently outperforms (Socher et al., 2013) and (Romera-Paredes et al., 2015) on all three datasets for all four class embeddings. We observe a significant increase in accuracy from ESZSL (Romera-Paredes et al., 2015) to ESZSL* (Romera-Paredes et al., 2015) in all cases. However, even with Y ∈ {0, 1}^{m×z}, our LatEm still outperforms ESZSL* (Romera-Paredes et al., 2015) in 8 out of 11 cases. On the other hand, our LatEm outperforms (Zhang and Saligrama, 2015) in 9 out of 11 cases and (Zhang and Saligrama, 2016) in 10 out of 11 cases. For (Zhang and Saligrama, 2015), λ1, λ2 and γ are the three regularization parameters, and the number of iterations and the number of sample pairs are also hyperparameters to tune, whereas (Zhang and Saligrama, 2016) requires the regularization λs, the dictionary size, the number of sample pairs and the number of iterations to be tuned. Note that, apart from doing an extensive parameter validation, we used exactly the same SVM solver and quadratic programming solver as (Zhang and Saligrama, 2015) and (Zhang and Saligrama, 2016) to obtain the results in Table 3.3.

Being a competitive state-of-the-art and the closest work related to ours, we now provide a detailed comparison between SJE (Akata et al., 2015c) and our LatEm. Using att, LatEm improves over SJE on AWA (71.9% vs. 66.7%) significantly. However, as our aim is to reduce the accuracy gap between supervised and unsupervised class embeddings, we focus on the w2v, glo and hie embeddings. Here, on all datasets, LatEm improves over SJE (Akata et al., 2015c) (section 3.2) significantly. With w2v, LatEm achieves 31.8% (vs. 28.4%) on CUB, 61.1% (vs. 51.2%) on AWA and finally 22.6% (vs. 19.6%) accuracy on Dogs. Similarly, using glo, LatEm achieves 32.5% (vs. 24.2%) on CUB, 62.9% (vs. 58.8%) on AWA and 20.9% (vs. 17.8%) accuracy on Dogs. Finally, while LatEm with hie on Dogs improves the result to 25.2% from 24.3%, the improvement is more significant on CUB (24.2% from 20.6%) and on AWA (57.5% from 51.2%).

combination               CUB              AWA              Dogs
(att/w2v/glo/hie)         SJE    LatEm     SJE    LatEm     SJE    LatEm
cnc   X X X               45.1   42.0      71.3   64.5      n/a    n/a
cmb                       51.0   46.2      73.5   73.6      n/a    n/a
cnc   X X X               42.2   39.7      73.3   70.7      n/a    n/a
cmb                       51.7   46.6      73.9   75.7      n/a    n/a
cnc   X X                 28.2   30.7      53.9   59.7      23.5   30.0
cmb                       29.4   33.2      55.5   62.2      26.6   33.8
cnc   X X                 28.5   31.3      60.1   71.1      23.5   25.9
cmb                       29.9   32.6      59.5   64.8      26.7   26.8

Table 3.5: Class embeddings combined as in (Akata et al., 2015c) (cnc: early fusion of class embeddings, cmb: late fusion of scores). Checkmarks (X) indicate which of the att, w2v, glo and hie class embeddings are combined in each row pair.
These results place our LatEm in the context of the most recent and relevant methods and establish it as another competitive state-of-the-art method for zero-shot learning on three datasets. The results are encouraging, as they quantitatively show that learning piecewise linear latent embeddings indeed captures latent semantics in the class embedding space. Here, we emphasize two disadvantages of attributes. First, since fine-grained object classes share many common properties, we need a large number of attributes, which is costly to obtain. Second, attribute annotations need to be done on a per-dataset basis, i.e. the attributes collected for birds do not work for dogs. Therefore, we stress the importance of the unsupervised class embeddings, i.e. w2v, glo and hie.

Pruning versus cross-validation for model selection. Our aim is to determine whether our LatEm selects a different number of models through pruning and through cross-validation. Pruning (PR) selects matrices based on the data itself, whereas cross-validation (CV) selects the number of matrices necessary to obtain the highest accuracy on the validation set. Table 3.4 presents the results of this experiment on the splits provided by (Akata et al., 2015c). We set the initial number of embeddings K0 to 16 and the pruning threshold to 1/K0, which assumes that samples are equally distributed across the embeddings. In terms of model size, cross-validation seems to have a slight advantage. It selects a smaller model in 7 out of 11 cases, which is more space and time efficient. The trend is consistent for all the class embeddings on the AWA dataset but is mixed for CUB and Dogs. The advantage of pruning over cross-validation is that it is much faster to train. While cross-validation requires training and testing multiple models (once for every possible choice of K), pruning requires training only once. We measure the sensitivity to K0 and the corresponding pruning thresholds by setting K0 = [10, 12, 14, 16, 18, 20, 22] and th = 1/10, 1/12, 1/14, . . . , 1/22. The mean accuracies with standard deviations with att, w2v, glo and hie on CUB are 44.9% (0.6), 32.4% (0.7), 31.6% (1.3) and 22.8% (0.9), which shows that the results we reported with K0 = 16 are stable.

Combination of class embeddings. Here, we provide results in direct comparison with (Akata et al., 2015c), where class embeddings are combined using two strategies: (1) through early fusion (cnc), i.e. concatenating class embeddings, and (2) through late fusion (cmb) of compatibility scores, i.e. averaging the scores obtained with different class embeddings. We use the same combination of class embeddings, image features and zero-shot splits as (Akata et al., 2015c) for a fair comparison. The results are presented in Table 3.5. First, we combine att with w2v, glo and hie for AWA and CUB. LatEm improves the results over SJE significantly on AWA (75.7% vs. 73.9%). On the other hand, LatEm does not improve over the state-of-the-art (46.6% vs. 51.7%) on CUB. This observation is in line with the results reported in Table 3.3, where LatEm does not provide a significant advantage over SJE on CUB with human-annotated attributes. Second, we combine the unsupervised class embeddings w2v, glo and hie. LatEm consistently improves over SJE in this setting. On CUB, combining w2v, glo and hie achieves 34.9% (vs. 29.9%), on AWA it achieves 66.2% (vs. 60.1%) and on Dogs it obtains 36.3% (vs. 35.1%).
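A minimal sketch of the two fusion strategies, assuming one trained compatibility model per class embedding for cmb and a single model trained on concatenated embeddings for cnc; the model interface (a score(theta_x, phi_y) method) and all variable names are hypothetical.

```python
import numpy as np

def early_fusion(class_embeddings):
    """cnc: concatenate the per-class embeddings (e.g. w2v, glo, hie), each of
    shape (num_classes, d_k), into a single class embedding matrix."""
    return np.concatenate(class_embeddings, axis=1)

def late_fusion_predict(theta_x, models, class_embeddings):
    """cmb: average the compatibility scores F(x, y) of models trained on the
    individual class embeddings, then predict the best scoring class."""
    scores = np.zeros(class_embeddings[0].shape[0])
    for model, E in zip(models, class_embeddings):
        scores += np.array([model.score(theta_x, phi_y) for phi_y in E])
    return int(np.argmax(scores / len(models)))
```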
These experiments show that unsupervised class embeddings contain complementary information and, therefore, the results tend to improve by combining them. Another observation is that late fusion of classification scores, i.e. cmb, leads to higher accuracy compared to early fusion of class embeddings, i.e. cnc. In cnc, a single Wi, learned with all the class embeddings concatenated together, fails to address the confusion that is introduced by each class embedding. On the other hand, in cmb, each Wi prefers to assign a different class label to an image based on the score, i.e. F(x, y). This way, different Wi's that are learned with different but complementary class embeddings get weighted accordingly and, hence, class labels are more accurate. Finally, on CUB and Dogs, by combining w2v and hie we obtain better results than by combining glo and hie. This is due to the fact that glo uses only class-relevant articles while w2v uses the entire Wikipedia. We conclude that Wikipedia articles that are not directly related to our classes, i.e. the context, lead to more descriptive class embeddings individually (see the w2v results in Table 3.3) and in combination as well (see the results in Table 3.5 that include w2v).

Stability of zero-shot learning results. As during training time in zero-shot learning neither images nor class relationships of test classes are seen, methods suffer from the difficulty of parameter selection. The standard way is to use disjoint train, val and test classes. In addition to the standard splits, we experimented on four more independently and randomly chosen data splits to get stable estimates of our predictions. Both with our LatEm and the publicly available implementation of SJE (Akata et al., 2015c) we repeat these experiments five times and report the average.

        CUB              AWA              Dogs
        SJE    LatEm     SJE    LatEm     SJE    LatEm
att     49.5   45.6      70.7   72.5      n/a
w2v     27.7   33.1      49.3   52.3      23.0   24.5
glo     24.8   30.7      50.1   50.7      14.8   20.2
hie     21.4   23.7      43.4   46.2      24.6   25.6

Table 3.6: Average per-class top-1 accuracy on unseen classes (the results are averaged over five folds). SJE: (Akata et al., 2015c), LatEm: Latent embedding model (K is cross-validated).

        CUB              AWA              Dogs
        PR     CV        PR     CV        PR     CV
att     43.8   45.6      63.2   72.5      n/a
w2v     33.9   33.1      48.9   52.3      25.0   24.5
glo     31.5   30.7      51.6   50.7      18.8   20.2
hie     23.8   23.7      45.5   46.2      25.2   25.6

Table 3.7: Average per-class top-1 accuracy on unseen classes (averaged over the five zero-shot splits that we used in the stability experiments). PR: proposed model learnt with pruning using K0 = 16, CV: with cross-validation.

For all datasets, Table 3.6 shows that all the result comparisons between SJE and LatEm hold and therefore the conclusions are the same. Although SJE outperforms LatEm with supervised attributes on CUB, LatEm outperforms the SJE results with supervised attributes on AWA and consistently outperforms all the SJE results obtained with unsupervised class embeddings. Using attributes, on AWA LatEm obtains an impressive 72.5% (vs. 70.7%), and using unsupervised class embeddings the highest accuracy is observed with w2v with 52.3% (vs. 49.3%). On CUB, LatEm with w2v obtains the highest accuracy with 33.1% (vs. 27.7%). On Dogs, LatEm with hie obtains the highest accuracy, i.e. 25.6% (vs. 24.6%). These results confirm that our accuracy improvements reported in Table 3.3 were not due to a bias in the dataset split. By augmenting the datasets with four more splits, our LatEm obtains a
consistent improvement on all the class embeddings and all datasets over the state-of-the-art. On the other hand, these results helped us notice one crucial difference between doing zero-shot learning on fine-grained and on coarse-grained datasets. The original split of AWA (Lampert et al., 2013), which is widely used in the literature, has been constructed such that the seen and unseen class splits contain visually similar classes, e.g. while gorilla is among the seen classes, chimpanzee is among the unseen classes. This ensures that by using gorilla images, the methods will generalize to images of the visually similar chimpanzee class whose images were not seen during training. When we build another split that places both the gorilla and chimpanzee classes in the unseen/test set, there is no means of distinguishing these objects, as there is no visually closely similar class left in the seen/train set. We observe a significant drop in accuracy for the weaker unsupervised class embeddings on AWA when we randomly select the class splits, as given in Table 3.6, in addition to the original split (Lampert et al., 2013). However, this drop affects our LatEm as well as the state-of-the-art SJE method. Our conclusion from this observation is that the zero-shot learning setting may be better suited for fine-grained classification tasks.

We also evaluate the accuracy of LatEm when the number of matrices in the model is obtained with pruning versus when it is obtained with cross-validation. Table 3.7 presents the performance of LatEm when the model selection is done by pruning (PR) or by cross-validation (CV) on the three datasets. In terms of performance, both methods are equally competitive. Pruning outperforms cross-validation in five cases and is outperformed in the remaining six cases. The performance gaps are usually within 1-2% absolute, with the exception of the AWA dataset with att and w2v (72.5% vs. 70.7% and 52.3% vs. 49.3% for CV and PR, respectively). Hence, neither of the methods has a clear advantage in terms of performance; cross-validation in general performs slightly better, while pruning is faster to train.

Effect of K. In this section, we investigate the experiments performed using five folds on the CUB, AWA and Dogs datasets and provide further analysis for a varying number of latent matrices K. For completeness of the analysis, we also include the single latent embedding case and evaluate K ∈ {1, 2, 4, 6, 8, 10} using the unsupervised embeddings, i.e. w2v, glo and hie, for consistency.

Figure 3.2: Effect of the latent variable K on the CUB, AWA and Dogs datasets (one panel per dataset). We measure Top-1 accuracy (in %) with an increasing number of latent models, i.e. K, learned with the unsupervised class embeddings, i.e. w2v, glo, hie.
In Figure 3.2 we present the performance of the model with a different number of matrices on the CUB, AWA and Dogs datasets. For CUB, we observe that the performance generally increases with increasing K initially, and then the patterns differ for different embeddings. With w2v the performance keeps increasing until K = 6 and then starts decreasing, probably due to model overfitting. With glo the performance increases until K = 10, where the final accuracy is ≈ 5% higher than with K = 1. With the hie embedding the standard errors do not increase significantly in any of the cases, the results are similar for all values of K, and there is no clear trend in the performance. For AWA, although the glo results decrease with increasing K, the w2v and hie results do not vary significantly, peaking at K = 10 and K = 4, respectively. For Dogs, this time the w2v results decrease slightly with increasing K for K > 2. On this dataset, K = 2, 8, 10 seem to be the best options for w2v, hie and glo, respectively.

Interpretability of latent embeddings. As we demonstrated previously, our novel LatEm model improves over the state-of-the-art SJE model for zero-shot classification on two fine-grained datasets, i.e. CUB and Dogs, and one coarse-grained dataset, i.e. AWA. In this section, we take a closer look at the results on the challenging CUB dataset and investigate whether the individual Wi's learn visually consistent and interpretable latent relationships between images and classes.

Figure 3.3: Top images ranked by the matrices using word2vec, glove, hierarchy and attribute class embeddings on the CUB dataset; each row corresponds to a different matrix in the model (Word2vec: long and pointy beak; brown head, light breast, small bird; completely black plumage. Glove: small bird with mostly yellow plumage; sea bird with red eyes; blue plumage with brown wings. Hierarchy: small bird with yellow belly; pointy beak, spotted and climbing tree trunks; sea bird with curved beak. Attribute: black wings; pattern on head-eye region; black region on the head). Qualitative examples support our intuition – each latent variable captures certain visual aspects of the bird. Note that, while the images may not belong to the same fine-grained class, they share common visual properties.

Figure 3.3 shows the top scoring images retrieved by three different Wi for w2v, glo, hie and att. For w2v, the images in the first row are of birds which have long and pointy beaks. Note that they belong to different classes; having a long and pointy beak is one of the shared aspects of those different bird species. Similarly, the images in the second row are of small birds with a brown head and light-colored breast, and the last row contains large birds with completely black plumage.
These results are interesting because they show that our LatEm is able to (i) infer hidden common properties of classes and (ii) support them with visual evidence, leading to a clustering which is optimized for classification and also performs well in retrieval. For glo, similar to the results with w2v, the top-scoring images of the same Wi consistently show distinguishing visual properties of classes. The first row shows that blue birds from different species are clustered together, which indicates that this matrix captures the “blue”ness of the birds. The second row has exclusively aquatic birds, i.e. birds surrounded by water. Finally, the third row shows yellow birds only. Similar to w2v, for glo our LatEm is able to bring out the latent information that reflects object attributes and support this with its visual counterpart.

For completeness, we also include qualitative results with the hie and att class embeddings. The first row with hie shows small yellow birds with a yellow belly, the second row shows different species of birds with a pointy beak climbing on tree trunks, and the third row shows sea birds with curved beaks. Similarly, the first row with att shows different birds with the common property of having “black wings”, the second row shows a distinctive pattern on the head region, and the third row shows birds with different amounts of blackness on their heads. These results clearly demonstrate that our model factorizes the space with visually interpretable relations between classes, also with hie and att.

3.4.2 Generalized Zero-shot Learning Setting

Most existing works on zero-shot learning assume that all the images are from unseen classes during the test phase, which simplifies the problem as the classifiers only need to distinguish between unseen classes.

Figure 3.4: Left: Confusion matrix of all the classes on the AWA dataset based on the latent factors learned using LatEm in the general setting (we use glo as the class embedding). The 10 unseen classes are shown at the top of the confusion matrix. Right: t-SNE visualization of the confusion matrix with seen and unseen classes marked in blue and red respectively. Visually similar classes such as chimpanzee and gorilla are embedded close to each other, hence being confused by the classifier.
In this section, we evaluate our LatEm in a more challenging yet realistic setting, where the prediction function is:

f(x) = argmax_{y ∈ Yu ∪ Ys} F(x, y).    (3.9)

As shown in Equation 3.9, in the generalized zero-shot learning setting (e.g. Socher et al., 2013) the search space includes all the class embeddings both at training time and at test time. Similar to the zero-shot learning setting, the extreme case of the generalized zero-shot learning setting assumes no availability of visual samples from test classes during training. As we do not have access to any images of Yts during training, the class embeddings of Yts do not get coupled with any visual information and hence act only as distractors. In the following sections, we first evaluate the extreme case of the generalized zero-shot learning setting, i.e. when we have no visual samples from test classes during training, and then we gradually increase the number of images from Yts during training.

No samples from Yts during training. In this setting, although we do not have access to any visual samples from test classes during training, our scoring function takes a max over all the available class embeddings. As the class embeddings of test classes never get any supervision signal, they act as distractors. We present results obtained in this setting on CUB, AWA and Dogs using all four class embeddings in Table 3.8.

        CUB                   AWA                   Dogs
        T1     T5     T10     T1     T5     T10     T1     T5     T10
att     12.4   46.8   67.4    4.8    65.6   90.6    n/a
w2v     0.7    29.2   46.3    0.0    31.2   63.5    0.0    6.6    20.5
glo     0.5    26.0   40.5    0.0    36.1   66.2    0.0    6.1    21.3
hie     0.0    19.7   36.3    0.0    40.0   62.1    1.3    15.0   31.8

Table 3.8: Average per-class top-1, 5 and 10 accuracy, i.e. T1, T5 and T10 respectively, in the generalized zero-shot learning setting when we have no samples from Yts during training, while the search space during testing includes all the available labels, i.e. Y = Ytr ∪ Yv ∪ Yts.

Figure 3.5: Generalized zero- and few-shot learning settings evaluated on CUB, AWA and Dogs using the att (where available), w2v, glo and hie embeddings. We show the Top-1, Top-5 and Top-10 accuracy (in %) with an increasing number of images per unseen class used during training.

Our observation from Table 3.8 is that with Top-1 accuracy LatEm gives poor results even with expert-annotated attributes. Note that a similar behavior was observed in (Rohrbach et al., 2011, 2013; Socher et al., 2013). These results show that evaluating the model on both seen and unseen classes is a harder problem and it requires more attention. Although solving this problem is out of the scope of this chapter, we provide further analysis on understanding the problem itself. Our hypothesis is that classes that are similar in context, e.g. chimpanzee and gorilla, are separated into different sets in terms of seen and unseen classes. To evaluate this hypothesis, after learning the LatEm model on AWA using the glo embedding, we build a confusion matrix of the test images that belong to both seen and unseen classes.
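The prediction rule of Equation 3.9 and the bookkeeping behind this analysis can be sketched as follows, assuming precomputed compatibility scores over the union of seen and unseen classes; the function and variable names are illustrative.

```python
import numpy as np

def gzsl_predict(scores):
    """scores: (num_images, num_classes) compatibility values F(x, y) computed over
    *all* seen and unseen classes; Equation 3.9 takes the argmax over this union."""
    return scores.argmax(axis=1)

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```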
Figure 3.4 plots the confusion matrix and a t-SNE (van der Maaten and Hinton, 2008) visualization of the confusion matrix. We observe that the classifier is indeed able to embed images of chimpanzees close to the chimpanzee and gorilla classes. However, without having seen sufficient examples of the unseen class chimpanzee, it is not able to distinguish between a chimpanzee and a gorilla. The same phenomenon is observed for other visually similar class pairs, e.g. blue whale and humpback whale, polar bear and giant panda, mouse and rat, which are visually similar animals belonging to seen and unseen classes respectively.

Following this analysis, we argue that in the presence of seen and unseen classes for testing, evaluating Top-5 or Top-10 accuracy may be a more suitable way to measure performance. Indeed, Top-5 accuracy has been the evaluation criterion of the image classification challenge (Berg et al.) of ImageNet (Deng et al., 2009). We present results with Top-5 accuracy in Table 3.8. Our immediate observation is that for all datasets the results improve by 6 to 40% compared to the results with Top-1 accuracy. This shows that 6–40% of the time, the images of unseen classes are incorrectly assigned to the second closest class among the seen classes, e.g. chimpanzee versus gorilla, or vice versa. This outcome follows our intuition that LatEm confuses two similar classes especially when they belong to disjoint sets of seen and unseen classes. Finally, our results with Top-10 accuracy show a similar tendency to the difference between Top-5 and Top-1 accuracy. We observe another accuracy increase of 15 to 30% compared to the Top-5 accuracy, depending on the dataset and class embedding. Moreover, as expected, the Top-10 accuracy results are higher than the Top-1 and Top-5 accuracy, while the relative difference between different class embeddings remains similar in all cases.

We also observe from these results that supervised attributes remain important given the lack of training data in the extreme case. On CUB and AWA, the top-10 accuracy obtained with the unsupervised class embeddings extracted from Wikipedia, i.e. w2v and glo, is similar to the top-5 accuracy obtained with attribute class embeddings. On the other hand, the human supervision signal that comes from attributes leads to an accuracy boost of almost 30% when we measure top-5 or top-10 accuracy. Finally, on Dogs, the hie class embeddings perform better than w2v and glo, which are extracted from Wikipedia. It is interesting to note that this observation is unique to this dataset and is in line with our observations in the classic zero-shot learning setting. This shows that finding the most suitable class embedding is an important aspect of tackling the zero-shot learning task.

Generalized zero-shot to generalized few-shots setting. As shown in the previous section, the presence of all class embeddings, i.e. the generalized zero-shot setting, in its extreme case, i.e. no visual samples from test classes during training, results in a significant loss in accuracy compared to the classic zero-shot learning setting. This is expected, since during training the test class embeddings act as distractors, as they are not coupled with any visual examples. In this section, we investigate the generalized zero-shot and generalized few-shot learning settings, namely the settings with the presence of either no or a few examples from test classes for training, respectively.
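The per-class averaged Top-k accuracy used throughout these evaluations can be computed as in the following sketch, assuming a score matrix over the full label space; the function and variable names are illustrative.

```python
import numpy as np

def per_class_topk_accuracy(scores, y_true, k=1):
    """Average over classes of the fraction of each class's images whose
    true label is among the k highest-scoring classes."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # (num_images, k) best class indices
    hit = np.any(topk == y_true[:, None], axis=1)      # per-image Top-k hit
    per_class = [hit[y_true == c].mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))
```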
We present the stability of our LatEm in this setting by running it on five dataset folds; error bars are shown in Figure 3.5. We report per-class averaged Top-1, Top-5 and Top-10 accuracy results with all four class embeddings, i.e. att (on CUB and AWA), w2v, glo and hie. We show the importance of visual data by increasing the number of images from 0 to 25, 100 and 50 on CUB, AWA and Dogs, respectively.

On CUB, although the att class embedding obtains the highest Top-1, Top-5 and Top-10 accuracy in both the generalized zero-shot and the generalized 2–5-shot settings, it is interesting to observe that the glo embedding reaches the same Top-5 and Top-10 accuracy once 10 samples are present and obtains the highest accuracy in all cases, i.e. Top-1, Top-5 and Top-10, when all 25 images are used for training. Another observation from the CUB results is that the results are stable across all five folds of the data. On AWA, a striking observation is how well the glo class embedding performs for Top-1, Top-5 and Top-10 accuracy in the generalized few-shot learning setting. With the presence of 100 images per class, the gap between the att and glo embeddings is 20% for both Top-1 and Top-5 accuracy. Also, on AWA, the accuracy difference between different class embeddings is quite high. This may be because AWA is a coarse-grained dataset, as a similar observation does not hold for CUB and Dogs. On Dogs, unlike the classical and generalized zero-shot learning results, the hie embedding is not the best performing class embedding in the generalized few-shot learning setting. On this dataset, w2v is the best performing embedding on all evaluation metrics, i.e. Top-1, Top-5 and Top-10 accuracy. Another observation from the Dogs results is that with the presence of 50 images per class during training, all class embeddings converge to the same value, i.e. the class embeddings lose their importance.

In conclusion, with an increasing number of additional training samples from unseen classes, the results improve significantly in all cases until the accuracy improvements gradually flatten out. These results show that with the availability of a large number of images from both seen and unseen classes, the contribution of class embeddings becomes less important. (Akata et al., 2015a) has shown that, using hand-crafted image features, the one-vs-rest SVM strategy becomes more favourable compared to embedding-based methods only with the availability of a large number of annotated images. Here, we show that by leveraging deep image features with even a few additional samples, i.e. 2, 5 or 10, we improve over human-annotated attributes and increase zero-shot accuracy by approximately 20%, as demonstrated by the results obtained on AWA.

3.5 conclusions

We presented a novel latent variable model, Latent Embeddings (LatEm), for learning a nonlinear (piecewise linear) compatibility function for the task of zero-shot classification. LatEm is a multi-modal method; it uses images and class-level side-information obtained either through human annotation or in an unsupervised way from a large text corpus. LatEm incorporates multiple linear compatibility units and allows each image to choose one of them – such choices being the latent variables. We proposed a ranking based objective to learn the model using an efficient and scalable SGD based solver.
We empirically validated our model on three challenging benchmark datasets for zero-shot classification of Birds, Dogs and Animals. We improved the state-of-the-art for zero-shot learning using unsupervised class embeddings on AWA up to 71.1% (vs. 60.1%) and on two fine-grained datasets, achieving 33.2% (vs. 29.9%) on CUB as well as 33.8% (vs. 26.7%) on Dogs. On AWA, we also improve the accuracy obtained with supervised class embeddings, obtaining 75.7% (vs. 73.9%). This demonstrates quantitatively that our method learns a latent structure in the embedding space through multiple compatibility units. We also presented a qualitative analysis of our results and showed that the latent embeddings learned with our method lead to visual consistencies. Our stability analysis on five dataset folds for all three benchmark datasets showed that our method generalizes well and does not overfit to the current dataset splits. We proposed a new method for selecting the number of latent variables automatically from the data by pruning. Such a pruning-based method speeds up the training and leads to models with competitive space-time complexities compared to the cross-validation based method.

We further extended our application domain to the generalized zero-shot and generalized few-shot learning settings, where at training time we assume the availability of either no or a few labeled samples from unseen classes. On the other hand, both at training and test time the search space includes all the class embeddings from seen and unseen classes. As expected, our evaluation in the generalized zero-shot learning setting showed a significant loss of accuracy compared to the standard zero-shot learning setting, which we analyzed through visualizations and quantitative results. Through these experiments we raised awareness that even state-of-the-art methods confuse two visually similar classes if one of them is an unseen class, i.e. the method has seen no samples from that class. Our evaluation in the generalized few-shot setting showed that with as few as two to ten samples from unseen classes, unsupervised class embeddings can outperform the supervised attributes. Therefore, with an increasing number of additional training samples, the difference between different class embeddings is reduced. As future work, we plan to further investigate the challenging yet realistic generalized zero-shot and generalized few-shot settings.

4 ZERO-SHOT LEARNING: THE GOOD, THE BAD AND THE UGLY

Contents
4.1 Introduction 52
4.2 Related Work 54
4.3 Evaluated Methods 55
    4.3.1 Learning Linear Compatibility 55
    4.3.2 Learning Nonlinear Compatibility 57
    4.3.3 Learning Intermediate Attribute Classifiers 57
    4.3.4 Hybrid Models 58
    4.3.5 Transductive Zero-Shot Learning Setting 59
4.4 Datasets 60
    4.4.1 Attribute Datasets 60
    4.4.2 Large-Scale ImageNet 62
4.5 Evaluation Protocol 63
    4.5.1 Image and Class Embedding 63
    4.5.2 Dataset Splits 64
    4.5.3 Evaluation Criteria 65
4.6 Experiments 66
    4.6.1 Zero-Shot Learning Experiments 66
    4.6.2 Generalized Zero-Shot Learning Results 74
    4.6.3 Transductive (Generalized) Zero-Shot Learning 76
4.7 Conclusion 77

In the previous chapter, we proposed a non-linear embedding function for better zero-shot learning performance. However, we realized that the evaluation settings of previous works are inconsistent, leading to incomparable results. Therefore, in this chapter, we introduce a better zero-shot image classification benchmark and evaluate state-of-the-art approaches under the same evaluation protocols. Our new evaluation protocol includes the conventional zero-shot learning setting, which predicts only novel classes, and the realistic generalized zero-shot learning setting, where both base and novel classes are evaluated. We also propose correct class splits where novel classes are not present in the pretraining dataset, e.g. ImageNet.

In Chapter 5, we adopt the evaluation setting introduced in this chapter and propose an efficient feature generation approach for the challenging generalized zero-shot learning task. In Chapter 6, we follow the same evaluation protocol, introduce a stronger feature generator by combining VAEs and GANs, and show that unlabeled data significantly improves the quality of generated features. Chapter 7 and Chapter 8 demonstrate our efforts in advancing zero-shot and few-shot learning for the semantic segmentation and video classification tasks.

4.1 introduction

Zero-shot learning aims to recognize objects whose instances may not have been seen during training (e.g. Lampert et al., 2013; Larochelle et al., 2008; Rohrbach et al., 2011; Yu and Aloimonos, 2010; Xu et al., 2017; Ding et al., 2017). The number of new zero-shot learning methods proposed every year has been increasing rapidly, i.e. the good aspects, as our title suggests. Although each new method has been shown to make progress over the previous one, it is difficult to quantify this progress without an established evaluation protocol, i.e. the bad aspects. In fact, the quest for improving numbers has led to even flawed evaluation protocols, i.e. the ugly aspects. Therefore, in this work, we propose to extensively evaluate a significant number of recent zero-shot learning methods in depth on several small to large-scale datasets using the same evaluation protocol, both in the zero-shot setting, i.e. training and test classes are disjoint, and in the more realistic generalized zero-shot learning setting, i.e. training classes are present at test time. Figure 4.1 presents an illustration of the zero-shot and generalized zero-shot learning tasks.

We benchmark and systematically evaluate zero-shot learning w.r.t. three aspects, i.e. methods, datasets and evaluation protocol. The crux of the matter for all zero-shot learning methods is to associate observed and unobserved classes through some form of auxiliary information which encodes visually distinguishing properties of objects. Different flavors of zero-shot learning methods that we evaluate in this work are linear (e.g.
Frome et al., 2013; Akata et al., 2013, 2015c; Romera-Paredes et al., 2015) and nonlinear (e.g. Xian et al., 2016; Socher et al., 2013) compatibility learning frameworks, which have dominated the zero-shot learning literature in the past few years, whereas an orthogonal direction is learning independent attribute classifiers (Lampert et al., 2013), and finally others (e.g. Zhang and Saligrama, 2015; Changpinyo et al., 2016; Norouzi et al., 2014) propose hybrid models between independent classifier learning and compatibility learning frameworks, which have demonstrated improved results over the compatibility learning frameworks both for the zero-shot and the generalized zero-shot learning settings.

We thoroughly evaluate the second aspect of zero-shot learning by using multiple splits of several small, medium and large-scale datasets (e.g. Patterson and Hays, 2012; Welinder et al., 2010; Lampert et al., 2013; Farhadi et al., 2009; Deng et al., 2009). Among these, the Animals with Attributes (AWA1) dataset (Lampert et al., 2013), introduced as a zero-shot learning dataset with per-class attribute annotations, has been one of the most widely used datasets for zero-shot learning. However, as the AWA1 images do not have public copyright licenses, only some image features of the AWA1 dataset, i.e. SIFT (Lowe, 2004), DECAF (Donahue et al., 2014) and VGG19 (Simonyan and Zisserman, 2014b), are publicly available, rather than the raw images. On the other hand, improving image features is a significant part of the progress both for supervised learning and for zero-shot learning. In fact, with the fast pace of deep learning, new deep neural network models that improve the ImageNet classification performance are being proposed every day. Without access to the images, those new DNN models cannot be evaluated on the AWA1 dataset. Therefore, with this work, we introduce the Animals with Attributes 2 (AWA2) dataset, which has roughly the same number of images, all with public licenses, and exactly the same number of classes and attributes as the AWA1 dataset. We will make both the ResNet (He et al., 2016) features of the AWA2 images and the images themselves publicly available.

Figure 4.1: Zero-shot learning (ZSL) vs generalized zero-shot learning (GZSL): At training time, for both cases the images and attributes of the seen classes (Ytr) are available. At test time, in the ZSL setting, the learned model is evaluated only on unseen classes (Yts), whereas in the GZSL setting, the search space contains both training and test classes (Ytr ∪ Yts). To facilitate classification without labels, both tasks use some form of side information, e.g. attributes. The attributes are annotated per class, therefore the labeling cost is significantly reduced.
We propose a unified evaluation protocol to address the third aspect of zero-shot learning, which is one of the most important ones. We emphasize the necessity of tuning the hyperparameters of the methods on a validation class split that is disjoint from the training classes, as improving zero-shot learning performance by tuning parameters on test classes violates the zero-shot assumption. We argue that per-class averaged top-1 accuracy is an important evaluation metric when the dataset is not well balanced with respect to the number of images per class. We point out that extracting image features via a deep neural network (DNN) pre-trained on a large dataset that contains zero-shot test classes also violates the zero-shot learning idea, as image feature extraction is a part of the training procedure. Moreover, we argue that demonstrating zero-shot performance on small-scale and coarse-grained datasets, e.g. aPY (Farhadi et al., 2009), is not conclusive. On the other hand, with this work we emphasize that it is hard to obtain labeled training data for fine-grained classes of rare objects, whose recognition requires expert opinion. Therefore, we argue that zero-shot learning methods should also be evaluated on the least populated or rare classes. We recommend abstracting away from the restricted nature of zero-shot evaluation and making the task more practical by including training classes in the search space, i.e. the generalized zero-shot learning setting. Therefore, we argue that our work plays an important role in advancing the zero-shot learning field by analyzing the good and bad aspects of the zero-shot learning task as well as proposing ways to eliminate the ugly ones.

4.2 related work

A more comprehensive literature review can be found in Chapter 2. Here we only discuss the relation of our benchmark to existing zero-shot learning evaluation protocols.

Zero-shot learning has been criticized for being a restrictive setup, as it comes with the strong assumption that images used at prediction time can only come from unseen classes. Therefore, the generalized zero-shot learning setting (Scheirer et al., 2013) has been proposed to generalize the zero-shot learning task to the case where both seen and unseen classes are used at test time. (Jain et al., 2014) argues that although ImageNet classification challenge performance has reached beyond human performance, we do not observe a similar behavior of the methods that compete in the detection challenge, which involves rejecting unknown objects while detecting the position and label of a known object. (Frome et al., 2013) uses label embeddings to operate in the generalized zero-shot learning setting, whereas (Zhang et al., 2016a) proposes to learn latent representations for images and classes through coupled linear regression of factorized joint embeddings. On the other hand, (Bendale and Boult, 2016) introduces a new model layer to the deep net which estimates the probability of an input being from an unknown class, and (Socher et al., 2013) proposes a novelty detection mechanism.

Although zero-shot vs generalized zero-shot learning evaluation works exist (Rohrbach et al., 2011; Chao et al., 2016) in the literature, our work stands out in multiple aspects. For instance, (Rohrbach et al., 2011) operates on ImageNet 1K, using 800 classes for training and 200 for testing.
One of the most comprehensive works, (Chao et al., 2016), provides a comparison between five methods evaluated on three datasets, including ImageNet with three standard splits, and proposes a metric to evaluate generalized zero-shot learning performance. On the other hand, we evaluate ten zero-shot learning methods on five datasets with several splits, both for the zero-shot and generalized zero-shot learning settings, provide statistical significance and robustness tests, and present other valuable insights that emerge from our benchmark. In this sense, ours is the most extensive evaluation of the zero-shot and generalized zero-shot learning tasks in the literature.

4.3 evaluated methods

We start by formalizing the zero-shot learning task and then describe the zero-shot learning methods that we evaluate in this work. Given a training set S = {(xn, yn), n = 1...N}, with yn ∈ Ytr belonging to the training classes, the task is to learn f : X → Y by minimizing the regularized empirical risk:

(1/N) ∑_{n=1}^{N} L(yn, f(xn; W)) + Ω(W)    (4.1)

where L(.) is the loss function and Ω(.) is the regularization term. Here, the mapping f : X → Y from input to output embeddings is defined as:

f(x; W) = argmax_{y ∈ Y} F(x, y; W)    (4.2)

At test time, in the zero-shot learning setting, the aim is to assign a test image to an unseen class label, i.e. Yts ⊂ Y, and in the generalized zero-shot learning setting, the test image can be assigned either to seen or unseen classes, i.e. Ytr+ts ⊂ Y, with the highest compatibility score.

4.3.1 Learning Linear Compatibility

Attribute Label Embedding (ALE) (Akata et al., 2015a), Deep Visual Semantic Embedding (DEVISE) (Frome et al., 2013) and Structured Joint Embedding (SJE) (Akata et al., 2015c) use a bilinear compatibility function to associate visual and auxiliary information:

F(x, y; W) = θ(x)^T W φ(y)    (4.3)

where θ(x) and φ(y), i.e. the image and class embeddings, are both given. F(.) is parameterized by the mapping W, which is to be learned. Given an image, compatibility learning frameworks predict the class which attains the maximum compatibility score with the image.

Among the methods that are detailed below, ALE (Akata et al., 2015a), DEVISE (Frome et al., 2013) and SJE (Akata et al., 2015c) do early stopping to implicitly regularize Stochastic Gradient Descent (SGD), while ESZSL (Romera-Paredes et al., 2015) and SAE (Kodirov et al., 2017) explicitly regularize the embedding model as detailed below. In the following, we provide a unified formulation of these five zero-shot learning methods.

DEVISE (Frome et al., 2013) uses a pairwise ranking objective inspired by the unregularized ranking SVM (Joachims, 2002):

∑_{y ∈ Ytr} [∆(yn, y) + F(xn, y; W) − F(xn, yn; W)]_+    (4.4)

where ∆(yn, y) is equal to 1 if yn ≠ y, and 0 otherwise. The objective function is convex and is optimized by Stochastic Gradient Descent.

ALE (Akata et al., 2015a) uses the weighted approximate ranking objective (Usunier et al., 2009) for zero-shot learning in the following way:

∑_{y ∈ Ytr} (l_{r∆(xn,yn)} / r∆(xn,yn)) [∆(yn, y) + F(xn, y; W) − F(xn, yn; W)]_+    (4.5)

where l_k = ∑_{i=1}^{k} α_i and r∆(xn, yn) is defined as:

r∆(xn, yn) = ∑_{y ∈ Ytr} 1(F(xn, y; W) + ∆(yn, y) ≥ F(xn, yn; W))    (4.6)

Following the heuristic in (Weston et al., 2011), (Akata et al., 2015a) selects αi = 1/i, which puts a high emphasis on the top of the rank list.
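As an illustration of this shared formulation, the sketch below implements the bilinear compatibility of Equation 4.3 and one SGD step of a DEVISE-style pairwise ranking objective (Equation 4.4). It is a simplification under stated assumptions: a single randomly sampled negative class per step and illustrative variable names, rather than the exact training procedure of (Frome et al., 2013).

```python
import numpy as np

def F(theta_x, phi_y, W):
    """Bilinear compatibility F(x, y; W) = theta(x)^T W phi(y)  (Equation 4.3)."""
    return theta_x @ W @ phi_y

def devise_sgd_step(theta_x, y_n, Phi, W, lr=1e-3, delta=1.0, rng=None):
    """One step on [delta(y_n, y) + F(x, y) - F(x, y_n)]_+  (Equation 4.4),
    sampling a single negative class y != y_n. Phi has shape (num_classes, d_phi)."""
    rng = rng or np.random.default_rng()
    y = rng.choice([c for c in range(Phi.shape[0]) if c != y_n])
    if delta + F(theta_x, Phi[y], W) - F(theta_x, Phi[y_n], W) > 0:
        W += lr * np.outer(theta_x, Phi[y_n] - Phi[y])   # descend the hinge gradient
    return W
```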
SJE (Akata et al., 2015c) gives the full weight to the top of the ranked list and is inspired by the structured SVM (Tsochantaridis et al., 2005):

[max_{y ∈ Ytr} (∆(yn, y) + F(xn, y; W)) − F(xn, yn; W)]_+    (4.7)

The prediction can only be made after computing the score against all the classifiers, i.e. so as to find the maximum violating class, which makes SJE less efficient than DEVISE and ALE.

ESZSL (Romera-Paredes et al., 2015) applies a square loss to the ranking formulation and adds the following implicit regularization term to the unregularized risk minimization formulation:

γ‖Wφ(y)‖² + λ‖θ(x)^T W‖² + β‖W‖²    (4.8)

where γ, λ and β are regularization parameters. The first two terms bound the Euclidean norm of the projected attributes in the feature space and of the projected image features in the attribute space, respectively. The advantage of this approach is that the objective function is convex and has a closed-form solution.

SAE (Kodirov et al., 2017) also learns a linear projection from the image embedding space to the class embedding space, but it further constrains that the projection must be able to reconstruct the original image embedding. Similar to the linear auto-encoder, SAE optimizes the following objective:

min_W ‖θ(x) − W^T φ(y)‖² + λ‖Wθ(x) − φ(y)‖²    (4.9)

where λ is a hyperparameter to be tuned. The optimization problem can be transformed such that the Bartels-Stewart algorithm (Bartels and Stewart, 1972) is able to solve it efficiently.

4.3.2 Learning Nonlinear Compatibility

Latent Embeddings (LATEM) (Xian et al., 2016) and Cross Modal Transfer (CMT) (Socher et al., 2013) add a non-linearity component to the linear compatibility learning framework.

LATEM (Xian et al., 2016) constructs a piecewise linear compatibility:

F(x, y; Wi) = max_{1 ≤ i ≤ K} θ(x)^T Wi φ(y)    (4.10)

where every Wi models a different visual characteristic of the data, the selection of which matrix to use for the mapping is a latent variable, and K is a hyperparameter to be tuned. LATEM uses the ranking loss formulated in Equation 4.4 and Stochastic Gradient Descent as the optimizer.

CMT (Socher et al., 2013) first maps images into a semantic space of words, i.e. class names, where a neural network with a tanh nonlinearity learns the mapping:

∑_{y ∈ Ytr} ∑_{x ∈ Xy} ‖φ(y) − W1 tanh(W2 · θ(x))‖²    (4.11)

where (W1, W2) are the weights of the two-layer neural network. This is followed by a novelty detection mechanism that assigns images to unseen or seen classes. The novelty is detected either via thresholds learned using the embedded images of the seen classes, or via outlier probabilities obtained in an unsupervised way. As zero-shot learning assumes that test images are only from unseen classes, in our experiments when we refer to CMT we do not use the novelty detection component. On the other hand, we name the CMT with novelty detection CMT* when we apply it to the generalized zero-shot learning setting.

4.3.3 Learning Intermediate Attribute Classifiers

Although Direct Attribute Prediction (DAP) (Lampert et al., 2013) and Indirect Attribute Prediction (IAP) (Lampert et al., 2013) have been shown to perform poorly compared to compatibility learning frameworks (Akata et al., 2015a), we include them in our evaluation for being historically the most widely used methods in the literature.

DAP (Lampert et al., 2013) learns probabilistic attribute classifiers and makes a class prediction by combining the scores of the learned attribute classifiers.
A novel image is assigned to one of the unknown classes using:

f(x) = argmax_c ∏_{m=1}^{M} p(a_m^c | x) / p(a_m^c)    (4.12)

where M is the total number of attributes, a_m^c is the m-th attribute of class c, p(a_m^c | x) is the attribute probability given image x, which is obtained from the attribute classifiers, and p(a_m^c) is the attribute prior estimated by the empirical mean of attributes over the training classes. We train binary classifiers with logistic regression that give probability scores of attributes with respect to the training classes.

IAP (Lampert et al., 2013) indirectly estimates the attribute probabilities of an image by first predicting the probabilities of each training class and then multiplying with the class-attribute matrix. Once the attribute probabilities are obtained by the following equation:

p(a_m | x) = ∑_{k=1}^{K} p(a_m | y_k) p(y_k | x)    (4.13)

where K is the number of training classes, p(a_m | y_k) is the predefined class attribute and p(y_k | x) is the training class posterior from a multi-class classifier, Equation 4.12 is used to predict the class label. For this, we train a multi-class classifier on the training classes with logistic regression.

4.3.4 Hybrid Models

Semantic Similarity Embedding (SSE) (Zhang and Saligrama, 2015), Convex Combination of Semantic Embeddings (CONSE) (Norouzi et al., 2014) and Synthesized Classifiers (SYNC) (Changpinyo et al., 2016) express images and semantic class embeddings as a mixture of seen class proportions, hence we group them as hybrid models.

SSE (Zhang and Saligrama, 2015) leverages similar class relationships both in the image and semantic embedding space. An image is labeled with:

argmax_{u ∈ U} π(θ(x))^T ψ(φ(yu))    (4.14)

where π and ψ map the image and class embeddings, respectively, into a common space defined by the mixture of seen class proportions. Specifically, ψ is learned by sparse coding and π by a class-dependent transformation.

CONSE (Norouzi et al., 2014) learns the probability of a training image belonging to a training class:

f(x, t) = argmax_{y ∈ Ytr} p_tr(y | x)    (4.15)

where y denotes the most likely training label (t = 1) for image x. A combination of semantic embeddings (s) is used to assign an unknown image to an unseen class:

(1/Z) ∑_{t=1}^{T} p_tr(f(x, t) | x) · s(f(x, t))    (4.16)

where Z = ∑_{t=1}^{T} p_tr(f(x, t) | x), f(x, t) denotes the t-th most likely label for image x and T controls the maximum number of semantic embedding vectors.

SYNC (Changpinyo et al., 2016) learns a mapping between the semantic class embedding space and a model space. In the model space, training classes and a set of phantom classes form a weighted bipartite graph. The objective is to minimize the distortion error:

min_{wc} ‖wc − ∑_{r=1}^{R} s_cr vr‖²₂    (4.17)

The semantic and model spaces are aligned by embedding the classifiers of real classes (wc) and the classifiers of phantom classes (vr) in the weighted graph (s_cr). The classifiers for novel classes are constructed by linearly combining the classifiers of phantom classes.

GFZSL (Verma and Rai, 2017) proposes a generative framework for zero-shot learning by modeling each class-conditional distribution as a multivariate Gaussian with mean vector µ and diagonal covariance matrix σ. While the parameters of seen classes can be estimated by MLE, those of unseen classes are computed by learning the following two regression functions:

µ_y = f_µ(φ(y)) and σ_y = f_σ(φ(y))    (4.18)

Given an image x, its class is predicted by searching for the class with the maximum probability, i.e. argmax_y p(x | σ_y, µ_y).
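A minimal sketch of such a generative framework, assuming ridge regressors for f_µ and f_σ and diagonal Gaussians; the log-variance parameterization, the regressor choice and all names are assumptions of this sketch, not necessarily the choices of (Verma and Rai, 2017).

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_gfzsl(X, y, Phi_seen, Phi_unseen, alpha=1.0):
    """Estimate per-seen-class mean/variance by MLE, then regress them from
    class embeddings to obtain the unseen-class Gaussians (cf. Equation 4.18).
    Phi_seen rows are assumed to be ordered as np.unique(y)."""
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])
    sigmas = np.stack([X[y == c].var(axis=0) + 1e-6 for c in classes])
    f_mu = Ridge(alpha=alpha).fit(Phi_seen, mus)
    f_sigma = Ridge(alpha=alpha).fit(Phi_seen, np.log(sigmas))  # log keeps predicted variances positive
    return f_mu.predict(Phi_unseen), np.exp(f_sigma.predict(Phi_unseen))

def gfzsl_predict(x, mu_u, sigma_u):
    """argmax_y p(x | mu_y, sigma_y) for diagonal Gaussians (log-likelihood up to constants)."""
    loglik = -0.5 * np.sum((x - mu_u) ** 2 / sigma_u + np.log(sigma_u), axis=1)
    return int(np.argmax(loglik))
```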
4.3.5 Transductive Zero-Shot Learning Setting

In zero-shot learning, the transductive setting (Chapelle et al., 2009; Zhou et al., 2004) implies that unlabeled images from unseen classes are available during training. Using unlabeled images is expected to improve performance, as they possibly contain useful latent information about the unseen classes. Here, we mainly focus on two state-of-the-art transductive approaches (Verm and Rai, 2017; Ye and Guo, 2017) and show how to extend ALE (Akata et al., 2015a) to the transductive learning setting.

GFZSL-tran (Verm and Rai, 2017) uses an Expectation-Maximization (EM) based procedure that alternates between inferring the labels of unlabeled examples of unseen classes and using the inferred labels to update the parameter estimates of the unseen class distributions. Since the class-conditional distribution is assumed to be Gaussian, this procedure is equivalent to repeatedly estimating a Gaussian Mixture Model (GMM) with the unlabeled data from unseen classes and using the inferred class labels to re-estimate the GMM.

DSRL (Ye and Guo, 2017) proposes to simultaneously learn image features with non-negative matrix factorization and align them with their corresponding class attributes. This step gives an initial prediction score matrix $S_0$, in which each row corresponds to one instance and holds its prediction scores for all unseen classes. To improve the prediction score matrix by transductive learning, a graph-based label propagation algorithm is applied. Specifically, a KNN graph is constructed with the projected instances of unseen classes in the class embedding space:

$M_{ij} = \begin{cases} \exp\!\left( -\frac{d(x_i, x_j)}{2\sigma^2} \right) & \text{if } i \in \mathrm{KNN}(j) \text{ or } j \in \mathrm{KNN}(i) \\ 0 & \text{otherwise} \end{cases}$   (4.19)

where $\mathrm{KNN}(i)$ denotes the k-nearest neighbors of the $i$-th instance and $d(x_i, x_j)$ measures the Euclidean distance between $x_i$ and $x_j$. Given the affinity matrix $M$, a normalized Laplacian matrix $L$ can be computed as $L = Q^{-1/2} M Q^{-1/2}$, where $Q$ is a diagonal matrix with $Q_{ii} = \sum_j M_{ij}$. Finally, standard label propagation (Zhou et al., 2004) gives the closed-form solution:

$S = (I - \alpha L)^{-1} S_0$   (4.20)

where $\alpha \in [0, 1]$ is a regularization trade-off parameter and $S$ is the refined score matrix. The class label of an instance is predicted by searching for the class with the highest score, i.e. $\arg\max_y S_{iy}$.

ALE-tran. Any compatibility learning method that explicitly learns a cross-modal mapping from the image feature space to the class embedding space can be extended to the transductive setting following the label propagation procedure of DSRL (Ye and Guo, 2017). Taking ALE (Akata et al., 2015a) as an example, after learning the linear mapping $W$, the instances of unseen classes can be projected into the class embedding space and a score matrix $S_0$ can be computed in the same way.
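The graph-based refinement shared by DSRL and ALE-tran (Equations 4.19 and 4.20) can be written in a few lines; below is a minimal NumPy sketch under the definitions above (k-NN Gaussian affinity, symmetric normalization, closed-form propagation), with hypothetical variable names rather than the published code.

```python
import numpy as np

def propagate_scores(Z, S0, k=10, sigma=1.0, alpha=0.5):
    """Refine initial unseen-class scores S0 by label propagation over a k-NN
    graph built from the projected instances Z (Equations 4.19 and 4.20).

    Z:  (n, e) unseen-class instances projected into the class embedding space
    S0: (n, C) initial prediction scores for the C unseen classes
    """
    n = len(Z)
    dist = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    # symmetric k-NN affinity: M_ij > 0 if i in KNN(j) or j in KNN(i)
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]
    M = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    M[rows, knn.ravel()] = np.exp(-dist[rows, knn.ravel()] / (2 * sigma ** 2))
    M = np.maximum(M, M.T)
    # normalized graph matrix L = Q^{-1/2} M Q^{-1/2}
    q = np.clip(M.sum(axis=1), 1e-12, None)
    L = M / np.sqrt(np.outer(q, q))
    # closed-form label propagation S = (I - alpha L)^{-1} S0
    S = np.linalg.solve(np.eye(n) - alpha * L, S0)
    return S.argmax(axis=1)
```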
4.4 Datasets

Among the most widely used datasets for zero-shot learning, we select two coarse-grained datasets, one small (aPY (Farhadi et al., 2009)) and one medium-scale (AWA1 (Lampert et al., 2013)), two fine-grained, medium-scale datasets with attributes (SUN (Patterson and Hays, 2012), CUB (Welinder et al., 2010)), and one large-scale dataset without attributes (ImageNet (Deng et al., 2009)). Here, we consider between 10K and 1M images and between 100 and 1K classes as medium-scale. Details of the dataset statistics, in terms of the number of images, classes and attributes of the attribute datasets, are given in Table 4.1. Furthermore, we introduce our Animals With Attributes 2 (AWA2) dataset and position it with respect to the existing datasets.

4.4.1 Attribute Datasets

Attribute Pascal and Yahoo (aPY) (Farhadi et al., 2009) is a small-scale, coarse-grained dataset with 64 attributes. Among the total of 32 classes, 20 Pascal classes are used for training (we randomly select 5 for validation) and 12 Yahoo classes are used for testing.

The original Animals with Attributes (AWA1) (Lampert et al., 2013) is a coarse-grained dataset that is medium-scale in terms of the number of images, i.e. 30,475, and small-scale in terms of the number of classes, i.e. 50. (Lampert et al., 2013) introduces a standard zero-shot split with 40 classes for training (we randomly select 13 classes for validation) and 10 classes for testing. AWA1 has 85 attributes.

Figure 4.2: Comparing AWA1 (Lampert et al., 2013) and our AWA2 in terms of number of images (Left) and t-SNE embedding of the image features (the embedding is learned on AWA1 and AWA2 simultaneously, therefore the figures are comparable). AWA2 follows a similar distribution as AWA1 and it contains more examples.

Caltech-UCSD-Birds 200-2011 (CUB) (Welinder et al., 2010) is a fine-grained, medium-scale dataset with respect to both the number of images and the number of classes, i.e. 11,788 images from 200 different types of birds annotated with 312 attributes. (Akata et al., 2015a) introduces the first zero-shot split of CUB with 150 training classes (50 of which are validation classes) and 50 test classes.

SUN (Patterson and Hays, 2012) is a fine-grained, medium-scale dataset with respect to both the number of images and the number of classes: it contains 14,340 images from 717 types of scenes annotated with 102 attributes. Following (Lampert et al., 2013), we use 645 classes of SUN for training (we randomly select 65 classes for validation) and 72 classes for testing.

Animals with Attributes 2 (AWA2) Dataset. One disadvantage of the AWA1 dataset is that its images are not publicly available. As highly descriptive image features are an important component of zero-shot learning, in order to enable vision research on the objects of the AWA1 dataset we introduce the Animals with Attributes 2 (AWA2) dataset. Following (Lampert et al., 2013), we collect 37,322 images for the 50 classes of the AWA1 dataset from public web sources, i.e. Flickr, Wikipedia, etc., making sure that all images of AWA2 have free-use and redistribution licenses and do not overlap with the images of the original Animals with Attributes dataset. AWA2 uses the same 50 animal classes as AWA1, and the 85 binary and continuous class attributes are likewise shared. In total, AWA2 has 37,322 images compared to the 30,475 images of AWA1. On average, each class contains 746 images; the least populated class, i.e. mole, has 100 examples and the most populated class, i.e. horse, has 1645. Some example images from the polar bear, zebra, otter and tiger classes, along with sample attributes of our AWA2 dataset, are shown in Figure 8.1. In Figure 4.2, we compare AWA2 with AWA1 in terms of the number of images and the distribution of the image features. Compared to AWA1, our proposed AWA2 dataset contains more images, e.g. for horse and dolphin among the test classes and antelope and cow among the training classes. Moreover, the t-SNE embedding of the test classes with more training data, e.g. horse, dolphin, seal, etc., shows that AWA2 leads to slightly more visible clusters of ResNet features.
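The comparison of feature distributions in Figure 4.2 relies on a joint two-dimensional embedding of the image features of both datasets. A minimal sketch of this kind of visualization, assuming scikit-learn and pre-extracted ResNet features (the file names are placeholders, not files we provide):

```python
import numpy as np
from sklearn.manifold import TSNE

# feats_awa1, feats_awa2: (n1, 2048) and (n2, 2048) ResNet-101 features
feats_awa1 = np.load("awa1_resnet101.npy")   # hypothetical file names
feats_awa2 = np.load("awa2_resnet101.npy")

# embed both datasets together so the two scatter plots live in the same space
joint = np.concatenate([feats_awa1, feats_awa2], axis=0)
emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)
emb_awa1, emb_awa2 = emb[:len(feats_awa1)], emb[len(feats_awa1):]
```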
The images, their labels and the ResNet features of our AWA2 are publicly available at http://cvml.ist.ac.at/AwA2.

4.4.2 Large-Scale ImageNet

We also evaluate the performance of the methods on the large-scale ImageNet (Deng et al., 2009), which contains a total of 14 million images from 21K classes, each labeled with a single label; the classes are hierarchically related, as ImageNet follows WordNet (Miller, 1995). ImageNet is a natural fit for zero-shot and generalized zero-shot learning as there is a large class imbalance problem. Moreover, ImageNet is diverse in terms of granularity, i.e. it contains a collection of fine-grained subsets, e.g. different vehicle types, as well as coarse-grained ones. The most populated class contains 3,047 images, whereas many classes contain only a single image. A balanced subset of ImageNet with 1K classes, containing about 1000 images each, is used to train CNNs. Previous work (Rohrbach et al., 2011) proposed to split this balanced subset of 1K classes into 800 training and 200 test classes. In this work, from the total of 21K classes, we use 1K classes for training (among which we use 200 classes for validation), and the test split is either all the remaining 20K classes or a subset of them, e.g. determined based on the hierarchical distance between classes and the population of the classes. The details of these splits are provided in the following section.

Table 4.1: Statistics for SUN (Patterson and Hays, 2012), CUB (Welinder et al., 2010), AWA1 (Lampert et al., 2013), the proposed AWA2 and aPY (Farhadi et al., 2009) in terms of number of attributes (Att), number of classes (Y, Ytr, Yts), and number of images at training and evaluation time (reported as Ytr / Yts) for the standard split (SS) and our proposed split (PS).

Dataset  Att  Y    Ytr       Yts  Total   Train SS     Train PS     Eval SS     Eval PS
SUN      102  717  580 + 65  72   14340   12900 / 0    10320 / 0    0 / 1440    2580 / 1440
CUB      312  200  100 + 50  50   11788    8855 / 0     7057 / 0    0 / 2933    1764 / 2967
AWA1      85   50   27 + 13  10   30475   24295 / 0    19832 / 0    0 / 6180    4958 / 5685
AWA2      85   50   27 + 13  10   37322   30337 / 0    23527 / 0    0 / 6985    5882 / 7913
aPY       64   32   15 + 5   12   15339   12695 / 0     5932 / 0    0 / 2644    1483 / 7924

4.5 Evaluation Protocol

In this section, we describe several components of previously used and of our proposed ZSL and GZSL evaluation protocols, e.g. image and class encodings, dataset splits and the evaluation criteria.

4.5.1 Image and Class Embedding

We extract image features, namely image embeddings, from the entire image for SUN, CUB, AWA1, our AWA2 and ImageNet, with no image pre-processing. For aPY, following the original publication (Farhadi et al., 2009), we crop the images with the provided bounding boxes. Our image embeddings are the 2048-dim top-layer pooling units of the 101-layered ResNet (He et al., 2016), as we found that they perform better than the 1,024-dim top-layer pooling units of GoogleNet (Szegedy et al., 2015). We use the original ResNet-101 pre-trained on ImageNet with 1K classes, i.e. the balanced subset, and we do not fine-tune it for any of the mentioned datasets. In addition to the ResNet features, we re-evaluate all methods with their published image features. In zero-shot learning, class embeddings are as important as image features.
As class embeddings for aPY, AWA1, AWA2, CUB and SUN, we use the per-class attributes with values between 0 and 1 that are provided with the datasets, since binary attributes have been shown (Akata et al., 2015a) to be weaker than continuous attributes. For ImageNet, as attributes of the 21K classes are not available, we use Word2Vec (Mikolov et al., 2013b) trained on Wikipedia, provided by (Changpinyo et al., 2016). Note that an evaluation of class embeddings is out of the scope of this chapter; we refer the reader to (Akata et al., 2015c) for more details on the topic. Our benchmark is available at http://www.mpi-inf.mpg.de/zsl-benchmark.

4.5.2 Dataset Splits

Zero-shot learning assumes disjoint training and test classes. Hence, as deep neural network (DNN) training for image feature extraction is actually a part of model training, the dataset used to train the DNN, e.g. ImageNet, should not include any of the test classes. However, we notice from the standard splits (SS) of the aPY and AWA1 datasets that 7 of the 12 aPY test classes (monkey, wolf, zebra, mug, building, bag, carriage) and 6 of the 10 AWA1 test classes (chimpanzee, giant panda, leopard, persian cat, pig, hippopotamus) are among the 1K classes of ImageNet, i.e. are used to pre-train ResNet. Similarly, the most widely used splits, which we term standard splits (SS), for SUN from (Lampert et al., 2013) and for CUB from (Akata et al., 2013) show that 1 of the 50 CUB test classes (Indigo Bunting) and 6 of the 72 SUN test classes (restaurant, supermarket, planetarium, tent, market, bridge) are also among the 1K classes of ImageNet. We noticed that, for all methods, the accuracy on these overlapping test classes is higher than on the others. Therefore, we propose new dataset splits, i.e. proposed splits (PS), ensuring that none of the test classes appear in the ImageNet 1K set used to train the ResNet model. We present the differences between the standard splits (SS) and the proposed splits (PS) in Table 4.1. While in both SS and PS no image from the test classes is present at training time, at test time our PS includes images from the training classes. We designed the PS this way because evaluating accuracy on both training and test classes is crucial to show the generalization of the methods.

For SUN, CUB, AWA1, aPY and our proposed AWA2 dataset, in order to measure the significance of the results, we propose 3 different splits of 580, 100, 27, 15 and 27 training classes respectively, while keeping the 72, 50, 10, 12 and 10 test classes the same. It is important to perform the hyperparameter search on disjoint validation sets of 65, 50, 13, 5 and 13 classes respectively. We keep the number of classes the same for SS and PS, but we choose different classes while making sure that the test classes do not overlap with the 1K training classes of ImageNet.

ImageNet offers several possible zero-shot evaluation splits. Following (Changpinyo et al., 2016), our first two standard splits consider all the classes that are 2 hops and 3 hops away from the original 1K classes according to the ImageNet label hierarchy, corresponding to 1509 and 7678 classes. These splits measure the generalization ability of the models with respect to the hierarchical and semantic similarity between classes. As discussed in the previous section, another characteristic of ImageNet is the imbalanced sample size.
Therefore, our proposed splits consider the 500, 1K and 5K most populated classes among the remaining 20K classes of ImageNet, with approximately 1756, 1624 and 1335 images per class on average. Similarly, we consider the 500, 1K and 5K least populated classes of ImageNet, which correspond to the most fine-grained subsets of ImageNet, with approximately 1, 3 and 51 images per class on average. Finally, we measure the generalization of the methods to the entire ImageNet data distribution by considering a final split of all the remaining approximately 20K classes of ImageNet with at least 1 image per class, i.e. approximately 631 images per class on average.

4.5.3 Evaluation Criteria

Single-label image classification accuracy has been measured with top-1 accuracy, i.e. a prediction is accurate when the predicted class is the correct one. If the accuracy is averaged over all images, high performance on densely populated classes is encouraged. However, we are interested in high performance also on sparsely populated classes. Therefore, we average the correct predictions independently for each class before dividing their cumulative sum by the number of classes, i.e. we measure the average per-class top-1 accuracy:

$acc_{\mathcal{Y}} = \frac{1}{\|\mathcal{Y}\|} \sum_{c=1}^{\|\mathcal{Y}\|} \frac{\#\,\text{correct predictions in } c}{\#\,\text{samples in } c}$   (4.21)

In the generalized zero-shot learning setting, the search space at evaluation time is not restricted to the test classes ($\mathcal{Y}^{ts}$) but also includes the training classes ($\mathcal{Y}^{tr}$), which makes this setting more practical. Since with our proposed split we have access to some images from training classes at test time, after computing the average per-class top-1 accuracy on training and test classes we compute the harmonic mean of the training and test accuracies:

$H = \frac{2 * acc_{\mathcal{Y}^{tr}} * acc_{\mathcal{Y}^{ts}}}{acc_{\mathcal{Y}^{tr}} + acc_{\mathcal{Y}^{ts}}}$   (4.22)

where $acc_{\mathcal{Y}^{tr}}$ and $acc_{\mathcal{Y}^{ts}}$ represent the accuracy on images from seen ($\mathcal{Y}^{tr}$) and unseen ($\mathcal{Y}^{ts}$) classes, respectively. We choose the harmonic mean rather than the arithmetic mean as our evaluation criterion because, with the arithmetic mean, a much higher seen class accuracy affects the overall result significantly, whereas our aim is high accuracy on both seen and unseen classes.

Table 4.2: Reproducing zero-shot results with methods that have a public implementation: O = Original results, R = Reproduced using the provided image features and code. We measure top-1 accuracy in %. −: image features are not provided in the original paper for this dataset. Top: ZSL, Bottom: transductive ZSL.

             SUN          CUB          AWA1         aPY
Model        R     O      R     O      R     O      R     O
DAP          22.1  22.2   −     −      41.4  41.4   19.1  19.1
SSE          83.0  82.5   44.2  30.4   64.9  76.3   45.7  46.2
LATEM        −     −      45.1  45.5   71.2  71.9   −     −
SJE          −     −      50.1  50.1   67.2  66.7   −     −
ESZSL        64.3  65.8   −     −      48.0  49.3   14.3  15.1
SYNC         62.8  62.8   53.4  53.4   69.7  69.7   −     −
SAE          −     −      −     −      84.7  84.7   −     −
GFZSL        86.5  86.5   56.6  56.5   80.4  80.8   −     −
GFZSL-tran   87.0  87.0   63.8  63.7   94.9  94.3   −     −
DSRL         86.0  85.4   57.6  57.1   87.7  87.2   47.8  51.3
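Both evaluation criteria are easy to get wrong (per-image instead of per-class averaging, arithmetic instead of harmonic mean), so we make Equations 4.21 and 4.22 concrete below. This is a small illustrative sketch, not the benchmark code, and the function names are ours.

```python
import numpy as np

def per_class_top1(y_true, y_pred):
    """Average per-class top-1 accuracy (Equation 4.21)."""
    classes = np.unique(y_true)
    accs = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_tr, acc_ts):
    """Harmonic mean of seen- and unseen-class accuracy (Equation 4.22)."""
    if acc_tr + acc_ts == 0:
        return 0.0
    return 2 * acc_tr * acc_ts / (acc_tr + acc_ts)

# GZSL usage: evaluate seen and unseen test images separately, then combine
# acc_tr = per_class_top1(y_true_seen, y_pred_seen)
# acc_ts = per_class_top1(y_true_unseen, y_pred_unseen)
# H = harmonic_mean(acc_tr, acc_ts)
```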
4.6 Experiments

We first provide ZSL results on the attribute datasets SUN, CUB, AWA1, AWA2 and aPY, then on the large-scale ImageNet dataset. Finally, we present results for the GZSL setting.

4.6.1 Zero-Shot Learning Experiments

On the attribute datasets, i.e. SUN, CUB, AWA1, AWA2 and aPY, we first reproduce the results of each method using its own evaluation protocol, then provide a unified evaluation protocol using the same train/val/test class splits, followed by our proposed train/val/test class splits on SUN, CUB, AWA1, aPY and AWA2. We also evaluate the robustness of the methods to parameter tuning and visualize the ranking of the different methods. Finally, we evaluate the methods on the large-scale ImageNet dataset.

Comparing State-of-The-Art Models. As a sanity check, we re-evaluate the methods of (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016) and (Kodirov et al., 2017) using the publicly available features and code from the original publications on SUN, CUB, AWA1 and aPY (CMT (Socher et al., 2013) evaluates on the CIFAR dataset). We observe from the results in Table 4.2 that our reproduced results for DAP (Lampert et al., 2013), SYNC (Changpinyo et al., 2016), GFZSL (Verm and Rai, 2017), GFZSL-tran (Verm and Rai, 2017), DSRL (Ye and Guo, 2017) and SAE (Kodirov et al., 2017) are nearly identical to the numbers reported in the original publications. For LATEM (Xian et al., 2016), we obtain slightly different results, which can be explained by the non-convexity of its objective and hence its sensitivity to initialization. Similarly, for SJE (Akata et al., 2015c), random sampling in SGD might lead to slightly different results. ESZSL (Romera-Paredes et al., 2015) has some variance because its algorithm randomly picks a validation set during each run, which leads to different hyperparameters.

Notable observations on the SSE (Zhang and Saligrama, 2015) results are as follows. The published code has hyperparameters hard-coded for aPY, i.e. the number of iterations, the number of data points used to train the SVM, and one regularization parameter γ, which lead to inferior results compared to the ones reported here; we therefore set these parameters on the validation sets. On SUN, SSE uses 10 classes (instead of 72), and our results with validated parameters improve by 0.5%, which may be due to random sampling of training images. On AWA1, our reproduced result of 64.9% is significantly lower than the reported 76.3%, and we could not reach the reported result even by tuning the parameters on the test set (73.8%).

In addition to (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Kodirov et al., 2017), we re-implement (Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a) based on the original publications. We use the train, validation and test splits provided in Table 4.1 and report results in Table 4.3 with deep ResNet features. DAP (Lampert et al., 2013) uses hand-crafted image features, and the results reproduced with those features are thus significantly lower than the results with deep features (22.1% vs 38.9%). When we investigated the results in detail, we noticed two irregularities with the reported results on SUN. First, SSE (Zhang and Saligrama, 2015) and ESZSL (Romera-Paredes et al., 2015) report results on a test split with 10 classes, whereas the standard split of SUN contains 72 test classes (74.5% vs 54.5% with SSE (Zhang and Saligrama, 2015) and 64.3% vs 57.3% with ESZSL (Romera-Paredes et al., 2015)).
Second, after careful examination and correspondence with the authors of SYNC (Changpinyo et al., 2016), we detected that the SUN features were extracted with a model pre-trained on MIT Places (Zhou et al., 2014). As the MIT Places dataset intersects with both training and test classes of SUN, it is expected to lead to significantly better results than ImageNet pre-trained models (62.8% vs 59.1%). In addition, while SAE (Kodirov et al., 2017) reported 84.7% on AWA1, we obtain only 80.7% on the standard split. This can be explained by two differences. First, we measure per-class accuracy, whereas SAE (Kodirov et al., 2017) reports per-image accuracy, which is typically higher when the dataset is class-imbalanced, as AWA1 is; indeed, their reported accuracy decreases to 82.0% if per-class accuracy is applied. Second, we confirmed with the authors of SAE (Kodirov et al., 2017) that they improved GoogleNet (Szegedy et al., 2015) by adding Batch Normalization and by averaging 5 randomly cropped images to obtain better image features. As expected, improving the visual features leads to improved zero-shot learning results.

Promoting Our Proposed Splits (PS). We propose new dataset splits (see details in Section 4.4) ensuring that the test classes of none of the datasets overlap with the ImageNet 1K classes used to pre-train ResNet. As training ResNet is a part of the training procedure, including test classes in the dataset used for pre-training ResNet would violate the zero-shot learning conditions. We compare the results obtained with our proposed split (PS) to the previously published standard split (SS) results in Table 4.3.

Table 4.3: Zero-shot learning results on SUN, CUB, AWA1, AWA2 and aPY using SS = Standard Split, PS = Proposed Split with ResNet features. The results report top-1 accuracy in %.

           SUN          CUB          AWA1         AWA2         aPY
Method     SS    PS     SS    PS     SS    PS     SS    PS     SS    PS
DAP        38.9  39.9   37.5  40.0   57.1  44.1   58.7  46.1   35.2  33.8
IAP        17.4  19.4   27.1  24.0   48.1  35.9   46.9  35.9   22.4  36.6
CONSE      44.2  38.8   36.7  34.3   63.6  45.6   67.9  44.5   25.9  26.9
CMT        41.9  39.9   37.3  34.6   58.9  39.5   66.3  37.9   26.9  28.0
SSE        54.5  51.5   43.7  43.9   68.8  60.1   67.5  61.0   31.1  34.0
LATEM      56.9  55.3   49.4  49.3   74.8  55.1   68.7  55.8   34.5  35.2
ALE        59.1  58.1   53.2  54.9   78.6  59.9   80.3  62.5   30.9  39.7
DEVISE     57.5  56.5   53.2  52.0   72.9  54.2   68.6  59.7   35.4  39.8
SJE        57.1  53.7   55.3  53.9   76.7  65.6   69.5  61.9   32.0  32.9
ESZSL      57.3  54.5   55.1  53.9   74.7  58.2   75.6  58.6   34.4  38.3
SYNC       59.1  56.3   54.1  55.6   72.2  54.0   71.2  46.6   39.7  23.9
SAE        42.4  40.3   33.4  33.3   80.6  53.0   80.7  54.1    8.3   8.3
GFZSL      62.9  60.6   53.0  49.3   80.5  68.3   79.3  63.8   51.3  38.4

Our first observation is that the results on the PS are significantly lower than on the SS for AWA1 and AWA2. This is expected, as most of the test classes of AWA1 and AWA2 in the SS overlap with ImageNet 1K. On the other hand, for the fine-grained datasets CUB and SUN, the results are not significantly affected, as the overlap in that case is not as significant. Our second observation concerns the method ranking. On the SS, SYNC (Changpinyo et al., 2016) is the best performing method on the SUN (59.1%) and aPY (39.7%) datasets, whereas SJE (Akata et al., 2015c) performs best on CUB (55.3%) and SAE (Kodirov et al., 2017) performs best on AWA1 (80.6%) and AWA2 (80.7%).
On the PS, ALE (Akata et al., 2015a) performs best on SUN (58.1%) and AWA2 (62.5%), SYNC (Changpinyo et al., 2016) on CUB (55.6%), SJE (Akata et al., 2015c) on AWA1 (65.6%) and DEVISE (Frome et al., 2013) on aPY (39.8%). ALE, SJE and DEVISE all use a max-margin bi-linear compatibility learning framework, which seems to perform better than the alternatives. It is also worth noting that SYNC and SAE perform well on the SS, i.e. SYNC is the best performing model for SUN and aPY whereas SAE is for AWA1 and AWA2, while they perform significantly worse on the PS, which indicates that they do not generalize well in the zero-shot learning task.

Evaluating Robustness. We evaluate the robustness of 13 methods, i.e. (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a; Kodirov et al., 2017; Verm and Rai, 2017), to hyperparameters by setting them on 3 different validation splits while keeping the test split intact. We report results on the SS (Figure 4.3, top) and the PS (Figure 4.3, bottom) for the SUN, CUB, AWA1, AWA2 and aPY datasets.

Figure 4.3: Robustness of 10 methods evaluated on SUN, CUB, AWA1, aPY using 3 validation set splits (results are on the same test split). Top: original split, Bottom: proposed split (Image embeddings = ResNet). We measure top-1 accuracy in %.

On SUN and CUB, the results are stable across methods and across dataset splits. This is expected, as both datasets have a balanced number of images across classes and are fine-grained, so the validation splits are similar. On the other hand, aPY, being a small and coarse-grained dataset, has several issues. First, many of the test classes of aPY are included in ImageNet 1K. Second, it is not well balanced, i.e. different validation class splits contain significantly different numbers of images. Third, the class embeddings are far from each other, i.e. the objects are semantically different, so different validation splits learn a different mapping between images and classes. On AWA1 and AWA2, on the SS, DEVISE shows the largest variance. This might be due to the fact that the AWA1 and AWA2 datasets are also coarse-grained and their test classes overlap with ImageNet training classes. Indeed, since AWA2 is slightly more balanced than AWA1, DEVISE does not show such a high variance on it in the proposed split.

Visualizing Method Ranking. We first evaluate the 13 methods using three different validation splits, as in the previous experiment. We then rank them based on their per-class top-1 accuracy using the non-parametric Friedman test (Garcia and Herrera, 2008), which does not assume a distribution on performance but rather uses algorithm ranking. Each entry of the rank matrix in Figure 4.4 indicates the number of times the method is ranked at the first to thirteenth rank. We then compute the mean rank of each method and order the methods by their mean rank across datasets. Our general observation is that the highest ranked method on both splits is GFZSL; the second highest ranked method on the standard split (SS) is SYNC, while it drops to the seventh rank on the proposed split (PS). On the other hand, ALE ranks second on the SS and first on the PS.
Figure 4.4: Ranking 13 models by setting parameters on three validation splits in the standard (SS, left) and proposed (PS, right) setting. Element (i, j) indicates the number of times model i ranks at the j-th position over all 4 × 3 observations; models are ordered by their mean rank. Mean ranks on the SS: GFZSL 2.6, SYNC 3.9, ALE 4.5, DEVISE 4.8, ESZSL 5.0, SJE 5.2, LATEM 5.9, SAE 7.5, SSE 8.2, CONSE 10.1, DAP 10.2, CMT 10.3, IAP 12.8. Mean ranks on the PS: GFZSL 2.8, ALE 3.1, DEVISE 3.9, SJE 4.6, ESZSL 5.4, SSE 5.7, LATEM 5.9, SYNC 6.9, DAP 9.4, SAE 10.6, CMT 10.7, CONSE 10.8, IAP 11.3.

We reinforce our initial observation from the numerical results and conclude that GFZSL and ALE seem to be the most robust methods in the zero-shot learning setting for attribute datasets. These results also indicate the importance of choosing zero-shot splits carefully. On the PS, two of the three highest ranked methods are compatibility learning methods, i.e. ALE and DEVISE, whereas the three lowest ranked methods learn independent attribute classifiers or are hybrid methods, i.e. IAP, CMT and CONSE. Therefore, max-margin compatibility learning methods lead to consistently better results in the zero-shot learning task compared to learning independent classifiers. Finally, visualizing the method ranking in this way provides a visually interpretable summary of how the models compare across datasets.
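The rank matrices of Figure 4.4 (and later Figure 4.7) are built by ranking the methods within each dataset-and-validation-split combination and counting how often each method attains each rank; a minimal sketch with SciPy, using hypothetical accuracy arrays rather than our actual result files:

```python
import numpy as np
from scipy.stats import rankdata

def rank_matrix(acc):
    """acc: (n_methods, n_observations) per-class top-1 accuracies, one column per
    dataset/validation-split combination. Returns the rank-count matrix and the
    mean rank per method (lower mean rank = better)."""
    n_methods, n_obs = acc.shape
    # rank 1 = best accuracy within each observation; ties share the average rank
    ranks = np.stack([rankdata(-acc[:, j], method="average") for j in range(n_obs)], axis=1)
    counts = np.stack([(np.round(ranks) == r).sum(axis=1) for r in range(1, n_methods + 1)], axis=1)
    return counts, ranks.mean(axis=1)

# usage: counts[i, j] = how often method i obtained rank j + 1; order methods by mean rank
```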
Results on Our Proposed AWA2. We introduce AWA2, which has the same classes and attributes as AWA1 but contains different images, each with a public copyright license. In order to show that the AWA1 and AWA2 images are not the same but are similar in nature, we compare the zero-shot learning results on AWA1 and AWA2 in Table 4.3. Under the Standard Split (SS), SAE (Kodirov et al., 2017) is the best performing method on both AWA1 (80.6%) and AWA2 (80.7%). Similarly, for most of the methods, the results on AWA1 are close to those on AWA2; for instance, DAP obtains 57.1% on AWA1 and 58.7% on AWA2, SSE obtains 68.8% on AWA1 and 67.5% on AWA2, etc. The results under the Proposed Split (PS) are also consistent across AWA1 and AWA2: for 8 out of 12 methods, the performance difference between AWA1 and AWA2 is within 2%. On the other hand, the same consistency is not observed for DEVISE (Frome et al., 2013), SJE (Akata et al., 2015c) and SYNC (Changpinyo et al., 2016). For instance, SJE (Akata et al., 2015c) obtains 65.6% on AWA1 and 61.9% on AWA2. After careful examination, we noticed that SJE (Akata et al., 2015c) selects different hyperparameters for AWA1 and AWA2, which results in different performance on the two datasets. In our opinion, this does not indicate a dataset artifact; rather, it shows that zero-shot learning is sensitive to parameter settings.

Commonly, a model is trained and evaluated on the same dataset. Cross-dataset experiments are not easy, as different datasets do not share the same attributes; however, AWA1 and AWA2 share both classes and attributes. In order to verify that AWA2 is a good replacement for AWA1, we conduct a cross-dataset evaluation for 12 methods, i.e. (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a; Kodirov et al., 2017). In particular, with our Proposed Splits (PS), we train one model on the training set of AWA1 and evaluate it on the test set of AWA2 in the zero-shot learning setting, and vice versa.

Table 4.4: Cross-dataset evaluation over AWA1 and AWA2 in the zero-shot learning setting on the Proposed Splits. Left of the colon indicates the training set and right of the colon the test set, e.g. AWA1:AWA2 means that the model is trained on the train set of AWA1 and evaluated on the test set of AWA2. We measure top-1 accuracy in %.

Method   AWA1:AWA1  AWA1:AWA2  AWA2:AWA2  AWA2:AWA1
DAP      44.1       44.2       46.1       46.2
IAP      35.9       36.1       35.9       35.3
CONSE    45.6       46.5       44.5       43.7
CMT      39.5       40.7       37.9       37.7
SSE      60.1       61.6       61.0       59.8
LATEM    55.1       55.4       55.8       53.5
ALE      59.9       59.9       62.5       60.9
DEVISE   54.2       55.2       59.7       57.7
SJE      65.6       65.5       61.9       62.0
ESZSL    58.2       58.5       58.6       59.9
SYNC     54.0       53.7       46.6       46.9
SAE      53.0       52.4       54.1       53.1

From Table 4.4, we observe that all the models trained on AWA1 generalize well to AWA2 and vice versa. In addition, we notice that the cross-dataset result depends on the training set: for all methods, if we fix the training set to be from AWA1, the results on the test sets of AWA1 and AWA2 are close. To verify this hypothesis, we performed a paired t-test, which determines whether the mean difference between paired results is significantly higher than zero. To that end, we take the 24 pairs of results whose test sets are the same, i.e. the results obtained with the 12 methods on AWA1:AWA2 and AWA2:AWA2 (2nd and 3rd columns) as well as on AWA1:AWA1 and AWA2:AWA1 (1st and 4th columns). The paired t-test rejects the null hypothesis with p-value = 0.007, indicating that the results are significantly different if the test set is the same but the training set is different. In conclusion, the training set is an important indicator of the final result, and the two datasets, AWA1 and AWA2, are sufficiently similar; our cross-dataset results therefore indicate that AWA2 is a good replacement for AWA1.
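The paired t-test above can be reproduced directly from Table 4.4; a sketch with SciPy follows. The pairing (24 pairs that share a test set but differ in training set) follows the description in the text, and the exact p-value naturally depends on how the test is configured.

```python
import numpy as np
from scipy.stats import ttest_rel

# Columns of Table 4.4, one entry per method (DAP, IAP, ..., SAE)
a1_a1 = [44.1, 35.9, 45.6, 39.5, 60.1, 55.1, 59.9, 54.2, 65.6, 58.2, 54.0, 53.0]
a1_a2 = [44.2, 36.1, 46.5, 40.7, 61.6, 55.4, 59.9, 55.2, 65.5, 58.5, 53.7, 52.4]
a2_a2 = [46.1, 35.9, 44.5, 37.9, 61.0, 55.8, 62.5, 59.7, 61.9, 58.6, 46.6, 54.1]
a2_a1 = [46.2, 35.3, 43.7, 37.7, 59.8, 53.5, 60.9, 57.7, 62.0, 59.9, 46.9, 53.1]

# Each pair shares the test set (first 12 pairs: AWA2, last 12 pairs: AWA1)
# and differs only in the training set.
trained_on_awa1 = np.array(a1_a2 + a1_a1)
trained_on_awa2 = np.array(a2_a2 + a2_a1)
t_stat, p_value = ttest_rel(trained_on_awa1, trained_on_awa2)
```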
Zero-Shot Learning Results on ImageNet. ImageNet scales the methods to a truly large-scale setting, and these experiments therefore provide further insights into how to tackle the zero-shot learning problem from a practical point of view. Here, we evaluate 10 methods, i.e. (Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a; Kodirov et al., 2017; Verm and Rai, 2017). We exclude DAP and IAP, as attributes are not available for all ImageNet classes, as well as SSE (Zhang and Saligrama, 2015), due to scalability issues of its public implementation.

Table 4.5: ImageNet with different splits: 2/3 H = classes 2/3 hops away from the Ytr of ImageNet1K, 500/1K/5K most populated classes, 500/1K/5K least populated classes, All = the remaining 20K categories of ImageNet (Yts). We measure top-1 accuracy in %.

           Hierarchy      Most Populated          Least Populated         All
Method     2 H    3 H     500    1K     5K       500    1K     5K       20K
CONSE      7.63   2.18    12.33   8.31  3.22     3.53   2.69   1.05     0.95
CMT        2.88   0.67     5.10   3.04  1.04     1.87   1.08   0.33     0.29
LATEM      5.45   1.32    10.81   6.63  1.90     4.53   2.74   0.76     0.50
ALE        5.38   1.32    10.40   6.77  2.00     4.27   2.85   0.79     0.50
DEVISE     5.25   1.29    10.36   6.68  1.94     4.23   2.86   0.78     0.49
SJE        5.31   1.33     9.88   6.53  1.99     4.93   2.93   0.78     0.52
ESZSL      6.35   1.51    11.91   7.69  2.34     4.50   3.23   0.94     0.62
SYNC       9.26   2.29    15.83  10.75  3.42     5.83   3.52   1.26     0.96
SAE        4.89   1.26     9.96   6.57  2.09     2.50   2.17   0.72     0.56
GFZSL      1.45   −−       2.01   1.35  −−       1.40   1.11   0.13     −−

Table 4.5 shows that the best performing method is SYNC (Changpinyo et al., 2016), which may either indicate that it performs well in the large-scale setting or that it can learn under the uncertainty introduced by using Word2Vec instead of attributes. Another possibility is that Word2Vec may be tuned for SYNC, as it is provided by the same authors. However, we refrain from making a strong claim, as this would require a full evaluation of class embeddings, which is out of the scope of this chapter. On the other hand, GFZSL (Verm and Rai, 2017), the best performing model on the attribute datasets, performs poorly on ImageNet, which may indicate that generative models require a strong class embedding space, such as attributes, to perform well on the ZSL task. Note that, due to computational issues, we were not able to obtain results for GFZSL on 3H, M5K, L5K and all 20K classes.

More detailed observations are as follows. The second best performing method is ESZSL (Romera-Paredes et al., 2015), one of the linear embedding models with an implicit regularization mechanism, which seems to be more effective than early stopping as an explicit regularizer.
A general observation across all methods is that the results on the most populated classes are higher than on the least populated classes, which indicates that zero-shot learning on the fine-grained subsets of ImageNet is a more difficult task. Moreover, we conclude that the nature of the test set, e.g. the type of classes being tested, is more important than the number of classes. Therefore, the selection of the test set is an important aspect of zero-shot learning on large-scale datasets. Furthermore, for all methods we consistently observe a large drop in accuracy between the 1K and 5K most populated classes, which is expected, as the 5K split contains ≈ 6.6M images, making the problem much more difficult than the 1K split (≈ 1,624 images per class). It is worth noting that measuring per-image accuracy in this case would lead to higher results if the labels of the highly populated class samples are predicted correctly. Finally, on the largest test set, i.e. all 20K classes, the results are poor for all methods, which indicates the difficulty of this problem and leaves a large room for improvement.

Several models in the literature evaluate Top-5 and Top-10 in addition to Top-1 accuracy on ImageNet. Top-5 and Top-10 accuracy are reasonable here, as an image usually contains multiple objects but is by construction associated with a single label in ImageNet. Hence, we provide a comparison of the same 9 models according to all three criteria in Figure 4.5.

Figure 4.5: Zero-shot learning experiments on ImageNet, measuring Top-1, Top-5 and Top-10 accuracy. 2/3 H = classes 2/3 hops away from the ImageNet1K training classes (Ytr), M500/M1K/M5K denote the 500, 1K and 5K most populated classes, L500/L1K/L5K denote the 500, 1K and 5K least populated classes, All = the remaining 20K categories of ImageNet.

We observe that SYNC (Changpinyo et al., 2016) performs significantly better than the other methods when the number of images is higher, e.g. 2H, M500, M1K, whereas the gap shrinks when the number of images and the number of classes increase, e.g. 3H, L5K and All. In fact, for All, all the methods perform similarly and poorly, which indicates that there is a large room for improvement in this task; this observation holds for all three accuracy measures. For Top-5 (middle) and Top-10 (right) accuracy, although the numbers are in general higher as expected, the winning model remains SYNC, significantly so for 2H, M500 and M1K, whereas the difference is smaller for 3H, L5K and L1K. On the other hand, all methods perform similarly when all 20K classes are tested.
4.6.2 Generalized Zero-Shot Learning Results

In real-world applications, image classification systems do not know in advance whether a novel image belongs to a seen or an unseen class. Hence, generalized zero-shot learning is interesting from a practical point of view. Here, we use the same models trained in the ZSL setting on our proposed splits (PS) and evaluate performance on both Ytr and Yts (using held-out images).

Table 4.6: Generalized zero-shot learning on the Proposed Split (PS), measuring ts = top-1 accuracy on Yts, tr = top-1 accuracy on Ytr, H = harmonic mean (CMT*: CMT with novelty detection). We measure top-1 accuracy in %.

          SUN                CUB                AWA1               AWA2               aPY
Method    ts    tr    H      ts    tr    H      ts    tr    H      ts    tr    H      ts    tr    H
DAP       4.2   25.1  7.2    1.7   67.9  3.3    0.0   88.7  0.0    0.0   84.7  0.0    4.8   78.3  9.0
IAP       1.0   37.8  1.8    0.2   72.8  0.4    2.1   78.2  4.1    0.9   87.6  1.8    5.7   65.6  10.4
CONSE     6.8   39.9  11.6   1.6   72.2  3.1    0.4   88.6  0.8    0.5   90.6  1.0    0.0   91.2  0.0
CMT       8.1   21.8  11.8   7.2   49.8  12.6   0.9   87.6  1.8    0.5   90.0  1.0    1.4   85.2  2.8
CMT*      8.7   28.0  13.3   4.7   60.1  8.7    8.4   86.9  15.3   8.7   89.0  15.9   10.9  74.2  19.0
SSE       2.1   36.4  4.0    8.5   46.9  14.4   7.0   80.5  12.9   8.1   82.5  14.8   0.2   78.9  0.4
LATEM     14.7  28.8  19.5   15.2  57.3  24.0   7.3   71.7  13.3   11.5  77.3  20.0   0.1   73.0  0.2
ALE       21.8  33.1  26.3   23.7  62.8  34.4   16.8  76.1  27.5   14.0  81.8  23.9   4.6   73.7  8.7
DEVISE    16.9  27.4  20.9   23.8  53.0  32.8   13.4  68.7  22.4   17.1  74.7  27.8   4.9   76.9  9.2
SJE       14.7  30.5  19.8   23.5  59.2  33.6   11.3  74.6  19.6   8.0   73.9  14.4   3.7   55.7  6.9
ESZSL     11.0  27.9  15.8   12.6  63.8  21.0   6.6   75.6  12.1   5.9   77.8  11.0   2.4   70.1  4.6
SYNC      7.9   43.3  13.4   11.5  70.9  19.8   8.9   87.3  16.2   10.0  90.5  18.0   7.4   66.3  13.3
SAE       8.8   18.0  11.8   7.8   54.0  13.6   1.8   77.1  3.5    1.1   82.2  2.2    0.4   80.9  0.9
GFZSL     0.0   39.6  0.0    0.0   45.7  0.0    1.8   80.3  3.5    2.5   80.1  4.8    0.0   83.3  0.0

As shown in Table 4.6, the generalized zero-shot learning results are significantly lower than the zero-shot learning results. This is due to the fact that the training classes are included in the search space and act as distractors for the images that come from the test classes, i.e. most of the images being evaluated. An interesting observation is that compatibility learning frameworks, e.g. ALE, DEVISE and SJE, perform well on the test classes, whereas methods that learn independent attribute or object classifiers, e.g. DAP and CONSE, perform well on the training classes. Due to this discrepancy, we evaluate the harmonic mean of the training and test class accuracies, as defined in Equation 4.22. The harmonic mean ranks ALE as the best performing method on the SUN, CUB and AWA1 datasets, whereas DEVISE performs best on our AWA2 dataset and CMT* on aPY. Note that CMT* has an integrated novelty detection phase, for which the method receives an additional supervision signal determining whether an image belongs to a training or a test class. Similar to the ImageNet results, GFZSL (Verm and Rai, 2017) performs poorly in the GZSL setting.

For the generalized zero-shot learning setting on ImageNet, we report results measured on unseen classes, as no images are reserved from the seen classes, in Figure 4.6.

Figure 4.6: GZSL on ImageNet, measuring Top-1, Top-5 and Top-10 accuracy. 2/3H: classes 2/3 hops away from the ImageNet1K Ytr, M500/M1K/M5K: 500/1K/5K most populated classes, L500/L1K/L5K: 500/1K/5K least populated classes, All: remaining 20K classes.

Our first observation is that no single model wins in all cases; the results diverge for different splits and different accuracy measures.
For instance, when performance is measured with Top-1 accuracy, the best performing models are in general DEVISE, ALE and SJE, which are all linear compatibility learning models. For Top-5 accuracy, on the other hand, different models take the lead on different splits: CONSE works best for 3H and M5K, indicating that it performs better when the number of images from unseen classes is larger, whereas SJE and ESZSL work better for the 2H, M500 and L5K settings. For Top-10 accuracy, the best performing model overall is ESZSL, which learns a linear compatibility with an explicit regularization scheme. Finally, for the Top-1, Top-5 and Top-10 results we observe the same trend when all the unseen classes are included in the test set: the models perform similarly, although CONSE stands out slightly in the Top-5 and Top-10 accuracy plots.

In summary, the generalized zero-shot learning setting provides one more level of detail on the performance of zero-shot learning methods. Our take-home message is that the accuracy on training classes is as important as the accuracy on test classes in real-world scenarios. Therefore, methods should be designed such that they predict labels well for both training and test classes.

Visualizing Method Ranking. Similar to the analysis conducted for the zero-shot learning setting in the previous section, we rank the 13 methods, i.e. (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a; Kodirov et al., 2017; Verm and Rai, 2017), based on their results on SUN, CUB, AWA1, AWA2 and aPY. The performance is measured on seen classes, unseen classes and the harmonic mean of the two. The rank matrix of the test classes, i.e. Figure 4.7 top left, shows that the highest ranked methods are again ALE, DEVISE and SJE, although the absolute accuracy numbers are overall lower (Table 4.6). Note that in Figure 4.4 GFZSL ranked highest, which shows that GFZSL is not as strong for the GZSL task. The rank matrix of the harmonic mean shows the same trend. However, the rank matrix of the training classes, i.e. Figure 4.7 top right, shows that models that learn intermediate attribute classifiers perform well on images that come from training classes. These models, however, typically do not lead to a high accuracy on images that belong to unseen classes, as shown in Table 4.6.
Figure 4.7: Ranking 13 models on the proposed split (PS) in the generalized zero-shot learning setting. Top-Left: top-1 accuracy (T1) measured on unseen classes (ts), Top-Right: T1 measured on seen classes (tr), Bottom: T1 measured on the harmonic mean (H). Mean ranks for ts: ALE 2.2, DEVISE 2.3, SJE 4.8, LATEM 5.3, SAE 5.3, ESZSL 5.5, SYNC 5.6, SSE 7.9, CMT 9.5, IAP 9.6, DAP 10.4, CONSE 11.1, GFZSL 11.5. Mean ranks for tr: CONSE 1.9, IAP 5.2, CMT 6.0, SYNC 6.0, DAP 6.1, ALE 6.3, GFZSL 6.9, SSE 8.0, ESZSL 8.4, DEVISE 8.5, SAE 8.8, LATEM 8.9, SJE 10.1. Mean ranks for H: ALE 1.9, DEVISE 2.4, SJE 4.3, SYNC 4.7, ESZSL 5.1, LATEM 5.1, SSE 7.4, SAE 8.7, IAP 9.3, CMT 9.5, DAP 10.2, CONSE 11.1, GFZSL 11.3.

This eventually makes the harmonic mean, i.e. the overall accuracy on both training and test classes, lower. These results clearly suggest that, when evaluating generalized zero-shot learning, one should optimize not only for test class accuracy but also for training class accuracy. Our final observation from Figure 4.7 is that CMT* is better than CMT in all cases, which supports the argument that even a simple novelty detection scheme helps to improve results. However, it is important to note that the proposed novelty detection mechanism uses more supervision than classic zero-shot learning models: although the labels of the test classes are not used, knowing whether a sample comes from a seen or an unseen class is additional supervision.

4.6.3 Transductive (Generalized) Zero-Shot Learning

In contrast to previous zero-shot learning approaches that learn only with data from training classes, transductive approaches use unlabeled images from test classes. In this section, we evaluate three state-of-the-art transductive ZSL approaches, i.e. DSRL (Ye and Guo, 2017), GFZSL-tran (Verm and Rai, 2017) and ALE-tran (Akata et al., 2015a). Similar to the previous section, we evaluate these approaches on our proposed splits both in zero-shot learning, where the test-time search space is composed only of unseen classes, and in generalized zero-shot learning, where it contains both seen and unseen classes. The performance measure is per-class averaged top-1 accuracy.

Figure 4.8: Zero-shot (left) and generalized zero-shot learning (right) results in the transductive learning setting on our Proposed Split.

Our transductive learning results are presented in Figure 4.8. We observe that, in the ZSL setting, transductive learning leads to accuracy improvements, e.g. ALE-tran and GFZSL-tran outperform ALE and GFZSL respectively in almost all cases. In particular, on AWA2, GFZSL-tran achieves 78.6%, significantly improving over GFZSL (63.8%). On aPY, ALE-tran obtains 45.5% and significantly improves over ALE (37.1%) as well. Moreover, GFZSL-tran outperforms ALE-tran and DSRL on SUN, AWA1 and AWA2, whereas ALE-tran performs best on CUB and aPY. In the GZSL setting we observe a different trend:
transductive learning does not improve the results of ALE on any of the datasets. Although the GFZSL results improve significantly in the transductive setting on AWA1 and AWA2, on the other datasets GFZSL performs poorly both in the inductive and in the transductive setting.

4.7 Conclusion

In this work, we evaluated a significant number of state-of-the-art zero-shot learning methods, i.e. (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a; Kodirov et al., 2017; Verm and Rai, 2017; Ye and Guo, 2017), on several datasets, i.e. SUN, CUB, AWA1, AWA2, aPY and ImageNet, within a unified evaluation protocol, both in the zero-shot and in the generalized zero-shot setting.

Our evaluation showed that generative models and compatibility learning frameworks have an edge over learning independent object or attribute classifiers, and also over other hybrid models, in the classic zero-shot learning setting. We observed that unlabeled data of unseen classes can further improve the zero-shot learning results; it is therefore not fair to compare transductive learning approaches with inductive ones. We discovered that some standard zero-shot dataset splits wrongly treat feature learning as disjoint from the training stage, since several of their test classes are included in the ImageNet1K dataset used to train the deep neural networks that act as feature extractors. Therefore, we proposed new dataset splits, making sure that no test class of any dataset belongs to ImageNet1K. Moreover, a disjoint training and validation class split is a necessary component of parameter tuning in the zero-shot learning setting.

In addition, we introduced the new Animals with Attributes 2 (AWA2) dataset. AWA2 inherits the same 50 classes and attribute annotations from the original Animals with Attributes (AWA1) dataset, but consists of 37,322 different images with publicly available redistribution licenses. Our experimental results showed that the 12 methods we evaluated perform similarly on AWA2 and AWA1, and our statistical consistency test indicated that AWA1 and AWA2 are compatible with each other.

Finally, including training classes in the search space while evaluating the methods, i.e. generalized zero-shot learning, provides an interesting playground for future research. Although the generalized zero-shot learning accuracy obtained with the 13 models is significantly lower than their zero-shot learning accuracy, the relative performance comparison of the different models remains the same. Having noticed that some models perform well when the test set is composed only of seen classes, while others perform well when the test set is composed only of unseen classes, we proposed the harmonic mean of seen and unseen class accuracy as a unified measure of performance in the GZSL setting. The harmonic mean encourages models to perform well on both seen and unseen class samples, which is closer to a real-world setting. In summary, our work extensively evaluated the good and bad aspects of zero-shot learning while sanitizing the ugly ones.

5 Feature Generating Networks for Zero-Shot Image Classification

Contents
5.1 Introduction
5.2 Related Work
5.3 Feature Generation & Classification in ZSL
  5.3.1 Feature Generation
  5.3.2 Classification
5.4 Experiments
  5.4.1 Comparing with State-of-the-Art
  5.4.2 Analyzing f-xGAN Under Different Conditions
  5.4.3 Large-Scale Experiments
  5.4.4 Feature vs Image Generation
5.5 Conclusion
In Chapter 4, we observe that almost all zero-shot learning approaches fail to predict novel classes in the realistic generalized zero-shot learning setting. In this chapter, our goal is to develop methods that tackle generalized zero-shot learning under the benchmark proposed in Chapter 4. From a high-level point of view, we propose to learn a feature generator that synthesizes visual features for novel classes. The generated features alleviate the imbalance issue and consistently improve the zero-shot and generalized zero-shot learning results. In Chapter 6, we extend the approach introduced in this chapter by improving the generative model and incorporating unlabeled data, and we also show the effectiveness of our approach on few-shot learning tasks. Chapter 7 defines and addresses the zero-shot and few-shot learning problems in the scenario of semantic segmentation. Chapter 8 tackles the few-shot learning challenges arising in video action classification tasks.

5.1 Introduction

Deep learning has allowed to push performance considerably across a wide range of computer vision and machine learning tasks. However, almost always, deep learning requires large amounts of training data, which we lack in many practical scenarios, e.g. it is impractical to annotate all the concepts that surround us and to have enough annotated samples of each to train a deep network. Therefore, training data generation has become a hot research topic (e.g. Chawla et al., 2002; Goodfellow et al., 2014; Chen and Koltun, 2017; Reed et al., 2016c; Zhang et al., 2017a; Salimans et al., 2016).

Figure 5.1: CNN features can be extracted from: 1) real images, however in zero-shot learning we do not have access to any real images of unseen classes, 2) synthetic images, however they are not accurate enough to improve image classification performance. We tackle both of these problems and propose a novel attribute conditional feature generating adversarial network formulation, i.e. f-CLSWGAN, to generate CNN features of unseen classes.

Generative Adversarial Networks (Goodfellow et al., 2014) are particularly appealing as they allow generating realistic and sharp images conditioned, for instance, on object categories (e.g. Reed et al., 2016c; Zhang et al., 2017a). However, they do not yet generate images of sufficient quality to train deep learning architectures, as demonstrated by our experimental results.
In this work, we focus on arguably the most extreme case of lacking data, namely zero-shot learning (e.g. Lampert et al., 2013; Xian et al., 2017; Chao et al., 2016), where the task is to learn to classify when no labeled examples of certain classes are available during training. We argue that this scenario is a great testbed for evaluating the robustness and generalization of generative models. In particular, if the generator learns discriminative visual data with enough variation, the generated data should be useful for supervised learning. Hence, one contribution of this chapter is a comparison of various existing GAN models and another competing generative model, i.e. GMMN, for visual feature generation. In particular, we look into both zero-shot learning (ZSL), where the test-time search space is restricted to unseen class labels, and generalized zero-shot learning (GZSL), a more realistic scenario in which the classifier has to decide between both seen and unseen class labels at test time. In this context, we propose a novel GAN method, namely f-CLSWGAN, that generates features instead of images and is trained with a novel loss, improving over alternative GAN models.

We summarize our contributions as follows. (1) We propose a novel conditional generative model, f-CLSWGAN, that synthesizes CNN features of unseen classes by optimizing the Wasserstein distance regularized by a classification loss. (2) Across five datasets with varying granularity and sizes, we consistently improve upon the state of the art in both the ZSL and GZSL settings. We demonstrate a practical application of adversarial training and propose GZSL as a proxy task to evaluate the performance of generative models. (3) Our model is generalizable to different deep CNN features, e.g. extracted from GoogleNet or ResNet, and may use different class-level auxiliary information, e.g. sentence, attribute and word2vec embeddings.

5.2 Related Work

In this section we review some recent relevant literature on Generative Adversarial Networks, Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL).

Generative Adversarial Network. GAN (Goodfellow et al., 2014) was originally proposed as a means of learning a generative model which captures an arbitrary data distribution, such as images, from a particular domain. The input to the generator network is a "noise" vector z drawn from a latent distribution, such as a multivariate Gaussian. DCGAN (Radford et al., 2016) extends GAN by leveraging deep convolutional neural networks and providing best practices for GAN training. (Wang and Gupta, 2016) improves DCGAN by factorizing the image generation process into style and structure networks. InfoGAN (Chen et al., 2016) extends GAN by additionally maximizing the mutual information between interpretable latent variables and the generator distribution. GAN has also been extended to a conditional GAN by feeding the class label (Mirza and Osindero, 2014) or sentence descriptions (Reed et al., 2016b,c; Zhang et al., 2017a) into both the generator and the discriminator. The theory of GANs has recently been investigated in (Arjovsky and Bottou, 2017; Arjovsky et al., 2017; Gulrajani et al., 2017), where it is shown that the Jensen-Shannon divergence optimized by the original GAN leads to instability issues. To cure the unstable training of GANs, (Arjovsky et al., 2017) proposes the Wasserstein-GAN (WGAN), which optimizes an efficient approximation of the Earth Mover distance, i.e. the Wasserstein-1 distance.
While WGAN attains better theoretical properties than the original GAN, it still suffers from vanishing and exploding gradient problems due to the weight clipping used to enforce the 1-Lipschitz constraint on the discriminator. Hence, (Gulrajani et al., 2017) proposes an improved version of WGAN that enforces the Lipschitz constraint through a gradient penalty. Although those papers have demonstrated realistic-looking images, they have not applied this idea to image feature generation. In this chapter, we empirically show that images generated by the state-of-the-art GAN (Gulrajani et al., 2017) are not ready to be used as training data for learning a classifier. Hence, we propose a novel GAN architecture to directly generate CNN features that can be used to train a discriminative classifier for zero-shot learning. Combining the powerful WGAN (Gulrajani et al., 2017) loss and a classification loss which enforces the generated features to be discriminative, our proposed GAN architecture improves the original GAN (Goodfellow et al., 2014) by a large margin and has an edge over WGAN (Gulrajani et al., 2017) thanks to our regularizer.

For the zero-shot and generalized zero-shot learning literature, readers can refer to Chapter 2. In this chapter, we propose to tackle generalized zero-shot learning by generating CNN features for unseen classes via a novel GAN model. Our work is different from (Hariharan and Girshick, 2017) because they generate additional examples for data-starved classes from feature vectors alone, which is unimodal and does not generalize to unseen classes. Our work is closer to (Bucher et al., 2017), in which features are generated via GMMN (Li et al., 2015). Hence, we directly compare with them on the latest zero-shot learning benchmark (Xian et al., 2017) and show that WGAN (Arjovsky et al., 2017) coupled with our proposed classification loss can further improve over GMMN in feature generation on most datasets for both ZSL and GZSL tasks.

5.3 feature generation & classification in zsl

Existing ZSL models only see labeled data from seen classes during training, biasing the predictions towards seen classes. The main insight of our proposed model is that by feeding additional synthetic CNN features of unseen classes, the learned classifier will also explore the embedding space of unseen classes. Hence, the key to our approach is the ability to generate semantically rich CNN feature distributions conditioned on a class-specific semantic vector, e.g. attributes, without access to any images of that class. This alleviates the imbalance between seen and unseen classes, as there is no limit to the number of synthetic CNN features that our model can generate. It also allows us to directly train a discriminative classifier, i.e. a softmax classifier, even for unseen classes.

We begin by defining the problem of interest. Let S = {(x, y, c(y)) | x ∈ X, y ∈ Ys, c(y) ∈ C}, where S stands for the training data of seen classes, x ∈ R^dx denotes the CNN features, y denotes the class label in Ys = {y1, . . . , yK} consisting of K discrete seen classes, and c(y) ∈ R^dc is the class embedding, e.g. attributes, of class y that models the semantic relationships between classes. In addition, we have a disjoint class label set Yu = {u1, . . . , uL} of unseen classes, whose class embedding set U = {(u, c(u)) | u ∈ Yu, c(u) ∈ C} is available, but images and image features are missing.
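To make this data setup concrete, the short Python sketch below shows one possible way to organize the seen-class training set S and the unseen-class embedding set U; it is an illustration only, the container names are hypothetical, and the sizes are placeholder values chosen to resemble CUB (150 seen classes with 312-dim attributes, 50 unseen classes, 2048-dim ResNet features).

```python
import numpy as np

# Seen-class training set S = {(x, y, c(y))}: CNN features, labels and
# per-class embeddings (e.g. attributes) are all available.
seen = {
    "features": np.random.rand(5000, 2048).astype(np.float32),        # x in R^dx (placeholder values)
    "labels": np.random.randint(0, 150, size=5000),                    # y in Ys = {y1, ..., yK}
    "class_embeddings": np.random.rand(150, 312).astype(np.float32),   # c(y) in R^dc, one row per seen class
}

# Unseen-class set U = {(u, c(u))}: only the class labels and their
# embeddings are known; no images or image features are available.
unseen = {
    "labels": np.arange(150, 200),                                     # u in Yu, disjoint from Ys
    "class_embeddings": np.random.rand(50, 312).astype(np.float32),    # c(u) in R^dc
}
```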
Given S and U, the task of ZSL is to learn a classifier f_zsl : X → Yu, and in GZSL we learn a classifier f_gzsl : X → Ys ∪ Yu.

5.3.1 Feature Generation

In this section, we begin our discussion with Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), as they form the basis of our model. A GAN consists of a generative network G and a discriminative network D that compete in a two-player minimax game. In the context of generating image pixels, D tries to accurately distinguish real images from generated images, while G tries to fool the discriminator by generating images that are mistakable for real. Following (Mirza and Osindero, 2014), we extend GAN to a conditional GAN by including a conditional variable in both G and D. In the following we give the details of the conditional GAN variants that we develop. Our novelty lies in developing three conditional GAN variants, i.e. f-GAN, f-WGAN and f-CLSWGAN, that generate image features rather than image pixels. It is worth noting that our models are only trained with seen class data S but can also generate image features of unseen classes.

f-GAN. Given the training data S of seen classes, we aim to learn a conditional generator G : Z × C → X, which takes random Gaussian noise z ∈ Z ⊂ R^dz and a class embedding c(y) ∈ C as its inputs, and outputs a CNN image feature x̃ ∈ X of class y. Once the generator G learns to generate CNN features of real images, i.e. x, conditioned on the seen class embeddings c(y) with y ∈ Ys, it can also generate x̃ of any unseen class u via its class embedding c(u). Our feature generator f-GAN is learned by optimizing the following objective:

min_G max_D L_GAN = E[log D(x, c(y))] + E[log(1 − D(x̃, c(y)))],   (5.1)

with x̃ = G(z, c(y)). The discriminator D : X × C → [0, 1] is a multi-layer perceptron with a sigmoid function as the last layer. While D tries to maximize the loss, G tries to minimize it. Although GANs have been shown to capture complex data distributions, e.g. pixel images, they are notoriously difficult to train (Arjovsky and Bottou, 2017).

f-WGAN. We extend the improved WGAN (Gulrajani et al., 2017) to a conditional WGAN by integrating the class embedding c(y) into both the generator and the discriminator. The loss is

L_WGAN = E[D(x, c(y))] − E[D(x̃, c(y))] − λ E[(||∇_x̂ D(x̂, c(y))||_2 − 1)^2],   (5.2)

where x̃ = G(z, c(y)), x̂ = αx + (1 − α)x̃ with α ∼ U(0, 1), and λ is the penalty coefficient. In contrast to the GAN, the discriminative network here is defined as D : X × C → R, which eliminates the sigmoid layer and outputs a real value. The log in Equation 5.1 is also removed since we are not optimizing the log likelihood. Instead, the first two terms in Equation 5.2 approximate the Wasserstein distance, and the third term is the gradient penalty which enforces the gradient of D to have unit norm along straight lines between pairs of real and generated points. Again, we solve a minimax optimization problem:

min_G max_D L_WGAN.   (5.3)
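A minimal PyTorch-style sketch of the conditional f-WGAN pieces above (Eqs. 5.1-5.3) follows. The class and function names are illustrative rather than our exact implementation; the layer sizes mirror the implementation details reported in Section 5.4 (one 4096-unit hidden layer, LeakyReLU activations, ReLU output for the generator), and only the critic-side loss with the gradient penalty of Eq. 5.2 is shown.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """G(z, c): noise + class embedding -> synthetic CNN feature."""
    def __init__(self, dim_z, dim_c, dim_x, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_z + dim_c, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, dim_x), nn.ReLU())   # ReLU output: ResNet pooling units are non-negative

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=1))

class ConditionalCritic(nn.Module):
    """D(x, c): feature + class embedding -> real-valued critic score (no sigmoid)."""
    def __init__(self, dim_x, dim_c, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_c, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1))

    def forward(self, x, c):
        return self.net(torch.cat([x, c], dim=1))

def gradient_penalty(critic, x_real, x_fake, c, lam=10.0):
    """Penalty term of Eq. (5.2): unit gradient norm on interpolates x_hat."""
    alpha = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    scores = critic(x_hat, c)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(critic, G, x_real, c, dim_z):
    """Negative of Eq. (5.2), minimized with respect to the critic parameters."""
    z = torch.randn(x_real.size(0), dim_z, device=x_real.device)
    x_fake = G(z, c).detach()
    return (critic(x_fake, c).mean() - critic(x_real, c).mean()
            + gradient_penalty(critic, x_real, x_fake, c))
```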
f-CLSWGAN. f-WGAN does not guarantee that the generated CNN features are well suited for training a discriminative classifier, which is our goal. We conjecture that this issue can be alleviated by encouraging the generator to construct features that can be correctly classified by a discriminative classifier trained on the input data. To this end, we propose to minimize the classification loss over the generated features in our novel f-CLSWGAN formulation.

Figure 5.2: Our f-CLSWGAN: we propose to minimize the classification loss over the generated features and the Wasserstein distance with gradient penalty.

We use the negative log likelihood,

L_CLS = −E_{x̃∼p_x̃}[log P(y|x̃; θ)],   (5.4)

where x̃ = G(z, c(y)), y is the class label of x̃, and P(y|x̃; θ) denotes the probability of x̃ being predicted with its true class label y. The conditional probability is computed by a linear softmax classifier parameterized by θ, which is pretrained on the real features of seen classes. The classification loss can be thought of as a regularizer enforcing the generator to construct discriminative features. Our full objective then becomes

min_G max_D L_WGAN + β L_CLS,   (5.5)

where β is a hyperparameter weighting the classification loss.

5.3.2 Classification

Given c(u) of any unseen class u ∈ Yu, by resampling the noise z and then recomputing x̃ = G(z, c(u)), arbitrarily many visual CNN features x̃ can be synthesized. After repeating this feature generation process for every unseen class, we obtain a synthetic training set Ũ = {(x̃, u, c(u))}. We then learn a classifier by training either a multimodal embedding model or a softmax classifier. Our generated features allow us to train those methods on the combination of real seen class data S and generated unseen class data Ũ.

Multimodal Embedding. Many efficient zero-shot learning approaches, e.g. ALE (Akata et al., 2015a), DEVISE (Frome et al., 2013), SJE (Akata et al., 2015c), ESZSL (Romera-Paredes and Torr, 2015) and LATEM (Xian et al., 2016), learn a multimodal embedding between the image feature space X and the class embedding space C using seen class data S. With our generated features, those methods can be trained with the seen class data S together with the unseen class data Ũ to learn a more robust classifier. The embedding model F(x, c(y); W), parameterized by W, measures the compatibility score between any image feature x and class embedding c(y) pair. Given a query image feature x, the classifier searches for the class embedding with the highest compatibility via

f(x) = argmax_y F(x, c(y); W),   (5.6)

where in ZSL, y ∈ Yu and in GZSL, y ∈ Ys ∪ Yu.

Softmax. The standard softmax classifier minimizes the negative log likelihood loss,

min_θ −(1/|T|) ∑_{(x,y)∈T} log P(y|x; θ),   (5.7)

where θ ∈ R^{dx×N} is the weight matrix of a fully connected layer which maps the image feature x to N unnormalized scores, with N being the number of classes, and P(y|x; θ) = exp(θ_y^T x) / ∑_{i=1}^{N} exp(θ_i^T x). Depending on the task, T = Ũ if it is ZSL and T = S ∪ Ũ if it is GZSL. The prediction function is

f(x) = argmax_y P(y|x; θ),   (5.8)

where in ZSL, y ∈ Yu and in GZSL, y ∈ Ys ∪ Yu.

5.4 experiments

First we detail our experimental protocol, then we present (1) our results comparing our framework with the state of the art for GZSL and ZSL tasks on four challenging datasets, (2) our analysis of f-xGAN (we use f-xGAN to refer collectively to f-GAN, f-WGAN and f-CLSWGAN) under different conditions, (3) our large-scale experiments on ImageNet and (4) our comparison of image and image feature generation.
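Before turning to the experimental protocol, the sketch below puts the classification stage of Section 5.3.2 together: the classification regularizer of Eq. 5.4, the synthesize-features-per-unseen-class step, and the final softmax classifier of Eq. 5.7. It is an illustration of the recipe under simplifying assumptions (hypothetical helper names, full-batch training), not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cls_loss(pretrained_softmax, x_fake, y):
    """Eq. (5.4): negative log-likelihood of generated features under a linear
    softmax classifier that was pretrained on real seen-class features."""
    return F.cross_entropy(pretrained_softmax(x_fake), y)

def synthesize_unseen(G, unseen_embeddings, n_per_class, dim_z):
    """Section 5.3.2: resample z and recompute x_tilde = G(z, c(u)) for every
    unseen class u, yielding the synthetic training set U_tilde."""
    feats, labels = [], []
    for label, c_u in unseen_embeddings.items():            # {class id: embedding tensor}
        c = c_u.unsqueeze(0).repeat(n_per_class, 1)
        z = torch.randn(n_per_class, dim_z)
        feats.append(G(z, c).detach())
        labels.append(torch.full((n_per_class,), label, dtype=torch.long))
    return torch.cat(feats), torch.cat(labels)

def train_final_softmax(features, labels, dim_x, num_classes, epochs=50, lr=1e-3):
    """Eq. (5.7): a single linear layer trained with cross-entropy on
    T = U_tilde (ZSL) or T = S u U_tilde (GZSL)."""
    clf = nn.Linear(dim_x, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(clf(features), labels)
        loss.backward()
        opt.step()
    return clf
```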
Datasets. Caltech-UCSD-Birds 200-2011 (CUB) (Welinder et al., 2010), Oxford Flowers (FLO) (Nilsback and Zisserman, 2008) and SUN Attribute (SUN) (Patterson and Hays, 2012) are all fine-grained datasets. CUB contains 11,788 images from 200 different types of birds annotated with 312 attributes. The FLO dataset contains 8,189 images from 102 different types of flowers without attribute annotations. However, for both CUB and FLO we use the fine-grained visual descriptions collected by (Reed et al., 2016a). SUN contains 14,340 images from 717 scenes annotated with 102 attributes. Finally, Animals with Attributes (AWA) (Lampert et al., 2013) is a coarse-grained dataset with 30,475 images, 50 classes and 85 attributes. Statistics of the datasets are presented in Table 5.1. We use the zero-shot splits proposed by (Xian et al., 2017) for AWA, CUB and SUN, ensuring that none of the training classes are present in ImageNet (Deng et al., 2009), since ImageNet is used for pre-training the ResNet (He et al., 2016). For FLO, we use the standard split provided by (Reed et al., 2016a).

Dataset                              | att | stc | |Ys| + |Yu| | |Ys| (train + val) | |Yu|
CUB (Welinder et al., 2010)          | 312 | Y   | 200         | 100 + 50           | 50
FLO (Nilsback and Zisserman, 2008)   | –   | Y   | 102         | 62 + 20            | 20
SUN (Patterson and Hays, 2012)       | 102 | N   | 717         | 580 + 65           | 72
AWA (Lampert et al., 2013)           | 85  | N   | 50          | 27 + 13            | 10

Table 5.1: CUB, SUN, FLO, AWA datasets, in terms of number of attributes per class (att), availability of sentences (stc), total number of classes, number of training + validation classes (Ys) and number of test classes (Yu).

Features. As real CNN features, we extract 2048-dim top-layer pooling units of the 101-layered ResNet (He et al., 2016) from the entire image. We do not do any image pre-processing such as cropping, background subtraction, etc., or use any other data augmentation techniques. ResNet is pre-trained on ImageNet 1K and not fine-tuned. As synthetic CNN features, we generate 2048-dim CNN features using our f-xGAN model. As the class embedding, unless stated otherwise, we use per-class attributes for AWA (85-dim), CUB (312-dim) and SUN (102-dim). Furthermore, for CUB and FLO, we extract 1024-dim character-based CNN-RNN (Reed et al., 2016a) features from fine-grained visual descriptions (10 sentences per image). None of the Yu sentences are seen during the training of the CNN-RNN. We build per-class sentences by averaging the CNN-RNN features that belong to the same class.

Evaluation Protocol. At test time, in the ZSL setting, the aim is to assign an unseen class label, i.e. from Yu, to the test image, and in the GZSL setting the search space includes both seen and unseen classes, i.e. Ys ∪ Yu. We use the unified evaluation protocol proposed in (Xian et al., 2017). In the ZSL setting, the average accuracy is computed independently for each class before dividing the cumulative sum by the number of classes; i.e., we measure average per-class top-1 accuracy (T1). In the GZSL setting, we compute average per-class top-1 accuracy on seen classes (Ys) denoted as s, average per-class top-1 accuracy on unseen classes (Yu) denoted as u, and their harmonic mean, i.e. H = 2 · (s · u)/(s + u).

Implementation details. In all f-xGAN models, both the generator and the discriminator are MLPs with LeakyReLU activations. The generator consists of a single hidden layer with 4096 hidden units. Its output layer uses ReLU because we aim to learn the top max-pooling units of ResNet-101.
While the discriminator of f-GAN has 5as ImageNet is used for pre-training the ResNet (He et al., 2016) 5.4 experiments 87 Zero-Shot Learning Generalized Zero-Shot Learning CUB FLO SUN AWA CUB FLO SUN AWA Classifier FG T1 T1 T1 T1 u s H u s H u s H u s H DEVISE none 52.0 45.9 56.5 54.2 23.8 53.0 32.8 9.9 44.2 16.2 16.9 27.4 20.9 13.4 68.7 22.4 f-CLSWGAN 60.3 60.4 60.9 66.9 52.2 42.4 46.7 45.0 38.6 41.6 38.4 25.4 30.6 35.0 62.8 45.0 SJE none 53.9 53.4 53.7 65.6 23.5 59.2 33.6 13.9 47.6 21.5 14.7 30.5 19.8 11.3 74.6 19.6 f-CLSWGAN 58.4 67.4 56.5 66.9 48.1 37.4 42.1 52.1 56.2 54.1 36.7 25.0 29.7 37.9 70.1 49.2 LATEM none 49.3 40.4 55.3 55.1 15.2 57.3 24.0 6.6 47.6 11.5 14.7 28.8 19.5 7.3 71.7 13.3 f-CLSWGAN 60.8 60.8 61.3 69.9 53.6 39.2 45.3 47.2 37.7 41.9 42.4 23.1 29.9 33.0 61.5 43.0 ESZSL none 53.9 51.0 54.5 58.2 12.6 63.8 21.0 11.4 56.8 19.0 11.0 27.9 15.8 6.6 75.6 12.1 f-CLSWGAN 54.7 54.3 54.0 63.9 36.8 50.9 43.2 25.3 69.2 37.1 27.8 20.4 23.5 31.1 72.8 43.6 ALE none 54.9 48.5 58.1 59.9 23.7 62.8 34.4 13.3 61.6 21.9 21.8 33.1 26.3 16.8 76.1 27.5 f-CLSWGAN 61.5 71.2 62.1 68.2 40.2 59.3 47.9 54.3 60.3 57.1 41.3 31.1 35.5 47.6 57.2 52.0 Softmax none – – – – – – – – – – – – – – – – f-CLSWGAN 57.3 67.2 60.8 68.2 43.7 57.7 49.7 59.0 73.8 65.6 42.6 36.6 39.4 57.9 61.4 59.6 Table 5.2: ZSL measuring per-class average Top-1 accuracy (T1) on Yu and GZSL measuring u = T1 on Yu, s = T1 on Ys, H = harmonic mean (FG=feature generator, none: no access to generated CNN features, hence softmax is not applicable). f- CLSWGAN significantly boosts both the ZSL and GZSL accuracy of all classification models on all four datasets. one hidden layer with 1024 hidden units in order to stabilize the GAN training, the discriminators of f-WGAN and f-CLSWGAN have one hidden layer with 4096 hidden units as WGAN (Gulrajani et al., 2017) does not have instability issues thus a stronger discriminator can be applied here. We do not apply batch normalization our empirical evaluation showed a significant degradation of the accuracy when batch normalization is used. The noise z is drawn from a unit Gaussian with the same dimensionality as the class embedding. We use λ = 10 as suggested in (Gulrajani et al., 2017) and β = 0.01 across all the datasets. 5.4.1 Comparing with State-of-the-Art In a first set of experiments, we evaluate our f-xGAN features in both the ZSL and GZSL settings on four challenging datasets: CUB, FLO, SUN and AWA. Unless it is stated otherwise, we use att for CUB, SUN, AWA and stc for FLO (as att are not available). We compare the effect of our feature generating f-xGAN to 6 recent state-of-the-art methods (Xian et al., 2017). ZSL with f-CLSWGAN. We first provide ZSL results with our f-CLSWGAN in Table 5.2 (left). Here, the test-time search space is restricted to unseen classes Yu. First, our f-CLSWGAN in all cases improves the state of the art that is obtained without feature generation. The overall accuracy improvement on CUB is from 54.9% to 61.5%, on FLO from 53.4% to 71.2%, on SUN from 58.1% to 62.1% and on AWA from 65.6% to 69.9%, i.e. all quite significant. Another observation is that feature generation is applicable to all the multimodal embedding models and softmax. These 88chapter 5. feature generating networks for zero-shot image classification DEVISE SJE LATEM ESZSL ALE Softmax Classification Model 0 20 40 60 80 100 T op -1 A cc . ( in % ) FLO NONE f-GAN f-GMMN f-WGAN f-CLSWGAN DEVISE SJE LATEM ESZSL ALE Softmax Classification Model 0 20 40 60 80 100 T op -1 A cc . 
( in % ) CUB NONE f-GAN f-GMMN f-WGAN f-CLSWGAN Figure 5.3: Zero-shot learning results when comparing f-xGAN versions with f- GMMN as well as comparing multimodal embedding methods with softmax. results demonstrate that indeed our f-CLSWGAN generates generalizable and strong visual features of previously unseen classes. GZSL with f-CLSWGAN. Our main interest is GZSL where the test time search space contains both seen and unseen classes, Ys ∪Yu, and at test time the images come both from seen and unseen classes. Therefore, we evaluate both seen and unseen class accuracy, i.e. s and u, as well as their harmonic mean (H). The GZSL results with f-CLSWGAN in Table 5.2 (right) demonstrate that for all datasets our f-xGAN significantly improves the H-measure over the state-of-the-art. On CUB, f-CLSWGAN obtains 49.7% in H measure, significantly improving the state of the art (34.4%), on FLO it achieves 65.6% (vs. 21.9%), on SUN it reaches 39.4% (vs. 26.3%), and on AWA it achieves 59.6% (vs. 27.5%). The accuracy boost can be attributed to the strength of the f-CLSWGAN generator learning to imitate CNN features of unseen classes although not having seen any real CNN features of these classes before. We also observe that without feature generation on all models the seen class accuracy is significantly higher than unseen class accuracy, which indicates that many samples are incorrectly assigned to one of the seen classes. Feature generation through f-CLSWGAN finds a balance between seen and unseen class accuracies by improving the unseen class accuracy while maintaining the accuracy on seen classes. Furthermore, we would like to emphasize that the simple softmax classifier beats all the models and is now applicable to GZSL thanks to our CNN feature generation. This shows the true potential and generalizability of feature generation to various tasks. ZSL and GZSL with f-xGAN. The generative model is an important component of our framework. Here, we evaluate all versions of our f-xGAN and f-GMMN for it being a strong alternative. We show ZSL and GZSL results of all classification models in Figure 5.3 and Figure 5.4 respectively. We selected CUB and FLO for them 5.4 experiments 89 DEVISE SJE LATEM ESZSL ALE Softmax Classification Model 0 20 40 60 80 100 H ar m on ic M ea n (i n % ) FLO NONE f-GAN f-GMMN f-WGAN f-CLSWGAN DEVISE SJE LATEM ESZSL ALE Softmax Classification Model 0 20 40 60 80 100 H ar m on ic M ea n (i n % ) CUB NONE f-GAN f-GMMN f-WGAN f-CLSWGAN Figure 5.4: Generalized zero-shot learning results when comparing f-xGAN versions with f-GMMN as well as comparing multimodal embedding methods with softmax. being fine-grained datasets, however we provide full numerical results and plots in the supplementary which shows that our observations hold across datasets. Our first observation is that for both ZSL and GZSL settings all generative models improve in all cases over “none” with no access to the synthetic CNN features. This applies to the GZSL setting and the difference between “none” and f-xGAN is strikingly significant. Our second observation is that our novel f-CLSWGAN model is the best performing generative model in almost all cases for both datasets. Our final observation is that although f-WGAN rarely performs lower than f-GMMN, e.g. ESZL on FLO, our f-CLSWGAN which uses a classification loss in the generator recovers from it and achieves the best result among all these generative models. 
We conclude from these experiments that generating CNN features to support the classifier when there is missing data is a technique that is flexible and strong. 5.4.2 Analyzing f-xGAN Under Different Conditions In this section, we analyze f-xGAN in terms of stability, generalization, CNN archi- tecture used to extract real CNN features and the effect of class embeddings on two fine-grained datasets, namely CUB and FLO. Stability and Generalization. We first analyze how well different generative models fit the seen class data used for training. Instead of using Parzen window-based log- likelihood (Goodfellow et al., 2014) that is unstable, we train a softmax classifier with generated features of seen classes and report the classification accuracy on a held-out test set. Figure 5.5 shows the classification accuracy w.r.t the number of training epochs. On both datasets, we observe a stable training trend. On FLO, compared to the supervised classification accuracy obtained with real images, i.e. the upper bound marked with dashed line, f-GAN remains quite weak even after convergence, which indicates that f-GAN has underfitting issues. A strong alternative is f-GMMN 90chapter 5. feature generating networks for zero-shot image classification 0 100 200 300 Epoch 0 20 40 60 80 100 T op -1 A cc . ( in % ) FLO f-GAN f-GMMN f-WGAN f-CLSWGAN Real Data 0 100 200 300 Epoch 0 20 40 60 80 100 T op -1 A cc . ( in % ) CUB f-GAN f-GMMN f-WGAN f-CLSWGAN Real Data Figure 5.5: Measuring the seen class accuracy of the classifier trained on generated features of seen classes w.r.t. the training epochs (with softmax). CNN FG u s H GoogLeNet none 20.2 35.7 25.8 f-CLSWGAN 35.3 38.7 36.9 ResNet-101 none 23.7 62.8 34.4 f-CLSWGAN 43.7 57.7 49.7 Table 5.3: GZSL results with GoogLeNet vs ResNet-101 features on CUB (CNN: Deep Feature Encoder Network, FG: Feature Generator, u = T1 on Yu, s = T1 on Ys, H = harmonic mean, “none”= no generated features). leads to a significant accuracy boost while our f-WGAN and f-CLSWGAN improve over f-GMMN and almost reach the supervised upper bound. After having established that our f-xGAN leads to a stable training performance and generating highly descriptive features, we evaluate the generalization ability of the f-xGAN generator to unseen classes. Using the pre-trained model, we generate CNN features of unseen classes. We then train a softmax classifier using these synthetic CNN features of unseen classes with real CNN features of seen classes. On the GZSL task, Figure 5.6 shows that increasing the number of generated features of unseen classes from 1 to 100 leads to a significant boost of accuracy, e.g. 28.2% to 56.5% on CUB and 37.9% to 66.5% on FLO. As in the case for generating seen class features, here the ordering is f-GAN < f-WGAN < f-GMMN < f-CLSWGAN on CUB and f-GAN < f-GMMN < f-WGAN < f-CLSWGAN on FLO. With these results, we argue that if the generative model can generalize well to previously unseen data distributions, e.g. perform well on GZSL task, they have practical use in a wide range of real-world applications. Hence, we propose to quantitatively evaluate the performance of generative models on the GZSL task. 5.4 experiments 91 1 2 6 10 30 50 100 200 300 # of generated features per class 10 20 30 40 50 60 70 T op -1 A cc . ( in % ) FLO f-GAN f-GMMN f-WGAN f-CLSWGAN 1 2 6 10 30 50 100 200 300 # of generated features per class 10 20 30 40 50 60 70 T op -1 A cc . 
( in % ) CUB f-GAN f-GMMN f-WGAN f-CLSWGAN Figure 5.6: Increasing the number of generated f-xGAN features wrt unseen class accuracy (with softmax) in ZSL. C FG u s H Attribute (att) none 23.7 62.8 34.4 f-CLSWGAN 43.7 57.7 49.7 Sentence (stc) none 38.8 53.8 45.1 f-CLSWGAN 50.3 58.3 54.0 Table 5.4: GZSL results with conditioning f-xGAN with stc and att on CUB (C: Class embedding, FG: Feature Generator, u = T1 on Yu, s = T1 on Ys, H = harmonic mean, “none”= no generated features). Effect of CNN Architectures. The aim of this study is to determine the effect of the deep CNN encoder that provides real features to our f-xGAN discriminator. In Table 5.3, we first observe that with GoogLeNet features, the results are lower com- pared to the ones obtained with ResNet-101 features. This indicates that ResNet-101 features are stronger than GoogLeNet, which is expected. Besides, most importantly, with both CNN architectures we observe that our f-xGAN outperforms the “none” by a large margin. Specifically, the accuracy increases from 25.8% to 36.9% for GoogleNet features and 34.4% to 49.7% for ResNet-101 features. Those results are encouraging as they demonstrate that our f-xGAN is not limited to learning the distribution of ResNet-101 features, but also able to learn other feature distributions. Effect of Class Embeddings. The conditioning variable, i.e. class embedding, is an important component of our f-xGAN. Therefore, we evaluate two different class embeddings, per-class attributes (att) and per-class sentences (stc) on CUB as this is the only dataset that has both. In Table 5.4, we first observe that f-CLSWGAN features generated with att not only lead to a significantly higher result (49.7% vs 34.4%), s and u are much more balanced (57.7% and 43.7% vs. 62.8% and 23.7%) 92chapter 5. feature generating networks for zero-shot image classification 2H 3H M500 M1K M5K L500 L1K L5K All T o p -1 A cc . (i n % ) 0 2 4 6 8 10 12 14 16 ZSL ALE Ours 2H 3H M500 M1K M5K L500 L1K L5K All T o p -1 A cc . (i n % ) 0 1 2 3 4 5 GZSL ALE Ours Figure 5.7: ZSL and GZSL results on ImageNet (ZSL: T1 on Yu, GZSL: T1 on Yu). The splits, ResNet features and Word2Vec are provided by (Xian et al., 2017). “Ours” = feature generator: f-CLSWGAN, classifier: softmax. compared to the state-of-the-art, i.e. “none”. This is because generated CNN features help us explore the space of unseen classes whereas the state of the art learns to project images closer to seen class embeddings. Finally, f-CLSWGAN features generated with per-class stc significantly improve results over att, achieving 54.0% in H measure, and also leads to a notable u of 50.3% without hurting s (58.3%). This is due to the fact that stc leads to high quality features (Reed et al., 2016a) reflecting the highly descriptive semantic content language entails and it shows that our f-CLSWGAN is able to learn higher quality CNN features given a higher quality conditioning signal. 5.4.3 Large-Scale Experiments Our large-scale experiments follow the same zero-shot data splits of (Xian et al., 2017) and serve two purposes. First, we show the generalizability of our approach by conducting ZSL and GZSL experiments on ImageNet (Deng et al., 2009) for it being the largest-scale single-label image dataset, i.e. with 21K classes and 14M images. Second, as ImageNet does not contain att, we use as a (weak) conditioning signal Word2Vec (Mikolov et al., 2013b) to generate f-CLSWGAN features. 
Figure 5.7 shows that the softmax classifier obtains the state of the art in ZSL and GZSL on ImageNet, significantly improving over ALE (Akata et al., 2015a). These results show that our f-CLSWGAN is able to generate high quality CNN features also with Word2Vec as the class embedding. For ZSL, for instance, with the 2H split "Ours" almost doubles the performance of ALE (5.38% to 10.00%) and in one of the extreme cases, e.g. with the L1K split, the accuracy improves from 2.85% to 3.62%. For GZSL the same observations hold, i.e. the gap between ALE and "Ours" is 2.18 vs 4.38 with the 2H split and 1.21 vs 2.50 with the L1K split. Note that (Xian et al., 2017) reports the highest results with SYNC (Changpinyo et al., 2016), and "Ours" improves over SYNC as well, e.g. 9.26% vs 10.00% with 2H and 3.23% vs 3.56% with L1K.

With these results we emphasize that with a supervision as weak as a Word2Vec signal, our model is able to generate CNN features of unseen classes and operate at the ImageNet scale. This does not only hold for the ZSL setting, which discards all seen classes from the test-time search space under the assumption that the evaluated images belong to one of the unseen classes; it also holds for the GZSL setting, where no such assumption is made. Our model generalizes to previously unseen classes even when the seen classes are included in the search space, which is the most realistic setting for image classification.

Generated Data                       | CUB (u / s / H)      | FLO (u / s / H)
none                                 | 38.8 / 53.8 / 45.1   | 13.3 / 61.6 / 21.9
Image (with (Zhang et al., 2017a))   | 0.2 / 69.4 / 0.4     | 10.5 / 95.4 / 18.9
CNN feature (Ours)                   | 50.3 / 58.3 / 54.0   | 59.0 / 73.8 / 65.6

Table 5.5: Summary table (u = T1 on Yu, s = T1 on Ys, H = harmonic mean, class embedding = stc). "none": ALE with no generated features.

5.4.4 Feature vs Image Generation

As our main goal is solving the GZSL task, which suffers from the lack of visual training examples, one naturally thinks that image generation serves the same purpose. Therefore, here we compare generating images and image features for the task of GZSL. We use StackGAN (Zhang et al., 2017a) to generate 256 × 256 images conditioned on sentences. In Table 5.5, we compare GZSL results obtained with "none", i.e. an ALE model trained on real images of seen classes; "Image", i.e. image features extracted from 256 × 256 synthetic images generated by StackGAN (Zhang et al., 2017a); and "CNN feature", i.e. features generated by our f-CLSWGAN. Between "none" and "Image", although the seen class accuracy improves, the unseen class accuracy is extremely low (0.2% for CUB and 10.5% for FLO), which shows that the generated images do not generalize to unseen classes. On average, i.e. in the H measure, generating images of unseen classes leads to 0.4% accuracy on CUB and 18.9% on FLO, whereas "none" leads to 45.1% on CUB and 21.9% on FLO. Upon visual inspection, we have observed that although many images have an accurate visual appearance as birds or flowers, they lack the necessary discriminative details to be classified correctly, and the generated images are not class-consistent. On the other hand, generating CNN features leads to a significant boost of accuracy, e.g. 54.0% on CUB and 65.6% on FLO, which is clearly higher than both no generation, i.e. "none", and image generation.

We argue that image feature generation has the following advantages. First, the number of generated image features is limitless.
Second, image feature generation learns from compact invariant representations obtained by a deep network trained on a large-scale dataset such as ImageNet; therefore, the feature generating network can be quite shallow and hence computationally efficient. Third, generated CNN features are highly discriminative, i.e. they lead to a significant boost in performance for both ZSL and GZSL. Finally, image feature generation is a much easier task, as the generated data is much lower dimensional than the high quality images that would be necessary for discrimination.

5.5 conclusion

In this work, we propose f-CLSWGAN, a learning framework for feature generation followed by classification, to tackle the generalized zero-shot learning task. Our f-CLSWGAN model adapts the conditional GAN architecture that is frequently used for generating image pixels to instead generate CNN features. In f-CLSWGAN, we improve WGAN by adding a classification loss on top of the generator, enforcing it to generate features that are better suited for classification. In our experiments, we have shown that generating features of unseen classes allows us to effectively use softmax classifiers for the GZSL task.

Our framework is generalizable as it can be integrated with various deep CNN architectures, e.g. GoogLeNet and ResNet, two of the most widely used architectures. It can also be deployed with various classifiers, e.g. ALE, SJE, DEVISE, LATEM and ESZSL, which constitute the state of the art for ZSL; moreover, the GZSL accuracy improvements obtained with softmax are important, as it is a simple classifier that could not be used for GZSL before this work. Our features can be generated from different sources of class embeddings, e.g. sentences, attributes and word2vec, and applied to different datasets, i.e. CUB, FLO, SUN and AWA being fine- and coarse-grained ZSL datasets and ImageNet being a truly large-scale dataset.

Finally, based on the success of our framework, we motivated the use of the GZSL task as an auxiliary method for evaluating the expressive power of generative models, in addition to the manual inspection of generated image pixels, which is tedious and prone to errors. For instance, WGAN (Gulrajani et al., 2017) has been proposed and accepted as an improvement over GAN (Goodfellow et al., 2014), a claim supported by evaluations based on manual inspection of the images and the inception score. Our observations in Figure 5.4 and Figure 5.6 support this and follow the same ordering of the models, i.e. WGAN improves over GAN in ZSL and GZSL tasks. Hence, while not being the primary focus of this chapter, we strongly argue that ZSL and GZSL are well suited as a testbed for comparing generative models.

6 ENHANCED FEATURE GENERATION FRAMEWORKS FOR LOW-SHOT LEARNING

Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 f-VAEGAN-D2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.1 Baseline Feature Generating Models . . . . . . . . . . . . . . 99
6.3.2 Our f-VAEGAN-D2 Model . . . . . . . . . . . . . . . . . . . . 99
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.1 (Generalized) Zero-shot Learning . . . . . . . . . . . . . . . . 101
6.4.2 (Generalized) Few-shot Learning . . . . . . . . . . . . . . . . 103
6.4.3 Interpreting Synthesized Features . . . . . . . . . . . . . . . 106
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
In Chapter 5, we show that feature generation is an effective way to tackle the data imbalance issue. In this chapter, we therefore extend this idea to any-shot learning, i.e. few-shot and zero-shot learning. We improve the feature generator f-CLSWGAN of Chapter 5 in two ways. First, we combine GANs and VAEs to construct a stronger generative model. Second, our model additionally adds a discriminator that learns the marginal distribution of novel classes from their unlabeled examples. Our proposed approach achieves the state of the art on the zero-shot learning benchmark introduced in Chapter 4. The previous chapters, including this one, all concern image classification. In the next two chapters, we move our attention to more complex tasks, namely semantic segmentation in Chapter 7 and video classification in Chapter 8, in the context of zero-shot and few-shot learning.

6.1 introduction

Learning with limited labels has been an important topic of research, as it is unrealistic to collect sufficient amounts of labeled data for every object. Recently, generating visual features of previously unseen classes (e.g. Xian et al., 2018; Bucher et al., 2017; Kumar Verma et al., 2018b; Felix et al., 2018a) has shown its potential to perform well on extremely imbalanced image collections. However, current feature generation approaches still have shortcomings. First, they rely on simple generative models which are not able to capture complex data distributions. Second, in many cases, they do not truly generalize to the under-represented classes. Third, although classifiers trained on a combination of real and generated features obtain state-of-the-art results, the generated features may not be easily interpretable.

Figure 6.1: Our any-shot feature generating framework learns discriminative and interpretable CNN features from both labeled data of seen and unlabeled data of novel classes.

Our main focus in this work is a new model that generates visual features of any class, utilizing labeled samples when they are available and generalizing to unknown concepts whose labeled samples are not available. Prior work used GANs for this task (Xian et al., 2018; Felix et al., 2018a) as they directly optimize the divergence between real and generated data, but they suffer from mode collapse issues (Arjovsky and Bottou, 2017). On the other hand, feature generation with a VAE (Kumar Verma et al., 2018b) is more stable. However, the VAE optimizes a lower bound of the log likelihood rather than the likelihood itself (Kingma and Welling, 2014). Our model combines the strengths of VAEs and GANs by assembling them into a conditional feature generating model, called f-VAEGAN-D2, that synthesizes CNN image features from class embeddings, i.e. class-level attributes or word2vec (Mikolov et al., 2013b). Thanks to its additional discriminator that distinguishes real and generated features, our f-VAEGAN-D2 is able to use unlabeled data from previously unseen classes without any condition. The features learned by our model, e.g.
Figure 8.1, are disciminative in that they boost the performance of any-shot learning as well as being visually and textually interpretable. Our main contributions are as follows. (1) We propose the f-VAEGAN-D2 model that consists of a conditional encoder, a shared conditional decoder/generator, a conditional discriminator and a non-conditional discriminator. The first three networks aim to learn the conditional distribution of CNN image features given class embeddings optimizing VAE and WGAN losses on labeled data of seen classes. The 6.2 related work 97 Encoder (E) D ec od er /G en er at or (G ) Cape May Warbler Discriminator1 (D1) Discriminator2 (D2) VAE GAN D2 D2 f-WGAN f-VAE Figure 6.2: Our any-shot feature generating network (f-VAEGAN-D2) consist of a feature generating VAE (f-VAE), a feature generating WGAN (f-WGAN) with a conditional discriminator (D1) and a transductive feature generator with a non- conditional discriminator (D2) that learns from both labeled data of seen classes and unlabeled data of novel classes. last network learns the marginal distribution of CNN image features on the unlabeled features of novel classes. Once trained, our model synthesizes discriminative image features that can be used to augment softmax classifier training. (2) Our empirical analysis on CUB, AWA2, SUN, FLO, and large-scale ImageNet shows that our generated features improve the state-of-the-art in low-shot regimes, i.e. (generalized) zero- and few shot learning in both the inductive and transductive settings. (3) We demonstrate that our generated features are interpretable by inverting them back to the raw pixel space and by generating visual explanations. 6.2 related work In this section, we discuss related works on generative models. We will not repeat the zero-shot and few-shot learning works that have been discussed in Chapter 2. Generative Models. Generative modeling aims to learn the probability distribution of data points such that we can randomly sample data from it that can be used as a data augmentation mechanism. Generative Adversarial Networks (GANs)(Goodfellow et al., 2014; Mirza and Osindero, 2014; Radford et al., 2016) consist of a generator that synthesizes fake data and a discriminator that distinguishes fake and real data. The instable training issues of GANs have been studied by (Gulrajani et al., 2017; Arjovsky and Bottou, 2017; Miyato et al., 2018). An interesting application of GANs is CycleGAN (Zhu et al., 2017) that translates an image from one domain to another domain. (Reed et al., 2016c) generates natural images from text descriptions, and SRGAN(Ledig et al., 2017) solves single image super-resolution. Variational Autoen- coder (VAE) (Kingma and Welling, 2014) employs an encoder that represents the input as a latent variable with Gaussian distribution assumption and a decoder that 98chapter 6. enhanced feature generation frameworks for low-shot learning reconstructs the input from the latent variable. GMMN (Li et al., 2015) optimizes the maximum mean discrepancy (MMD) (Gretton et al., 2007) between real and gener- ated distribution. Recently, generative models (Bucher et al., 2017; Zhu et al., 2018b; Kumar Verma et al., 2018b; Xian et al., 2018) have been applied to solve generalized zero-shot learning by synthesizing CNN features of unseen classes from semantic embeddings. 
Among those, (Bucher et al., 2017) uses GMMN (Li et al., 2015), (Zhu et al., 2018b; Xian et al., 2018) use GANs (Goodfellow et al., 2014), and (Kumar Verma et al., 2018b) employs a VAE (Kingma and Welling, 2014). Our model combines the advantages of both VAE and GAN with an additional discriminator that uses unlabeled data of unseen classes, which leads to more discriminative features.

6.3 f-vaegan-d2 model

Existing models that operate in sparse data regimes are either trained with labeled data from a set of classes which is disjoint from the set of classes at test time, i.e. the inductive zero-shot setting (e.g. Lampert et al., 2013; Frome et al., 2013), or with samples that can come from all classes but whose labels are not known, i.e. the transductive zero-shot setting (e.g. Fu et al., 2015a; Rohrbach et al., 2013). Recent works (e.g. Xian et al., 2018; Kumar Verma et al., 2018b; Felix et al., 2018a) address generalized zero-shot learning by generating synthetic CNN features of unseen classes followed by training softmax classifiers, which alleviates the imbalance between seen and unseen classes. However, we argue that those feature generating approaches are not expressive enough to capture the complicated feature distributions encountered in the real world. In addition, since they have no access to any real unseen class features, there is no guarantee on the quality of the generated unseen class features. As shown in Figure 6.2, we propose to enhance the feature generator by combining a VAE and a GAN with a shared decoder and generator, and by adding another discriminator (D2) that distinguishes real from generated features without applying any condition. Intuitively, in the transductive zero-shot setting, by feeding in real unlabeled features of unseen classes, D2 is able to learn the manifold of unseen classes such that more realistic features can be generated. Hence, the key to our approach is the ability to generate semantically rich CNN feature distributions, which generalizes to any-shot learning scenarios ranging from (generalized) zero-shot to (generalized) few-shot to (generalized) many-shot learning.

Setup. We are given a set of images X = {x1, . . . , xl} ∪ {xl+1, . . . , xt} encoded in the image feature space X, a seen class label set Ys, and a novel class label set Yn, a.k.a. the unseen class label set Yu in the zero-shot learning literature. The set of class embeddings C = {c(y) | y ∈ Ys ∪ Yn} is encoded in the semantic embedding space C that defines high-level semantic relationships between classes. The first l points x_s (s ≤ l) are labeled with one of the seen classes y_s ∈ Ys and the remaining points x_n (l + 1 ≤ n ≤ t) are unlabeled, i.e. they may come from seen or novel classes. In the inductive setting, the training set contains only labeled samples of seen class images, i.e. {x1, . . . , xl}. On the other hand, in the transductive setting, the training set contains both labeled and unlabeled samples, i.e. {x1, . . . , xl, xl+1, . . . , xt}. For both the inductive and transductive settings the inference is the same. In zero-shot learning, the task is to predict the label of those unlabeled points that belong to novel classes, i.e. f_zsl : X → Yn, while in generalized zero-shot learning the goal is to classify unlabeled points that can come either from seen or from novel classes, i.e. f_gzsl : X → Ys ∪ Yn. Few-shot and generalized few-shot learning are defined similarly.
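As a small illustration of the difference between the two training regimes (not part of the original formulation; the sizes and values below are placeholders), the inductive setting uses only the first l labeled points, whereas the transductive setting additionally exposes the unlabeled tail of X to the model:

```python
import numpy as np

rng = np.random.default_rng(0)
t, l, dim_x = 1000, 600, 2048                        # t points in total, the first l are labeled
features = rng.random((t, dim_x), dtype=np.float32)  # X = {x_1, ..., x_t} (placeholder values)
seen_labels = rng.integers(0, 150, size=l)           # y_s in Ys for the first l points only

# Inductive setting: train only on the labeled seen-class samples {x_1, ..., x_l}.
inductive_train = (features[:l], seen_labels)

# Transductive setting: additionally use the unlabeled samples {x_{l+1}, ..., x_t},
# which may come from seen or novel classes; these later feed the non-conditional D2.
transductive_train = (features[:l], seen_labels, features[l:])
```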
Our framework can be thought of as a data augmentation scheme where arbitrarily many synthetic features of sparsely populated classes aid in improving the discriminative power of classifiers. In the following, we only detail our feature generating network structure, as the classifier is unconstrained (we use linear softmax classifiers).

6.3.1 Baseline Feature Generating Models

In feature generating networks (f-WGAN) (Xian et al., 2018), the generator G(z, c) generates a CNN feature x̃ in the input feature space X from random noise z and a condition c, and the discriminator D(x, c) takes as input a pair of an input feature x and a condition c and outputs a real value, optimizing

L^s_WGAN = E[D(x, c)] − E[D(x̃, c)] − λ E[(||∇_x̂ D(x̂, c)||_2 − 1)^2],   (6.1)

where x̃ = G(z, c) is the generated feature, x̂ = αx + (1 − α)x̃ with α ∼ U(0, 1), and λ is the penalty coefficient.

The feature generating VAE (Kingma and Welling, 2014) (f-VAE) consists of an encoder E(x, c), which encodes an input feature x and a condition c to a latent variable z, and a decoder Dec(z, c), which reconstructs the input x from the latent z and the condition c, optimizing

L^s_VAE = KL(q(z|x, c) || p(z|c)) − E_{q(z|x,c)}[log p(x|z, c)],   (6.2)

where the conditional distribution q(z|x, c) is modeled by E(x, c), p(z|c) is assumed to be N(0, 1), KL is the Kullback-Leibler divergence, and p(x|z, c) is given by Dec(z, c).
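A minimal PyTorch-style sketch of this conditional f-VAE baseline is given below. The class and function names are illustrative; the layer width mirrors the 4096-unit MLPs used elsewhere in this thesis, and the reconstruction term is binary cross-entropy under the assumption, made explicit in the implementation details of Section 6.3.2, that input features are rescaled to [0, 1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    """f-VAE baseline (Eq. 6.2): a Gaussian encoder E(x, c) and a decoder Dec(z, c)
    that reconstructs the CNN feature from the latent z and the class embedding c."""
    def __init__(self, dim_x, dim_c, dim_z, hidden=4096):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_x + dim_c, hidden), nn.LeakyReLU(0.2))
        self.mu, self.logvar = nn.Linear(hidden, dim_z), nn.Linear(hidden, dim_z)
        self.dec = nn.Sequential(
            nn.Linear(dim_z + dim_c, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, dim_x), nn.Sigmoid())   # features assumed rescaled to [0, 1]

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(torch.cat([z, c], dim=1)), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    """Eq. (6.2): KL(q(z|x,c) || N(0, I)) plus a reconstruction term
    (binary cross-entropy here, assuming [0, 1]-rescaled features)."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    rec = F.binary_cross_entropy(x_rec, x, reduction='none').sum(dim=1).mean()
    return kl + rec
```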
6.3.2 Our f-VAEGAN-D2 Model

It has been shown that ensembling a VAE and a GAN leads to better image generation results (Larsen et al., 2016). We hypothesize that the VAE and the GAN learn complementary information for feature generation as well. This is likely when the target data follows a complicated multi-modal distribution where the two losses are able to capture different modes of the data.

To combine f-VAE and f-WGAN, we introduce an encoder E(x, c) : X × C → Z, which encodes a pair of feature and class embedding to a latent representation, and a discriminator D1 : X × C → R, which maps such a pair to a compatibility score, optimizing

L^s_VAEGAN = L^s_VAE + γ L^s_WGAN,   (6.3)

where the generator G(z, c) of the GAN and the decoder Dec(z, c) of the VAE share the same parameters. The superscript s indicates that the loss is applied to feature and class embedding pairs of seen classes. γ is a hyperparameter controlling the weighting of the VAE and GAN losses.

Furthermore, when unlabeled data of novel classes becomes available, we propose to add a non-conditional discriminator D2 (the D2 in f-VAEGAN-D2) which distinguishes between real and generated features of novel classes. This way, D2 learns the feature manifold of novel classes. Formally, our additional non-conditional discriminator D2 : X → R distinguishes real and synthetic unlabeled samples using a WGAN loss:

L^n_WGAN = E[D2(x_n)] − E[D2(x̃_n)] − λ E[(||∇_x̂n D2(x̂_n)||_2 − 1)^2],   (6.4)

where x̃_n = G(z, c(y_n)) with y_n ∈ Yn, and x̂_n = αx_n + (1 − α)x̃_n with α ∼ U(0, 1). Since L^s_WGAN is trained to learn CNN features using labeled data conditioned on class embeddings of seen classes, and class embeddings encode shared properties across classes, we expect these CNN features to be transferable across seen and novel classes. However, this heavily relies on the quality of the semantic embeddings and suffers from domain shift problems. Intuitively, L^n_WGAN captures the marginal distribution of CNN features and provides useful signals of novel classes to generate transferable CNN features. Hence, our unified f-VAEGAN-D2 model optimizes the following objective function:

min_{G,E} max_{D1,D2} L^s_VAEGAN + L^n_WGAN.   (6.5)

Implementation Details. Our generator (G) and discriminators (D1 and D2) are implemented as multilayer perceptrons (MLPs). The random Gaussian noise z ∼ N(0, 1) and the class embedding c(y) are concatenated and fed into the generator, which is composed of 2 fully connected layers with 4096 hidden units. We find that setting the noise dimension dz equal to dc, the dimension of the class embeddings, works well. Similarly, the discriminators take as input the concatenation of the image feature and class embedding and have 2 fully connected layers with 4096 hidden units. We use LeakyReLU as the nonlinear activation function except for the output layer of G, for which a sigmoid is used because we apply a binary cross-entropy loss as L_REC and the input features are rescaled to be in [0, 1]. We find that β = 1 and γ = 1000 work well across all the datasets. The gradient penalty coefficient is set to λ = 10 and the generator is updated every 5 discriminator iterations, as suggested in the WGAN paper (Arjovsky et al., 2017). As for the optimization, we use the Adam optimizer with a constant learning rate of 0.001 and early stopping on the validation set.

Setting      | Model   | ZSL  | GZSL
Inductive    | GAN     | 59.1 | 52.3
Inductive    | VAE     | 58.4 | 52.5
Inductive    | VAE-GAN | 61.0 | 53.7
Transductive | GAN     | 67.3 | 61.6
Transductive | VAE     | 68.9 | 59.6
Transductive | VAE-GAN | 71.1 | 63.2

Table 6.1: Ablating different generative models on CUB (using attribute class embeddings and image features with no fine-tuning). ZSL: top-1 accuracy on unseen classes, GZSL: harmonic mean of seen and unseen class accuracies.

6.4 experiments

In this section, we validate our approach in both zero-shot and few-shot learning. The details of the settings are provided in their respective sections.

6.4.1 (Generalized) Zero-shot Learning

We validate our model on five widely-used datasets for zero-shot learning, i.e. Caltech-UCSD-Birds (CUB) (Welinder et al., 2010), Oxford Flowers (FLO) (Nilsback and Zisserman, 2008), SUN Attribute (SUN) (Patterson and Hays, 2012) and Animals with Attributes2 (AWA2) (Xian et al., 2019b). Among those, CUB, FLO and SUN are medium-scale, fine-grained datasets. AWA2, on the other hand, is a coarse-grained dataset. Finally, we also evaluate our model on ImageNet (Deng et al., 2009), with more than 14 million images and 21K classes, as a large-scale and fine-grained dataset. We follow the exact ZSL and GZSL splits as well as the evaluation protocol of (Xian et al., 2019b), and for fair comparison we use the same image and class embeddings for all models. Briefly, image features (with no image cropping or flipping) are extracted from the 2048-dim top pooling units of a 101-layer ResNet pretrained on ImageNet 1K. For comparative studies, we also fine-tune ResNet-101 on the seen class images of each dataset. As for class embeddings, unless otherwise specified, we use class-level attributes for CUB (312-dim), AWA2 (85-dim) and SUN (102-dim). For CUB and FLO, we also extract 1024-dim sentence embeddings of the character-based CNN-RNN model (Reed et al., 2016a) from fine-grained visual descriptions (10 sentences per image).

Ablation study. We ablate our model with respect to the generative model, i.e. using GAN, VAE or VAE-GAN, in both the inductive and transductive settings. Our conclusions from Table 6.1 are as follows. In the inductive setting, VAE-GAN has an edge over both GAN and VAE, i.e. 61.0% vs 59.1% and 58.4% in the ZSL setting.
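Before turning to the full comparison tables, here is a minimal PyTorch-style sketch of a single generator/encoder update for the combined objective in Eq. 6.5. It assumes pre-built modules E, G, D1 and D2 with the signatures described in the implementation details above (hypothetical names), treats the reconstruction as binary cross-entropy on [0, 1]-rescaled features, and omits the discriminator updates and gradient penalties, which follow the pattern shown in Chapter 5.

```python
import torch
import torch.nn.functional as F

def generator_step(G, E, D1, D2, x_seen, c_seen, x_novel_unlab, c_novel_pool,
                   dim_z, gamma=1000.0):
    """One generator/encoder update for Eq. (6.5), given:
    E(x, c) -> (mu, logvar), G(z, c) -> feature, D1(x, c) -> score (conditional),
    D2(x) -> score (non-conditional). Sketch only, not the exact implementation."""
    # --- L^s_VAEGAN on labeled seen-class data (Eqs. 6.2 and 6.1) ---
    mu, logvar = E(x_seen, c_seen)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)        # reparameterization
    x_rec = G(z, c_seen)                                           # decoder == generator
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    rec = F.binary_cross_entropy(x_rec, x_seen, reduction='none').sum(dim=1).mean()
    z_p = torch.randn(x_seen.size(0), dim_z, device=x_seen.device)
    adv_seen = -D1(G(z_p, c_seen), c_seen).mean()                  # generator fools conditional D1
    loss_seen = kl + rec + gamma * adv_seen

    # --- generator part of L^n_WGAN on unlabeled novel-class data (Eq. 6.4) ---
    z_n = torch.randn(x_novel_unlab.size(0), dim_z, device=x_novel_unlab.device)
    idx = torch.randint(0, c_novel_pool.size(0), (x_novel_unlab.size(0),))
    adv_novel = -D2(G(z_n, c_novel_pool[idx])).mean()              # generator fools non-conditional D2

    return loss_seen + adv_novel
```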
enhanced feature generation frameworks for low-shot learning Zero-Shot Learning Generalized Zero-Shot Learning CUB FLO SUN AWA CUB FLO SUN AWA Method T1 T1 T1 T1 u s H u s H u s H u s H IND ALE 54.9 48.5 58.1 59.9 23.7 62.8 34.4 13.3 61.6 21.9 21.8 33.1 26.3 16.8 76.1 27.5 CLSWGAN 57.3 67.2 60.8 68.2 43.7 57.7 49.7 59.0 73.8 65.6 42.6 36.6 39.4 57.9 61.4 59.6 SE-GZSL 59.6 - 63.4 69.2 41.5 53.3 46.7 - - - 40.9 30.5 34.9 58.3 68.1 62.8 Cycle-CLSWGAN 58.6 70.3 59.9 66.8 47.9 59.3 53.0 61.6 69.2 65.2 47.2 33.8 39.4 59.6 63.4 59.8 Ours 61.0 67.7 64.7 71.1 48.4 60.1 53.6 56.8 74.9 64.6 45.1 38.0 41.3 57.6 70.6 63.5 Ours-finetuned 72.9 70.4 65.6 70.3 63.2 75.6 68.9 63.3 92.4 75.1 50.1 37.8 43.1 57.1 76.1 65.2 TRAN ALE-tran 54.5 48.3 55.7 70.7 23.5 45.1 30.9 13.6 61.4 22.2 19.9 22.6 21.2 12.6 73.0 21.5 GFZSL 50.0 85.4 64.0 78.6 24.9 45.8 32.2 21.8 75.0 33.8 0.0 41.6 0.0 31.7 67.2 43.1 DSRL 48.7 57.7 56.8 72.8 17.3 39.0 24.0 26.9 64.3 37.9 17.7 25.0 20.7 20.8 74.7 32.6 UE-finetune 72.1 - 58.3 79.7 74.9 71.5 73.2 - - - 33.6 54.8 41.7 93.1 66.2 77.4 Ours 71.1 89.1 70.1 89.8 61.4 65.1 63.2 78.7 87.2 82.7 60.6 41.9 49.6 84.8 88.6 86.7 Ours-finetuned 82.6 95.4 72.6 89.3 73.8 81.4 77.3 91.0 97.4 94.1 54.2 41.8 47.2 86.3 88.7 87.5 Table 6.2: Comparing with the-state-of-the-art. Top: inductive methods (IND), Bottom: transductive methods (TRAN). Fine tuning is performed only on seen class images as this does not violate the zero-shot condition. We measure top-1 accuracy (T1) in ZSL setting, Top-1 accuracy on seen (s) and unseen (s) classes as well as their harmonic mean (H) in GZSL setting. Adding unlabeled samples to the training set, i.e. transductive learning setting, is beneficial for all the generative models. As in the inductive setting VAE and GAN achieve similar results, i.e 67.3% and 68.9% for ZSL. Our VAE-GAN model leads to the state-of-the-art results, i.e. 71.1% in ZSL and 63.2% in GZSL confirming that VAE and GAN learn complementary representations. As VAE-GAN gives the highest accuracy in all settings, it is employed in all remaining results of the chapter. Comparing with the state-of-the-art. In Table 6.2 we compare our model with the best performing recent methods on four zero-shot learning datasets on ZSL and GZSL settings. In the inductive ZSL setting, our model both with and without fine-tuning outperforms the state-of-the art for all datasets. Our model with fine-tuned features establishes the new state-of-the-art, i.e. 72.9% on CUB, 70.4% on FLO, 65.6% on SUN and 70.3% on AWA. For the transductive ZSL setting, our model without fine-tuning on CUB is surpassed by UE-finetune of (Song et al., 2018), i.e. 71.1% vs 72.1%. However, when we also fine-tune our features, we establish the new state-of-the-art on the transductive ZSL setting as well, i.e. 82.6% on CUB, 95.4% on FLO, 72.6% on SUN and 89.3% on AWA. In the GZSL setting, we observe that feature generating methods, i.e. our model, CLSWGAN (Xian et al., 2018), SE-GZSL (Kumar Verma et al., 2018b), Cycle- CLSWGAN (Felix et al., 2018a) achieve better results than others. This is due to the fact that data augmentation through feature generation leads to a more balanced data distribution such that the learned classifier is not biased to seen classes. Note that although UE (Song et al., 2018) is not a feature generating method, it leads to strong results as this model uses additional information, i.e. it assumes that unlabeled test samples always come from unseen classes. 
Nevertheless, our model 6.4 experiments 103 2H 3H M500 M1K M5K L500 L1K L5K All 0 2 4 6 8 10 12 14 16 T o p -1 A cc . (i n % ) CONSE CMT LATEM ALE DEVISE SJE ESZSL SYNC SAE 2H 3H M500 M1K M5K L500 L1K L5K All 0 0.5 1 1.5 2 2.5 3 T o p -1 A cc . (i n % ) CONSE CMT LATEM ALE DEVISE SJE ESZSL SYNC SAE Figure 6.3: Top-1 ZSL results on ImageNet. We follow the splits in (Xian et al., 2019b) and compare our results with the state-of-the-art feature generating model CLSWGAN (Xian et al., 2018). with fine-tuning leads to 77.3% harmonic mean (H) on CUB, 94.1% H on FLO, 47.2% H on SUN and 87.5% H on AWA achieving significantly higher results than all the prior works. Large-scale experiments. Although most of the prior work presented in Table 6.2 has not been evaluated in ImageNet, this dataset serves a challenging and interesting test bed for (G)ZSL research. Hence, we compare our model with CLSWGAN (Xian et al., 2018) on ImageNet using the same evaluation protocol. As shown in Figure 6.3 our model significantly improves over the state-of-the-art in both ZSL and GZSL settings in 2H, 3H and All splits determined by considering the classes 2 hops or 3 hops away from 1000 classes of Imagenet as well as all the remaining classes. These experiments are important for two reasons. First, they show that our feature generation model is scalable to the largest scale setting available. Second, our model is applicable to the situations even when human annotated attributes are not available, i.e. for ImageNet classes attributes are not available hence we use per-class word2vec representations. 6.4.2 (Generalized) Few-shot Learning In few-shot or low-shot learning scenarios, classes are divided into base classes that have a large number of labeled training samples and novel classes that contain only few labeled samples per category. In the plain FSL setting, the goal is to achieve good performance on novel classes whereas in GFSL setting good performance must generalize to all classes. Among the classic ZSL datasets, CUB has been used for few-shot learning in (Qi et al., 2018) by taking the first 100 classes as base classes and the rest as novel classes. However, as ImageNet 1K contains some of those novel classes and feature extractors 104chapter 6. enhanced feature generation frameworks for low-shot learning # training samples per class 1 2 5 10 20 T o p -1 A cc . (i n % ) 40 50 60 70 80 90 CUB Ours-tran Ours-ind Imprint[42] Softmax Analogy[20] ALE-tran[57] # training samples per class 1 2 5 10 20 T o p -1 A cc . (i n % ) 70 75 80 85 90 95 100 FLO Ours-tran Ours-ind Imprint[42] Softmax Analogy[20] ALE-tran[57] Figure 6.4: Few-Shot Learning (FSL) results on CUB and FLO with increasing number of training samples per novel class. We report the top-1 accuracy on novel classes. are pretrained on it, we use the class splits from the standard ZSL setting, i.e. 150 base and 50 novel. For FLO we also follow the same class splits as in ZSL. As for features, we use the same fine-tuned ResNet-101 features and attribute class embeddings used in zero-shot learning experiments. For fairness, we repeat all the experiments for (Qi et al., 2018) and (Hariharan and Girshick, 2017) with the same image features. Comparing with the state-of-the-art. As shown in Figure 6.4 and Figure 6.5, both for FSL and GFSL settings and for both datasets, both our inductive and transductive models have a significant edge over all the competing methods when the number of samples from novel classes is small, e.g. 1,2 and 5. 
This shows that our model generates highly discriminative features even with only few real samples are present. In fact, only with one real sample per class, our model achieves almost the full accuracy obtained with 20 samples per class. Going towards the full supervised learning, e.g. with 10 or 20 samples per class, all methods perform similarly. This is expected since in the setting where a large number of labeled samples per class is available, then a simple softmax classifier that uses real ResNet-101 features achieves the state-of-the-art. In the inductive FSL setting, our model that uses one labeled sample per class reaches the accuracy as softmax that uses five samples per class. In the transductive FSL setting, our model that uses one labeled sample per class reaches the accuracy of softmax obtained with 10 samples per class. Furthermore, the inductive GFSL setting, our model with two samples per class achieves the same accuracy as softmax trained with ten samples per class on CUB. In the transductive GFSL setting, for FLO, for our model only one labeled sample is enough to reach the accuracy obtained with 20 labeled samples with softmax. Note that the same behavior is observed on SUN and AWA as well. Due to space restrictions we present them in the supplementary material. 6.4 experiments 105 # training samples per class 1 2 5 10 20 T o p -1 A cc . (i n % ) 45 50 55 60 65 70 75 80 85 CUB Ours-tran Ours-ind Imprint[42] Softmax Analogy[20] ALE-tran[57] # training samples per class 1 2 5 10 20 T o p -1 A cc . (i n % ) 60 65 70 75 80 85 90 95 100 FLO Ours-tran Ours-ind Imprint[42] Softmax Analogy[20] ALE-tran[57] Figure 6.5: Generalized Few-Shot Learning (GFSL) results on CUB and FLO with increasing number of training samples per novel class. We report the top-1 accuracy on all classes. Large-scale experiments. Regarding few-shot learning results on ImageNet, we follow the procedure in (Hariharan and Girshick, 2017) where 1K ImageNet cat- egories are randomly divided into 389 base and 611 novel classes. To facilitate cross validation, base classes are further split into C1base (193 classes) and C 2 base (196 classes), and novel classes into C1novel (300 classes) and C 2 novel (311 classes). The cross validation of hyperparameters is performed on C1base and C 1 novel and the final results are reported on C2base and C 2 novel . Here, we extract image features from the ResNet-50 pretrained on C1base ∪ C 2 base, which is provided by the benchmark (Hariharan and Girshick, 2017). Since there is no attribute annotation on ImageNet, we use 300-dim word2vec (Mikolov et al., 2013b) embeddings as the class embedding. Following (Wang et al., 2018c), we measure the averaged top-5 accuracy on test examples of novel classes with the model restricted to only output novel class labels, and the averaged top-5 accuracy on test examples of all classes with the model that predicts both base and novel classes. Our baselines are PMN w/G* (Wang et al., 2018c) combining meta-learning and feature generation, analogy generator (Hariharan and Girshick, 2017) learning an analogy-based feature generator and softmax classifier learned with uniform class sampling. For, few-shot learning results in Figure 6.6(left), we observe that our model in the transductive setting, i.e. Ours-tran improves the state-of-the-art PMN w/G* (Wang et al., 2018c) significantly when the number of training samples is small, i.e. 1,2 and 5. Notably, we achieve 60.6% vs 54.7% state-of-the art at 1 shot, 70.3 vs 66.8% at 2 shots. 
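The two evaluation modes used above, i.e. top-5 accuracy with the label space restricted to the 311 novel classes (FSL) or opened to all classes (GFSL), amount to masking the classifier scores before ranking; a minimal sketch with illustrative names:

import torch

def top5_accuracy(logits, targets, allowed_classes=None):
    # Averaged top-5 accuracy; `allowed_classes` restricts the label space,
    # e.g. to the novel classes only (FSL) or to all classes (GFSL).
    if allowed_classes is not None:
        mask = torch.full_like(logits, float('-inf'))
        mask[:, allowed_classes] = 0.0
        logits = logits + mask
    top5 = logits.topk(5, dim=1).indices                  # (N, 5)
    hits = (top5 == targets.unsqueeze(1)).any(dim=1)      # (N,)
    return hits.float().mean().item()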
This indicates that our model generates highly discriminative features by leveraging unlabeled data and word embeddings. In the challenging generalized few-shot learning setting (Figure 6.6 right), although PMN /G* (Wang et al., 2018c) is quite strong by applying meta-learning (Snell et al., 2017), our model still achieves comparable results with the state-of-the-art. It is also worth noting that PMN w/G* (Wang et al., 2018c) cannot be directly applied to zero-shot learning. 106chapter 6. enhanced feature generation frameworks for low-shot learning # training samples per class 1 2 5 10 20 T o p -5 A cc . (i n % ) 20 30 40 50 60 70 80 90 FSL Ours-tran Ours-ind PMN w/G*[55] Softmax Analogy[20] # training samples per class 1 2 5 10 20 T o p -5 A cc . (i n % ) 20 30 40 50 60 70 80 90 GFSL Ours-tran Ours-ind PMN w/G*[55] Softmax Analogy[20] Figure 6.6: Few Shot Learning results on ImageNet with increasing number of training samples per novel class (Top-5 Accuracy). Left: FSL setting, Right: GFSL setting. Hence, our approach is more versatile. 6.4.3 Interpreting Synthesized Features In this section, we show that our generated features on FLO are visually discrimina- tive and textually explainable. Visualising generated features. A number of methods (Dosovitskiy and Brox, 2016a; Mahendran and Vedaldi, 2015; Dosovitskiy and Brox, 2016b) have explored strategies to generate images by inverting feature embeddings. We follow a strategy similar to (Dosovitskiy and Brox, 2016a) and train a deep upconvolutional neural network to invert feature embeddings to the image pixel space. We impose a L1 loss between the ground truth image and the inverted image, as well as a perceptual loss, by passing both images through a pre-trained Resnet101, and taking an L2 loss on the feature vectors at conv5 4 and average pooling layers. We also utilize an adversarial loss, by feeding the image and feature embedding to a discriminator, to improve our image quality. Our generator consists of a fully connected layer followed by 5 upconvolutional blocks. Each upconvolutional block contains an Upsampling layer, a 3x3 convolution, BatchNorm and ReLu non-linearity. The final size of the reconstructed image is 64x64. The discriminator processes the image through 4 downsampling blocks, the feature embedding is sent to a linear layer and spatially replicated and concatenated with the image embedding, and this final embedding is passed through a convolutional and sigmoid layer to get the probability that the sample is real or fake. We train this model on all the real feature-image pairs of the 102 classes, and use the trained generator to invert images from synthetic features. In Figure 6.7, we show generated images from real and synthetic features for comparison. We observe that images generated from synthetic features contain the 6.4 experiments 107 … this flower has a wide brown center and tapered yellow petals. … this flower has a wide center and layers of wide, tapered yellow petals. This is a Sunflower because ... … this flower has petals that are white and has a bushy yellow center … the flower is big with white petals, and a bulb of yellow colored anthers. This is a Tree Poppy because ... … this flower has simple rows of overlapping orange petals with a notched tip of yellow stamen in the center. Se en C la ss es U ns ee n C la ss es This is a Marigold because ... … this flower has layers of long tapered pale yellow petals surrounding orange and red stamen. 
… this flower is pink in color, and has petals that are drooping downward. … this flower has pink petals that are pointed down, and a lot of red stamen in the center This is a Purple Coneflower because ... R S R S … this flower has red petals that have yellow tips. … this flower has petals that are red with yellow edges This is a Blanket Flower because ... … the petals of the flower are light pink, while the anthers are white and yellow. … this flower is pink and white in color, with petals that are rounded. This is a Pink Primrose because ... … the petals on this flower are mostly lavender in color and the inner stamen is the color purple. … this flower is green, white, and purple in color, and has petals that are oval shaped. This is a Passion Flower because ... … this flower has petals that are red with pointy tips … this flower has a lot of very thin red petals and a lot of white stamen on it This is a King Protea because ... C ha lle ng in g C la ss es R S … this flower has wide trumpet shaped purple flowers with a star shape. This is a Canterburry Bells because … … this flower has broad alternating leaves, and its pink colored petals are lighter pink. This is a Sweat Pea because … … the flowers color of the flower are visible. The stamen and pistil from it. This is a Balloon Flower because … … this flower has petals that are pink and white with green pedicel. … the petals on this flower are mostly bulb shaped purple. This is a Cameilla because … … the flower has five purple petals with white stamen and a white pistil. … this red flower has rounded petals and yellow stamen with yellow anthers. … the petals of the flower are layered in layers while the anthers and are yellow in color. Figure 6.7: Interpretability: visualizations by generating images and textual explana- tions from real or synthetic features. For every block, the top is the target, the middle is reconstructed from the real feature (R) of the target, the bottom is reconstructed from a synthetic feature (S) from the same class. We also generate visual explanations conditioned with the predicted class and the reconstructed real or synthetic images. Top (Middle): Features come from seen (unseen) classes. Bottom: classes with a large inter-class variation lead to poorer visualizations and explanations. essential attributes required for classification, such as the general color distribution and sometimes even features like the petal and stamen are visible. Also, the image quality is similar for the images generated from real and synthetic features. Inter- estingly, the synthetic features of unseen classes generated by our model without observing any real features from that class, i.e. “Unseen classes” and “S” row, also yield pleasing reconstructions. As shown in “Challenging Classes” of Figure 6.7, in some cases the generated images from synthetic features lack a certain level of detail, e.g. see images for “Balloon Flower” and in some cases the colors do not match with the real image, e.g. see images for “Sweat Pea”. We noticed that these correspond to classes with high inter class variation. Explaining visual features. We also explore generating textual explanations of our synthetic features. For this, we choose a language model (Hendricks et al., 2016), that produces an explanation of why an image belongs to a particular class, given a feature embedding and a class label. 
The architecture of our model is similar to (Hendricks et al., 2016), we use a linear layer for the feature embedding, and feed it as the start token for a LSTM. At every step in the sequence, we also feed the class embedding, to produce class relevant captions. The class embedding is obtained by 108chapter 6. enhanced feature generation frameworks for low-shot learning training a LSTM to generate captions from images, and taking the average hidden state for images of that class. A softmax cross entropy loss is imposed on the output using the ground truth caption. Also, a discriminative loss that encourages the generated sentence to belong to the relevant class is imposed by sampling a sentence from the LSTM and sending it to a pre-trained sentence classifier. The model is trained on the dataset from (Reed et al., 2016a). As before, we train this model on all the real feature-caption pairs, and use it to obtain explanations for synthetic features. In Figure 6.7, we show explanations obtained from real and synthetic features. We observe that the model generates image relevant and class specific explanations for synthetic features of both seen and unseen classes. For instance, a “King Protea” feature contains information about “red petals and pointy tips” while “Purple Coneflower” feature has information on “pink in color and petals that are drooping downward” which are the most visually distinguishing properties of this flower. On the other hand, as shown at the bottom of the figure, for classes where image features lack a certain level of detail, the generated explanations have some issues such as repetitions, e.g. “trumpet shaped” and “star shape” in the same sentence and unknown words, e.g. see the explanation for “Balloon Flower”. 6.5 conclusion In this work, we develop a transductive feature generating framework that syn- thesizes CNN image features from a class embedding. Our generated features circumvent the scarceness of the labeled training data issues and allow us to ef- fectively train softmax classifiers. Our framework combines conditional VAE and GAN architectures to obtain a more robust generative model. We further improve VAE-GAN by adding a non-conditional discriminator that handles unlabeled data from unseen classes. The second discriminator learns the manifold of unseen classes and backpropagates the WGAN loss to feature generator such that it generalizes better to generate CNN image features for unseen classes. Our feature generating framework is effective across zero-shot (ZSL), generalized zero-shot (GZSL), few-shot (FSL) and generalized few-shot learning (GFSL) tasks on CUB, FLO, SUN, AWA and large-scale ImageNet datasets. Finally, we show that our generated features are visually interpretable, i.e. the generated images by by inverting features into raw image pixels achieve an impressive level of detail. They are also explainable via language, i.e. visual explanations generated using our features are class-specific. 7 Z E R O - L A B E L A N D F E W - L A B E L S E M A N T I C S E G M E N T A T I O N Contents 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.3.1 Semantic Projection Network (SPNet) . . . . . . . . . . . . . 113 7.3.2 Baseline: Hinge Visual-Semantic Loss (HVSL) . . . . . . . . 115 7.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 116 7.4.1 Zero-Label Semantic Segmentation Task . . . . . . . . . . . . 116 7.4.2 Few-Label Semantic Segmentation Task . . . . . . . . . . . . 121 7.4.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . 123 7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 I n Chapters 3, 4, 5, and 6, we develop methods and define evaluation protocols for the image classification tasks. However, in fact, the long-tail issue almost appear in many computer vision applications. Semantic segmentation is one of the most fundamental problems in computer vision. As pixel-level labelling in this context is particularly expensive, there have been several attempts to reduce the annotation effort, e.g. by learning from image level labels and bounding box annotations. In this chapter, we take this one step further and propose zero- and few- label learning for semantic segmentation as a new task and propose a benchmark on the challenging COCO-Stuff and PASCAL VOC12 datasets. In the task of zero-label semantic image segmentation no labeled sample of that class was present during training whereas in few-label semantic segmentation only a few labeled samples were present. Solving this task requires transferring the knowledge from previously seen classes to novel classes. Our proposed semantic projection network (SPNet) achieves this by incorporating class-level semantic information into any network designed for semantic segmentation, and is trained in an end-to-end manner. Our model is effective in segmenting novel classes, i.e. alleviating expensive dense annotations, but also in adapting to novel classes without forgetting its prior knowledge, i.e. generalized zero- and few-label semantic segmentation. In Chapter 8, we will take a further step to address the few-shot learning chal- lenges in the video domain. 109 110 chapter 7. zero-label and few-label semantic segmentation classes with many samples classes with few samples Zero-label semantic segmentation Few-label semantic segmentation Training set Test set Semantic knowledge Our prediction Our prediction Figure 7.1: We propose (generalized) zero- and few-label semantic segmentation tasks, i.e. segmenting classes whose labels are not seen by the model during training or the model has a few labeled samples of those classes. To tackle these tasks, we propose a model that transfers knowledge from seen classes to unseen classes using side information, e.g. semantic word embedding trained on free text corpus. 7.1 introduction In semantic image segmentation the aim is assign a label to every pixel in an image by partitioning it into several semantic regions and then learning the appearance of various classes as well as the background. Although deep CNN-based approaches have achieved good performance for this task, they require costly dense annota- tions to learn their numerous parameters. Hence, leveraging weak annotations via image-level labels (Pathak et al., 2015; Papandreou et al., 2015; Oh et al., 2017) or point (Bearman et al., 2016), bounding box (Khoreva et al., 2017), scribble-level anno- tations (Lin et al., 2016) recently gained interest. On the other hand, as humans, we easily learn to recognize a previously unseen, i.e. novel, class by associating it with classes that we know. However, segmenting such novel classes via modern machine learning techniques is still an open problem as this process requires knowledge transfer from known classes to previously unseen ones. 
Knowledge transfer to novel classes is not a new task. Learning to predict novel classes has been studied extensively in the context of image classification, i.e. zero- shot learning (Lampert et al., 2013; Zhang and Saligrama, 2016; Changpinyo et al., 2016; Akata et al., 2015b). In zero-label semantic segmentation (ZLSS), our aim is to segment previously unseen, i.e. novel, classes, in few-label semantic segmenta- tion (FLSS) these novel classes have a small number of labeled training examples (see Figure 7.1). In this work, we also aim for learning without forgetting the previously seen classes, i.e. generalized ZLSS and FLSS. To achieve these aims, we propose Semantic Projection Network (SPNet) that incorporates semantic word embeddings to an arbitrary semantic segmentation network inspired by the success of zero-shot learning. Prior models that tackle few-shot semantic segmentation (Shaban et al., 7.2 related works 111 2017; Dong and Xing, 2018) operate in the foreground-background segmentation setting. However, in our definition of FLSS the model has to predict all the classes in an image separately, which is more challenging and realistic. Our framework utilizes the similarity between different categories in a semantic segmentation network, enabling it to transfer learned representations to other classes. Consequently, our model is able to segment scenes containing novel classes. Our main contributions are as follows. (1) We introduce the (generalized) zero- label and few-label semantic image segmentation task in a realistic settings inspired by zero-shot learning for image classification. (2) We propose semantic projection network (SPNet), an end-to-end semantic segmentation model which maps each image pixel to a semantic word embedding space where it is projected with a fixed word embedding to class probabilities optimizing the cross-entropy loss. (3) We cre- ate a benchmark for (generalized) zero- and few-label semantic image segmentation with two challenging datasets, i.e. COCO-Stuff and PASCAL-VOC. Our analysis shows that the SPNet model achieves impressive results both quantitatively and qualitatively in (generalized) zero-label and few-label tasks. Furthermore, as a side- product, our model improves the state of the art in zero-shot image classification demonstrating that it successfully generalizes to other tasks. 7.2 related works In this section, we review prior work on semantic segmentation and its combination with zero-shot learning. Related works on zero-shot learning have been extensively discussed in Chapter 2 and will not be repeated here. Semantic segmentation with weak supervision. Modern semantic segmentation systems (Long et al., 2015; Chen et al., 2018; Badrinarayanan et al., 2017) are built on the encoder-decoder networks and trained with densely labeled annotations. Much efforts focus on improving semantic segmentation under fully supervised settings, e.g. adding global context information (Zhao et al., 2017b; Zhang et al., 2018a; Liu et al., 2016), applying graphical models as a post-processing step to refine the output (Zheng et al., 2015; Chen et al., 2018), etc. On the other hand, weakly supervised semantic segmentation, i.e. reducing the annotation effort, has recently gained momentum. As weak supervision, prior works use image-level annotation (Pathak et al., 2015; Papandreou et al., 2015; Oh et al., 2017), point (Bearman et al., 2016), scribble (Lin et al., 2016) and bounding box (Khoreva et al., 2017) annotations. 
Those methods propagate the supervision to larger regions by measuring objectness (Bearman et al., 2016) and saliency (Oh et al., 2017), or applying graphical models (Lin et al., 2016). Other methods refine the coarse annotated regions to more accurate ones (Khoreva et al., 2017; Papandreou et al., 2015). However, those models still require all the classes to be seen during training, thus cannot easily be adapted to new classes. In contrast, we focus on segmenting completely novel classes. Semantic segmentation of novel classes. The term zero-shot semantic segmentation appears in prior works (Ji et al., 2018a; Zhao et al., 2017a). The aim of (Ji et al., 2018a) 112 chapter 7. zero-label and few-label semantic segmentation is to segment novel actor-action patterns during test time. While (Zhao et al., 2017a) proposes open-vocabulary scene parsing task that segments novel objects by performing hierarchical parsing, we leverage word embeddings to predict the exact unseen classes and address the few-label problem in a unified framework. For few-shot semantic segmentation, previous approaches (Shaban et al., 2017; Rakelly et al., 2018; Dong and Xing, 2018; Zhang et al., 2018b) follow the meta-learning setup (Vinyals et al., 2016; Snell et al., 2017), which uses a support set to predict an query image. However, those approaches are restricted to output a binary mask and fail to segment an image with multiple classes. In contrast, our approach is operating in the more realistic (generalized) few-label semantic segmentation setting, i.e. pixel-level labeling of an image where labels come from both base and novel classes. Semantic embeddings. In learning with limited labels, some form of side infor- mation is required to transfer the knowledge learned from seen classes to unseen classes. One popular form of side information is attributes (Lampert et al., 2013) that, however, require costly expert annotation. Thus, there has been a large group of studies (Akata et al., 2015b; Reed et al., 2016a; Qiao et al., 2016; Ding et al., 2017) utilizing other sources such as Word2vec (Mikolov et al., 2013b), fastText (Joulin et al., 2016a), or hierarchies (Miller, 1995) for building semantic embeddings. In this work, we utilize Word2Vec and fastText as they do not require dataset specific human annotation. 7.3 approach Modern semantic segmentation models are built on fully convolutional encoder- decoder architectures (Chen et al., 2018; Long et al., 2015) that output intermediate feature maps and posteriors for individual classes. However, to segment novel classes these models need to be adapted to transfer knowledge from one class to the other. Such knowledge can be obtained from class-level semantic embeddings associating different classes. Hence, the main insight of our approach is to leverage semantic word embeddings, i.e. word2vec (Mikolov et al., 2013b) or fast-text (Joulin et al., 2016a), to transfer knowledge learned from base classes to novel classes in a two-step process. First, we propose to learn a visual-semantic embedding module that produces intermediate feature maps in the word embedding space. Second, we project those feature maps into class probabilities via a fixed word embedding projection matrix. At test time, by replacing the projection matrix with word embeddings of novel classes, our model is able to segment unseen categories. Our model is trained end-to-end and can be incorporated into any semantic segmentation network, i.e. FCN (Long et al., 2015) and deeplab (Chen et al., 2018). 
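A minimal sketch of how such a fixed class embedding matrix can be assembled from pretrained word vectors; the dictionary-style lookup, the averaging of multi-word class names and the column normalization mirror the description in this chapter, while the function and variable names are illustrative.

import numpy as np

def class_embedding_matrix(class_names, word_vectors, dim=300):
    # Build the fixed projection matrix W (dim x num_classes) from per-word
    # vectors such as word2vec or fastText. Multi-word class names
    # (e.g. "traffic light") are averaged; columns are L2-normalized.
    W = np.zeros((dim, len(class_names)), dtype=np.float32)
    for j, name in enumerate(class_names):
        words = name.replace('-', ' ').split()
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        v = np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
        W[:, j] = v / (np.linalg.norm(v) + 1e-8)
    return W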
We illustrate our overall pipeline in Figure 7.2. Task formulation. We denote the set of seen classes as S and a disjoint set of unseen classes as U. Let Ds = {(x, y)|x ∈ X , y ∈ Ys} be our labeled training data of seen classes where x is an image in the image space X , y is its corresponding label mask 7.3 approach 113 Semantic Projection GT loss prediction Train Test Visual-semantic Embedding FCN DeepLab ... Segmentation Networks CNN Feature Maps Word Embedding Matrices ={horse, bush, ...} ={cow, grass, ...} Figure 7.2: Our zero-label and few-label semantic segmentation model, i.e. SPNet, consists of two steps: visual semantic embedding and semantic projection. Zero- label semantic segmentation is drawn as an instance of our model. Replacing different components of SPNet, four tasks are addressed (Solid/dashed lines show the training/test procedures respectively). in the dense label mask space Ys ⊂ Sa∗b of seen classes with a and b being the height and the width of the image respectively. Similarly, we define the label mask space of unseen classes as Yu ⊂Ua∗b. In addition, W s ∈ Rdw×|S| and W u ∈ Rdw×|U| denote the word embedding matrices of seen and unseen classes where dw is the word embedding dimension. Given Ds, W s, and W u, the task of zero-label semantic segmentation (ZLSS) is to learn a model that takes an image as an input and predicts the label of each pixel among unseen classes. A more realistic setting is generalized zero-label semantic segmentation (GZLSS) where the learned model predicts both seen and unseen classes. As for the (generalized) few-label semantic segmentation task, a few labeled samples from unseen classes Du = {(x, y)|x ∈ X , y ∈ Yu} are provided to the model during training. The test time target classes include only seen classes in few-label semantic segmentation (FLSS) whereas they include both seen and unseen classes in generalized few-label semantic segmentation (GFLSS). Here, we refer to the classes with a few labeled samples as unseen or novel, interchangeably. We summarize train class, test class and word embeddings used in different settings in Figure 7.2. 7.3.1 Semantic Projection Network (SPNet) We address all four tasks with an unified model SPNet, which consists of two parts: visual-semantic embedding module and semantic projection layer. i. Visual-semantic embedding module. This module is parameterized by a CNN and maps an input image x ∈X into dw feature maps via φ : X →Ra×b×dw of size a × b. This is equivalent to embedding each pixel at (i, j) into a dw dimensional class embedding vector φ(x)ij that lies in the semantic embedding space shared by all the classes. The semantic embedding space constrains the output of the 114 chapter 7. zero-label and few-label semantic segmentation visual-semantic embedding extractor φ and transfers knowledge from seen to unseen classes. Note that this is different from a standard CNN where pixels are mapped into an unconstrained feature space. ii. Semantic projection layer. The semantic projection layer maps the feature embedding φ(x)ij into unnormalized logit scores followed by a softmax activation that outputs the probability distribution over each training category, p(ŷij = s|x; W s) = exp (w>s φ(x)ij) ∑ c∈S exp (w>c φ(x)ij) (7.1) where ŷij represents the prediction for pixel (i, j), wc is the c-th column of W s normalized to have unit length. 
In contrast to standard CNNs that predict the class posterior by adding 1 × 1 convolution layer or fully connected layer with learnable weights, our classifier weights W s are predefined by a word embedding model, e.g. word2vec (Mikolov et al., 2013b), and then fixed during training. The W s and the semantic projection layer estimate the compatibility between class prototypes and a feature embedding in terms of inner product similarity. Our proposed semantic projection layer is easy to implement by computing the tensor product between feature maps φ(x) and word embedding matrix W s followed by the softmax activation function. After this layer, we directly optimize the standard cross-entropy loss over the spatial dimensions (i, j) ∈I, ∑ (i,j)∈I − log p(ŷij = yij|x) (7.2) which can be viewed as maximizing the negative log likelihood of predicting each pixel as its true label yij. Since there are no learnable parameters at the semantic projection layer, the optimization is over parameters of the visual-semantic embed- ding extractor φ. Compared to the standard semantic segmentation network, we have made subtle yet critical changes, i.e. mapping pixels to the semantic word embedding space followed by stacking a projection layer. Inference. At the test time, in ZLSS and FLSS, we predict unseen classes by replacing the word embedding matrix in Eq. (7.1) with W u. Each pixel label is predicted by: argmax u∈U p(ŷij = u|x; W u). (7.3) On the other hand, for GZLSS and GFLSS, we predict both seen and unseen class labels via their word embedding: argmax u∈S∪U p(ŷij = u|x; [W s; W u]). (7.4) The extreme case of the imbalanced data problem occurs when there is no labeled training images of unseen classes, and this results in predictions being biased to seen 7.3 approach 115 classes. To fix this issue, we follow (Chao et al., 2016) and calibrate the prediction by reducing the scores of seen classes, which leads to: argmax u∈S∪U p(ŷij = u|x; [W s; W u])− γI[u ∈S] (7.5) where I = 1 if u is a seen class and 0 otherwise, γ ∈ [0, 1] is the calibration factor tuned on a held-out validation set. Theoretically, the semantic projection layer allows our model to predict any class by simply copying its word embedding to the classifier weights. However, intuitively, the model can only perform well on the classes that share visual similarities with training classes. Hence, the word embedding ought to capture the similarity between classes. Two-stage training in few-label setting. In our FLSS and GFLSS, we train a model with both Ds that includes a large number of samples per seen class and Du that has only a few samples per unseen, i.e. novel, class. This is a typical imbalanced learning problem. The naive idea is to learn using both seen and unseen class samples within a mini-batch sampled uniformly from the whole training data. As expected, this leads to good performance on seen classes but inferior performance on unseen classes. Another strategy is to oversample unseen classes by first uniformly sampling a mini-batch of classes and selecting one sample from each of those classes. We found that this strategy remedies the imbalance issues to some extent but the results still remain unsatisfactory. On the other hand, fine-tuning the learned classifier on unseen class samples, i.e. after the initial optimization with only seen class samples, yields better results on unseen classes in FLSS as well as better overall results in GFLSS. Hence, we report our results in this setting. 
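The two pieces above, i.e. the parameter-free semantic projection layer with per-pixel cross-entropy (Eqs. 7.1 and 7.2) and the calibrated generalized inference (Eq. 7.5), are compact enough to sketch directly in PyTorch. The ignore-label value and the default calibration factor below are illustrative assumptions; in practice γ is tuned on held-out validation classes as described above.

import torch
import torch.nn.functional as F

def spnet_logits(feat_maps, W):
    # feat_maps: (B, d_w, H, W) pixel-wise visual-semantic embeddings;
    # W: (d_w, C) fixed, column-normalized word embedding matrix.
    # The semantic projection layer is a tensor product over the embedding dim.
    return torch.einsum('bdhw,dc->bchw', feat_maps, W)

def spnet_loss(feat_maps, W_seen, target, ignore_label=255):
    # Per-pixel cross-entropy of Eq. (7.2); only the backbone producing
    # feat_maps has learnable parameters, W_seen stays fixed.
    return F.cross_entropy(spnet_logits(feat_maps, W_seen), target,
                           ignore_index=ignore_label)

def gzlss_predict(feat_maps, W_seen, W_unseen, gamma=0.2):
    # Calibrated inference of Eq. (7.5): reduce seen-class scores by gamma.
    logits = spnet_logits(feat_maps, torch.cat([W_seen, W_unseen], dim=1))
    probs = F.softmax(logits, dim=1).clone()
    probs[:, :W_seen.shape[1]] -= gamma
    return probs.argmax(dim=1)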
7.3.2 Baseline: Hinge Visual-Semantic Loss (HVSL) The choice of the loss function turns out to be important in zero-label semantic segmentation. Hence, in this section, we develop a baseline that shares the same embedding extractor φ as our SPNet but adopts the hinge visual-semantic loss instead of cross-entropy loss. Indeed hinge visual-semantic loss constitutes the most widely used loss function for zero-shot image classification (Akata et al., 2015a; Bansal et al., 2018; Frome et al., 2013; Zhang and Saligrama, 2016; Xian et al., 2016). In the context of semantic segmentation, we define the following hinge ranking loss for a single training example (x, y) as, ∑ (i,j)∈I ∑ s∈S [∆(s, yij) + w > s φ(x)ij − w > yij φ(x)ij]+ (7.6) where ∆(s, yij) = 1 if s 6= yij otherwise 0, φ(x)ij is the visual-semantic embedding for pixel (i, j) in image x, yij is its corresponding ground-truth label. In practice, we follow (Frome et al., 2013) to truncate the sum by randomly sampling one class that is not ground-truth. 116 chapter 7. zero-label and few-label semantic segmentation 7.4 experiment In this section, we present both quantitative and qualitative results of zero-label semantic segmentation and few-label semantic segmentation. Datasets. We evaluate our model on the challenging COCO-stuff (Caesar et al., 2018) and PASCAL-VOC 2012 (Everingham et al.) datasets. COCO-stuff has 164K images with dense pixel-level annotations from 172 classes including 80 thing classes, 91 stuff classes. PASCAL-VOC is a smaller dataset which contains 13K images from 20 classes. Word embeddings. Encoding the semantic similarity between labels plays an im- portant role in bridging the gap between seen and unseen class predictions. In this work, we study two different word embedding models, i.e. word2vec (Mikolov et al., 2013b) trained on Google News (Wang et al., 2018a) and fastText (Joulin et al., 2016a) trained on Common Crawl (Mikolov et al., 2018). The word embeddings of classes that contain multiple words are obtained by averaging the embeddings of each individual word. Implementation details. We implement our SPNet model with PyTorch (Paszke et al., 2017). We apply ImageNet pretrained VGG-16 (Simonyan and Zisserman, 2014b) and ResNet-101 (He et al., 2016) as our backbone to extract features, and our model is built on the DeepLab-v2 (Chen et al., 2018) that first extract features and apply atrous spatial pyramid pooling layer to produce the visual features, whose dimension is the same as the dimension of the semantic embedding space (i.e., 300 for fast-text and word2vec; 600 for their concatenation). In this work, for VGG backbone we apply Adam solver (Kingma and Ba, 2014) with initial learning rate 1.0 × 10−4, and for ResNet we use SGD with initial learning rate 2.5 × 10−4. Following (Chen et al., 2018), we use the “poly” learning rate policy where current learning rate is the initial one multiplied by (1 − itermax iter ) power, and we set power to 0.9. Momentum and weight decay are set to 0.9 and .0005. 7.4.1 Zero-Label Semantic Segmentation Task One of the contributions of our work is to propose a new task of zero-label semantic segmentation (ZLSS). In this section, we propose two benchmarks with zero-label data splits and detail the zero-label evaluation protocol. Proposed zero-label dataset splits. The zero-label assumption, i.e. 
similar to the zero-shot assumption (Xian et al., 2019b), states that none of the pixel values of the query images are allowed to belong to the classes that were used in any part of the training procedure, i.e. be it the model training or CNN training. This means that as CNNs are commonly trained on ImageNet 1K, none of the test classes should overlap with it. Following this rule, in COCO-Stuff dataset, we create a new zero-label class split by selecting 15 classes as unseen and the rest of the 167 classes as seen classes as they appear in ImageNet 1K which was used to pretrain ResNet. 7.4 experiment 117 # classes # images train+val test train+val test COCO-Stuff 155+12 15 116287+2000 5000 PASCAL-VOC 12+3 5 11185 + 500 1449 Table 7.1: Statistics of data splits for COCO-Stuff and PASCAL-VOC datasets in terms of the number of classes and the number of images in the training and test splits. In contrast to zero-shot image classification, we do not remove images that contain unseen classes from the training set, otherwise most of training images will be eliminated because seen and unseen classes co-occur frequently. Instead, we utilize the whole training set but ignore the labels of pixels belonging to unseen classes during training, i.e. these pixels do not effect the loss we optimize in any stage of the training. For PASCAL-VOC, since (a) only 4 classes are unseen in ImageNet 1K, (b) one of the candidate class ‘person’ has no semantically similar class present in the dataset, (c) all vehicles appear in ImageNet thus reducing candidate diversity - we simply take the first 15 classes as seen classes and the last 5 classes as unseen classes. We use the train/val split provided by the COCO-Stuff dataset: 118K training images as our training set and 5K validation images as our test set, and PASCAL-VOC: 11K training images and 1.4K test images. Following the cross- validation procedure of (Xian et al., 2019b), we further hold out a subset of training classes as our validation set for tuning hyperparameters. More details about our data splits are shown in Table 7.1. Evaluation protocol. The intersection-over-union (IoU), i.e. the standard evaluation criteria commonly used in semantic segmentation, quantizes the overlap between the predicted mask and the target mask. It is defined to be the size of the intersection between predicted and target regions divided by the union of them. For each class, its mean IoU is computed by averaging the IoU over all the query images. In ZLSS, as the test-time search space is restricted to be unseen classes we report the mean IoU averaged over unseen classes. In GZLSS, the search space becomes the union of seen and unseen classes. In analogy to generalized zero-shot image classification (Xian et al., 2019b), we report the mean IoU on seen classes, the mean IoU on unseen classes and the harmonic mean (H) of them, which is defined as, H = 2 ∗ mIoUseen ∗ mIoUunseen mIoUseen + mIoUunseen (7.7) where mIoUseen and mIoUunseen represents the mean IoU of seen classes and unseen classes respectively. Similarly, in few-label semantic segmentation, we report mean IoU on unseen classes, but in generalized few-label semantic segmentation, the mean IoU over all classes is reported. 118 chapter 7. zero-label and few-label semantic segmentation fastText (ft) word2vec (w2v) ft + w2v HVSL 25.8 25.3 31.8 SPNet 33.1 32.1 35.2 Table 7.2: Effect of word embeddings: Mean IoU of unseen classes in ZLSS with different word2vec, fastText and their combination on COCO-Stuff. 
Both HVSL and SPNet are based on ResNet101. COCO-Stuff PASCAL VOC SPNet-VGG 26.3 47.4 SPNet-ResNet101 35.2 49.5 Table 7.3: Effect of CNN architectures: ZLSS with different CNN architectures, i.e. VGG and ResNet101 on COCO-Stuff and PASCAL-VOC. Word embedding is the ft + w2v. 7.4.1.1 SPNet Model Analysis for ZLSS In this section, we provide an extensive evaluation for different design choices of our model. Effect of word embeddings. We compare our SPNet model with HVSL and study the effect of different word embeddings in Table 7.2. We investigate three types of word embeddings, i.e. fastText, word2vec and their concatenation. Our first observation is that SPNet performs significantly better than HVSL wrt. all the word embedding types, e.g. SPNet achieves 33.1 vs 25.8 with fastText, and 32.1 vs 25.3 with word2vec compared to HVSL. This implies that the cross-entropy loss is more suitable to the ZLSS task than hinge loss. Furthermore, we observe that fastText and word2vec achieve comparable results, and combining them significantly boosts the performance, e.g. mean IoU of SPNet are improved from 33.1 and 32.1 to 35.2. This indicates that fastText and word2vec contain complementary information. Hence, for the rest experiments, we use SPNet with fastText and word2vec combined. Effect of CNN architectures. Our aim here is to compare different CNN architec- tures that are used as the backbone network to encode images in DeepLab-v2 (Chen et al., 2018). Table 7.3 shows the ZLSS results with VGG16 (Simonyan and Zisserman, 2014b) and ResNet101 (He et al., 2016). We first observe that with VGG16, the results are lower than with ResNet101 on both COCO-Stuff and PASCAL-VOC which im- plies that ResNet101 generate stronger features than VGG16 for this task. Besides, these results show that our SPNet achieves reasonably good results in ZLSS with both CNN architectures. Specifically, on COCO-stuff, SPNet obtains 26.3% mIoU with VGG16 and 35.2% mIoU with ResNet101. This is promising because our model does not require expensive dense pixel-level annotations for each class, e.g. it is not trained with any of the 15 unseen class labels of COCO-Stuff. This also indicates 7.4 experiment 119 Figure 7.3: mIoU of unseen classes on COCO-Stuff ordered wrt average object size (left to right). that our model is easily adapted to various semantic segmentation architectures. Effect of the object size. We study the difficulty of zero-label semantic segmentation as a function of object sizes. Figure 7.3 presents a plot of per class mIoU score for the unseen classes in COCO-Stuff. The classes are ordered according to their average object sizes – with the largest on the right. It shows that there is a tendency that the performance is better for classes with larger objects. The plot also indicates that the knowledge transfer from seen to unseen classes is in general successful for the challenging stuff classes, such as, tree (59.3%), grass (59.7%, clouds (62.2%), considering the fact that they do not have semantically similar classes present in ImageNet 1K. We also observe that our model performs well for cow (61.3%) however the result is quite poor the other unseen animal class giraffe (0.2%). 7.4.1.2 Generalized Zero-Label Semantic Segmentation GZLSS is a practical segmentation setting as the test time search space contains both seen and unseen classes, i.e. the pixel can be assigned to one of the seen or one of the unseen classes. 
Since the training images contain only labeled pixels of seen classes, at the test time, prediction will be biased to seen classes. Hence, this is a particularly challenging task. We alleviate this issue by using the calibrated classifier formulated in Eq. (7.5), which reduces the prediction scores of seen classes by a calibration factor γ. We select the optimal γ value based on the best harmonic mean IoU on a held-out validation set. Figure 7.4 shows the mean IoU on unseen classes, seen classes and their harmonic mean on COCO-Stuff and PASCAL VOC datasets. On COCO-Stuff SPNet obtains 0.2% mean IoU on unseen classes while IoU on 120 chapter 7. zero-label and few-label semantic segmentation Figure 7.4: GZLSS results on COCO-Stuff and PASCAL-VOC. We report mean IoU of unseen classes, seen classes and their harmonic mean (perception model is based on ResNet101 and the semantic embedding is ft + w2v). SPNet-C represents SPNet with calibration. ZSL GZSL CUB SUN AWA CUB SUN AWA ALE 54.9 58.1 59.9 34.4 26.3 27.5 SJE 53.9 53.7 65.6 33.6 19.8 19.6 SYNC 56.3 55.6 54.0 19.8 13.4 16.2 GFZSL 49.3 60.6 68.3 0.0 0.0 3.5 SPNet 56.5 60.7 66.2 36.6 39.6 24.7 Table 7.4: SPNet loss on (generalized) zero-shot learning tasks. Top-1 accuracy on unseen classes is reported for ZSL and harmonic mean of seen and unseen classes is for GZSL. seen classes is high, i.e. 34.05%. This is expected, in fact the same trend is observed in generalized zero-shot image classification task (Xian et al., 2019b; Chao et al., 2016). On the other hand, after calibration i.e. SPNet-C, on COCO-Stuff, mean IoU of unseen classes jumps to 8.33% while maintaining high mIoU on seen classes, i.e. 34.52% and overall SPNet-C achieves a harmonic mean of 13.42%. This is due to the fact that after calibration, i.e. reducing prediction scores of seen classes, pixels get predicted as seen classes less frequently. On PASCAL-VOC we observe a similar trend. While SPNet performs poorly on unseen classes, i.e. 0.01% mIoU, with calibration this increases to 29.33% mIoU. Accordingly, SPNet-C achieves an impressive 42.45% harmonic mIoU. These results demonstrate that our SPNet does not only tackle ZLSS but also can handle the more practical GZLSS via predictor calibration. 7.4 experiment 121 1 2 5 10 20 # training samples per class 20 30 40 50 60 70 m Io U o ve r U n se e n C la ss e s (i n % ) COCO-Stuff SPNet Baseline 1 2 5 10 20 # training samples per class 20 40 60 80 100 m Io U o ve r U n se e n C la ss e s (i n % ) PASCAL-VOC SPNet Baseline Figure 7.5: Few-label semantic segmentation (FLSS) on COCO-Stuff and PASCAL VOC with increasing number of training samples per class, i.e. n ∈{1, 2, 5, 10, 20}. 7.4.1.3 (Generalized) Zero-Shot Image Classification We evaluate our SPNet on the zero-shot image classification task on three benchmark datasets, i.e. CUB (Welinder et al., 2010) (200 types of birds with 312 attributes), SUN (Patterson and Hays, 2012) (717 scenes with 102 attributes) and AWA (Lampert et al., 2013) (50 classes of animals with 85 attributes) with various sizes and com- plexities, following the data splits and evaluation protocol of (Xian et al., 2019b). We train SPNet with cross-entropy loss: L(x, y) = − log exp (φ(x)>Vwy) ∑c∈S exp (φ(x)>Vwc) (7.8) where φ(x) is 2048-dim image feature extracted from a pre-trained ResNet101 (no fine-tuning on the task), wc ∈ Rdw is the class attribute of class c, V ∈ R2048×dw is the linear embedding we aim to learn. 
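Spelled out as code, the classifier of Eq. (7.8) is a single learnable linear map into the attribute space followed by a softmax over fixed class attributes; a minimal PyTorch sketch, with dimensions shown for CUB purely as an illustration:

import torch.nn as nn
import torch.nn.functional as F

class LinearCompatibility(nn.Module):
    # Sketch of Eq. (7.8): score(x, c) = phi(x)^T V w_c, with phi(x) a fixed
    # 2048-d ResNet-101 feature and w_c the attribute vector of class c
    # (e.g. 312-dimensional for CUB).
    def __init__(self, feat_dim=2048, attr_dim=312):
        super().__init__()
        self.V = nn.Linear(feat_dim, attr_dim, bias=False)

    def forward(self, feats, class_attrs):
        # feats: (B, feat_dim); class_attrs: (attr_dim, num_classes)
        return self.V(feats) @ class_attrs       # (B, num_classes) logits

# Usage sketch: training uses the standard cross-entropy over seen classes,
#   loss = F.cross_entropy(model(feats, W_seen), labels),
# and at test time W_seen is simply replaced by the unseen-class attributes.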
Table 7.4 shows that both in ZSL and GZSL settings, our SPNet improves over the state of the art on both CUB and SUN while it obtains the second best results on AWA despite the simplicity of our model. Both ALE (Akata et al., 2015a) and SJE (Akata et al., 2015b) utilize the visual-semantic hinge loss, SYNC (Changpinyo et al., 2016) align visual and semantic embedding space using manifold learning, and GFZSL (Verma and Rai, 2017) learns a generative model to capture the class conditional distribution. However, our SPNet simply projects image feature into the class embedding space and apply the standard softmax classifier with the class embedding being the weights. 7.4.2 Few-Label Semantic Segmentation Task The (Generalized) few-label semantic segmentation (FLSS and GFLSS) tasks arise in many real-world applications since class distribution in semantic segmentation 122 chapter 7. zero-label and few-label semantic segmentation 1 2 5 10 20 # training samples per class 20 22 24 26 28 30 32 34 O ve ra ll m Io U ( in % ) COCO-Stuff SPNet Baseline 1 2 5 10 20 # training samples per class 10 20 30 40 50 60 70 80 90 O ve ra ll m Io U ( in % ) PASCAL-VOC SPNet Baseline Figure 7.6: Generalized few-label semantic segmentation (GFLSS) on COCO-Stuff and PASCAL VOC with increasing number of training samples per class, i.e. n ∈ {1, 2, 5, 10, 20}. is usually skewed, e.g. there are far more road pixels than bicycles. In contrast to ZLSS where the training set has no labeled example from unseen (novel) classes, in FLSS and GFLSS, the model is trained with all classes. At the evaluation time, the goal of FLSS is to segment only the novel classes, while GFLSS aims to segment both base and novel classes. For each novel class, we randomly draw n ∈{1, 2, 5, 10, 20} images that contain this class from the training set and disable ignore-label condition for those novel pixels. In addition, we develop a simple baseline based on the original DeepLab-v2 (Chen et al., 2018), which is finetuned on novel classes after an initial optimization on base classes. We carry out experiments in FLSS and GFLSS with the baseline and our SPNet on COCO-Stuff and PASCAL-VOC. In FLSS task, Figure 7.5 shows the comparison results with the baseline model (Chen et al., 2018). Our SPNet yields significantly better results than the baseline in all cases on both COCO-Stuff and PASCAL VOC. In particular, when there is only 1 labeled example, our SPNet significantly outperforms the baseline, achieving a mean IoU of 47.90% over 27.69% in COCO-Stuff and 71.52% over 29.17% in PASCAL VOC on FZLSS. The accuracy improvement from 1 labeled sample to 5 labeled samples is significant, i.e. ≈ 20% mIoU for both COCO-Stuff and PASCAL VOC. These results demonstrate the effectiveness of our SPNet when the training samples are scarce. As for GFLSS in Figure 7.6, a similar trend is observed. Our SPNet improves over DeepLab in all cases. The accuracy improvement is steady from 1 to 2, 5, 10, 20 especially on COCO-Stuff. The difference between DeepLab and ours is 21.24% mIoU over both seen and unseen classes on PASCAL VOC when our model has access to only one labeled sample from novel classes. 7.4 experiment 123 (b) (a) Figure 7.7: Qualitative results of our SPNet in 0-, 1- and 5-label semantic segmenta- tion settings on COCO-Stuff on 15 novel classes (color coded at the top). Base classes are masked out with black color. (a) promising results (b) failure cases. 
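For completeness, the few-label protocol used in Section 7.4.2, i.e. drawing n images per novel class and un-ignoring their novel-class pixels before the fine-tuning stage, can be sketched as follows; the data structures and names are illustrative assumptions.

import random

def sample_few_label_support(image_to_classes, novel_classes, n=5, seed=0):
    # image_to_classes: dict mapping image id -> set of classes in its mask.
    # For each novel class, draw n training images containing that class;
    # only on these images are the novel-class pixels no longer ignored.
    rng = random.Random(seed)
    support = {}
    for c in novel_classes:
        candidates = [i for i, cls in image_to_classes.items() if c in cls]
        support[c] = rng.sample(candidates, min(n, len(candidates)))
    return support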
7.4.3 Qualitative Results Figure 7.7 shows the qualitative results obtained by our SPNet in ZLSS and FLSS on COCO-Stuff. Our target 15 novel classes are encoded with the colors shown at the top. Base classes are masked out with black color. Some interesting results are as follows. In the first row and left column, our SPNet is already able to segment two previously unseen classes cows and grass at ZLSS, i.e. 0-label, and results get refined after the model sees more examples. It is also worth noting that our SPNet is able to predict stuff classes, such as road, river, clouds etc., in ZLSS setting. For instance, SPNet successfully segments clouds and roads in the image at the second row and right column, and perfectly segments the river in the image at the third row and left column. Another interesting result is in the left column of 4th row where the model correctly segments the frisbee in 0-label setting but incorrectly labels most pixels as ‘skateboard’ which in fact is another sports category object. On the other hand, some failure cases are shown in the bottom row. Our SPNet fails to predict giraffe at 0-label because shape and appearance of a giraffe vary significantly from seen classes. However, seeing only 1 example is enough to recognize and segment it, which demonstrates the ability of our SPNet in learning from few examples. Again, the result gets refined with 5 labeled examples. These results support our observations in the previous sections and indicate that our SPNet, although simple, adapts its knowledge attained in previously seen examples to unseen ones. 124 chapter 7. zero-label and few-label semantic segmentation 7.5 conclusions In this work, we propose SPNet to semantically segment novel classes with no labeled examples or with only a few samples, within the new tasks of zero-label semantic segmentation and few-label semantic segmentation respectively. This model consists of a visual-semantic embedding module that encodes images in the word embedding space and a semantic projection layer that produces class probabilities. Our SPNet is both conceptually and computationally simple but surprisingly effective and end-to- end trainable. We have shown its applicability across zero-shot image classification to zero-label and few-label semantic segmentation tasks on various benchmark datasets. 8 G E N E R A L I Z E D M A N Y- WAY F E W - S H O T V I D E O C L A S S I F I C A T I O N Contents 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 8.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 8.3 R-3DFSV Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 8.3.1 3D CNN for FSV (3DFSV) . . . . . . . . . . . . . . . . . . . . 129 8.3.2 Retrieval-enhanced 3DFSV (R-3DFSV) . . . . . . . . . . . . . 131 8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 8.4.1 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . 132 8.4.2 Comparing with the state-of-the-art . . . . . . . . . . . . . . 134 8.4.3 Increasing the number of classes in FSV . . . . . . . . . . . . 136 8.4.4 Evaluating base and novel classes in GFSV . . . . . . . . . . 137 8.4.5 Ablation study and retrieved clips . . . . . . . . . . . . . . . 138 8.4.6 Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . 140 8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 

141 In Chapters 3, 4 and 5, we show that semantic embeddings can be used as an effective way for knowledge transfer on image classification tasks. We extend this idea to the semantic segmentation field in Chapter 6. While most works on few-shot learning are in the image domain, there are many real-world applications that take videos as input, e.g., self-driving cars and video surveillance. Therefore, in this chapter, we study how to develop efficient methods for the few-shot video classification task, where there are only a few training examples per class. Our main ideas are to improve video representation learning and to address the lack of training data. We argue that existing methods with 2D CNNs are unable to learn temporal information and thus develop a simple 3D CNN baseline, surpassing existing methods by a large margin. To circumvent the need for labelled examples, we propose to leverage weakly-labelled videos from a large dataset using video tag retrieval followed by selecting the best clips with visual similarities, yielding further improvement. Our results saturate current 5-way benchmarks for few-shot video classification and therefore we propose a more challenging benchmark involving more classes and a mixture of classes with varying supervision. Figure 8.1: Leveraging the lack of class-labeled videos (time-consuming to obtain) with tag-labeled videos, few-shot videos and text, our 3D CNN saturates existing benchmarks and enables the more challenging generalized few-shot multi-way video classification task. 8.1 introduction In the video domain, annotating data is very time-consuming due to the additional time dimension. A lack of labelled training data is even more prominent in fine-grained scenarios such as action recognition. For some fine-grained action classes at the “tail” of the skewed long-tail distribution (see Figure 8.1 for an illustration), e.g., “arabesque in ballet”, collecting enough training videos may not even be possible. It is thus of great importance to investigate how to learn to classify videos in the limited labeled training data regime. Visual recognition methods that operate in the few-shot learning setting aim to generalize a classifier trained on known classes (often referred to as base classes) with enough training data to unknown (novel) classes with only a few labelled training examples. While considerable attention has been devoted to the scenario of few-shot image classification (Vinyals et al., 2016; Qi et al., 2018; Ravi and Larochelle, 2016; Chen et al., 2019), few-shot video classification is relatively unexplored. Existing few-shot video classification approaches (Zhu and Yang, 2018; Cao et al., 2019) are mostly based on frame-level features extracted from a 2D CNN, which essentially ignores the important temporal information. Although additional temporal modules have been added on top of a pre-trained 2D CNN, necessary temporal cues may be lost when temporal information is learned on top of static image features. We argue that under-representing temporal cues may negatively impact the robustness of the classifier.
In fact, in the few-shot scenario it may be risky for the model to rely exclusively on appearance and context cues extrapolated from the few examples available. In order to make temporal information available, we propose to represent the videos by means of a 3D CNN. While obtaining labelled videos for target classes is time-consuming and challenging, there are many weakly-labelled videos available on the internet, e.g. there are 400,000 tag-labelled videos in the YFCC100M (Thomee et al., 2015) dataset. Our second goal is thus to leverage such tag-labelled videos (Figure 8.1) to alleviate the lack of training data for our few-shot video models. Existing experimental settings for few-shot video classification (Zhu and Yang, 2018; Cao et al., 2019) are limited. Searching for the label among only 5 novel classes, i.e. classes with few-shot videos, in each testing episode is restrictive. Moreover, restricting the search space to novel classes at test time, i.e. the test set consists only of videos from novel classes and models only have to predict novel classes while ignoring the base classes, is unrealistic because in real-world applications test videos are expected to belong to any class. In this work, our goal is to push the progress of few-shot video classification in three ways: 1) To learn the temporal information, we revisit spatiotemporal CNNs in the few-shot video classification regime. We develop a 3D CNN baseline that maintains significant temporal information within short clips; 2) We propose to retrieve relevant tag-labeled videos from a large video dataset, i.e. YFCC100M, to circumvent the need for class-labeled videos of novel classes; 3) We extend the current few-shot video classification evaluation by introducing two challenging experimental settings. In the generalized few-shot video classification task, the search space has no restriction in terms of classes. In few-shot video classification with more ways, the search space goes beyond five towards all classes. Our extensive experimental results demonstrate that, on existing settings, spatiotemporal CNNs outperform the state-of-the-art by a large margin, and that, on our proposed settings, weakly-labeled videos retrieved using tags successfully tackle both of our new few-shot video classification tasks. 8.2 related work Low-shot learning setup. The low-shot image classification setting (Mensink et al., 2012; Ravi and Larochelle, 2016; Hariharan and Girshick, 2017) uses a large-scale fully labelled dataset for pre-training a DNN and a low-shot dataset with a small number of examples from a disjoint set of classes. The terminology “k-shot n-way classification” means that in the low-shot dataset there are n distinct classes and k examples per class for training. Evaluating with few examples (k small) is bound to be noisy. Therefore, the k training examples are often sampled several times and accuracy results are averaged (Hariharan and Girshick, 2017; Douze et al., 2018). Many authors focus on cases where the number of classes n is small as well, which amplifies the measurement noise. For that case, (Ravi and Larochelle, 2016) introduce the notion of “episodes”. One episode is one sampling of n classes and k examples per class. It is feasible to use distinct datasets for pre-training and low-shot evaluation.
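The episode-based protocol described here is easy to pin down in code; a minimal sampling sketch, where the query-set size and all names are illustrative rather than the exact protocol of any particular benchmark:

import random

def sample_episode(examples_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    # One low-shot episode: draw n_way classes, then k_shot support and
    # n_query query examples per class without overlap; accuracy is
    # averaged over many such episodes.
    rng = random.Random(seed)
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        items = rng.sample(examples_by_class[c], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query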
It is feasible to use distinct datasets for pre-training and low-shot evaluation. However, to avoid dataset bias (Torralba et al., 2011), it is easier to split a large supervised dataset into a set of "base" classes and a set of "novel" classes. The evaluation is most often performed only on novel classes, except in (Hariharan and Girshick, 2017; Xian et al., 2019c; Schoenfeld et al., 2019), which evaluate on the combination of base+novel classes. Recently, a low-shot video classification setup has been proposed (Zhu and Yang, 2018; Dwivedi et al., 2019). They use the same type of decomposition of the dataset as (Ravi and Larochelle, 2016), with learning episodes and random sampling of low-shot classes. In this work, we follow and extend the evaluation protocol of (Zhu and Yang, 2018).

Tackling low-shot learning. The simplest low-shot learning approach is to extract embeddings from the images using the pre-trained trunk and train a linear classifier (Akata et al., 2015a) or logistic regression (Hariharan and Girshick, 2017) on top using the k available training examples. Another approach is to cast low-shot learning as a similarity search problem (Wang et al., 2019b). The "imprinting" approach (Qi et al., 2018) consists in building a linear classifier from the embeddings of training examples and then fine-tuning it. It also belongs to this family, since it is equivalent to class-mean similarity search with a cosine distance. As a complementary approach, (Joulin et al., 2016b) has looked into exploiting noisy labels to aid classification. By leveraging tags of 100M images from the YFCC100M dataset (Thomee et al., 2015), they show improvements over ImageNet pretraining. In this work, we use videos from YFCC100M retrieved by tags to augment and improve the training of our classifier.

In a meta-learning setup, the low-shot classifier is assumed to have hyperparameters or parameters that must be adjusted before training. Thus, there is a preliminary meta-learning step that consists in training those parameters on simulated episodes sampled from the main training data. Matching networks (Vinyals et al., 2016) "meta-learns" an LSTM that maps the low-shot training examples into a classifier. Feature hallucination (Wang et al., 2018c) meta-learns how to generate additional training data for novel classes, directly in the feature space. In MAML (Finn et al., 2017), the embedding classifier is meta-learned to adapt quickly and without overfitting to fine-tuning. Recent works (Chen et al., 2019; Wang et al., 2019b) suggest that state-of-the-art performance can be obtained by methods that do not need meta-learning. In particular, (Chen et al., 2019) show that meta-learning methods are less useful when the image descriptors are expressive enough, which is the case when they come from high-capacity networks trained on large datasets. Therefore, we focus on techniques that do not require a meta-learning stage.

Deep descriptors for videos. Moving from hand-designed descriptors (Dollár et al., 2005; Laptev, 2005; Sadanand and Corso, 2012; Wang and Schmid, 2013) to learned deep-network based descriptors (Feichtenhofer et al., 2016a,b; Karpathy et al., 2014; Simonyan and Zisserman, 2014a; Wang et al., 2016; Tran et al., 2015) has been enabled by labeled large-scale datasets (Kay et al., 2017; Karpathy et al., 2014) and parallel computing hardware.
Deep descriptors are either based on 2D-CNN models operating on a frame-by-frame basis with temporal aggregation (Girdhar et al., 2017; Yue-Hei Ng et al., 2015), or more commonly on 3D-CNN models operating on short sequences of images that we refer to as video clips (Tran et al., 2015, 2018). Recently, ever-more-powerful descriptors have been developed leveraging two-stream architectures using additional modalities (Feichtenhofer et al., 2016b; Simonyan and Zisserman, 2014a), factorized 3D convolutions (Tran et al., 2018, 2019), or multi-scale approaches (Feichtenhofer et al., 2019). While most of these descriptors are trained in a fully supervised way, advances in learning deep descriptors in either a weakly-supervised (Yalniz et al., 2019; Ghadiyaram et al., 2019; Mahajan et al., 2018) or self-supervised fashion have been explored as well (Korbar et al., 2018; Owens and Efros, 2018).

8.3 r-3dfsv approach

In the few-shot learning setting (Zhu and Yang, 2018), classes are split into two disjoint label sets, i.e., base classes (denoted as C_b) that have a large number of training examples, and novel classes (denoted as C_n) that have only a small set of training examples. Let X_b denote the training videos with labels from the base classes and X_n the training videos with labels from the novel classes (|X_b| ≫ |X_n|). Given the training data X_b and X_n, the goal of the conventional few-shot video classification task (FSV) (Zhu and Yang, 2018; Cao et al., 2019) is to learn a classifier which searches for the labels among novel classes at test time. As the test-time search space is restricted to novel classes, the FSV setting is unrealistic. Thus, in this chapter, we additionally study generalized few-shot video classification (GFSV), which allows videos at test time to belong to any base or novel class.

8.3.1 3D CNN for FSV (3DFSV)

In this section, we introduce our spatiotemporal CNN baseline for few-shot video classification (3DFSV). Our approach in Figure 8.2 consists of 1) a representation learning stage which trains a spatiotemporal CNN on the base classes, 2) a few-shot learning stage that trains a linear classifier for novel classes with few labelled videos, and 3) a testing stage which evaluates the model on unseen test videos. The details of each of these stages are given below.

Representation learning. Our model uses a spatiotemporal CNN (Tran et al., 2018) φ : R^{F×3×H×W} → R^{d_v}, encoding a short, fixed-length video clip of F RGB frames with spatial resolution H × W into a feature vector in the d_v-dimensional Euclidean space. On top of the feature extractor φ, we define a linear classifier f(·; W_b), parameterized by a weight matrix W_b ∈ R^{d_v×|C_b|}, producing a probability distribution over the base classes. The objective is to jointly learn the network φ and the classifier W_b by minimizing the cross-entropy classification loss on video clips randomly sampled from the training videos X_b of base classes.
Figure 8.2: Our approach is composed of three steps: representation learning, few-shot learning and testing. In representation learning, we train an R(2+1)D network, either from random initialization or from a Sports1M-pretrained model, on the base classes of our target dataset. In few-shot learning, given few-shot support videos from novel classes, we first retrieve a list of candidate videos for each class from YFCC100M (Thomee et al., 2015) using their tags, and then select the best matching short clips from the retrieved videos using visual features. Those clips serve as additional training examples to learn classifiers that generalize to novel classes at test time.

More specifically, given a training video x ∈ X_b with a label y ∈ C_b, the loss for a video clip x_i ∈ R^{F×3×H×W} sampled from video x is defined as

L(x_i) = − log σ(W_b^T φ(x_i))_y,    (8.1)

where σ denotes the softmax function that produces a probability distribution and σ(·)_y is the probability at class y. Following (Chen et al., 2019), we do not perform meta-learning, so we can use all the base classes as a whole to learn the network φ.

Few-shot learning. This stage aims to adapt the learned network φ to recognize novel classes C_n with a few training videos X_n. To reduce overfitting, we fix the network φ and learn a linear classifier f(·; W_n) by minimizing the cross-entropy loss on video clips randomly sampled from videos in X_n, where W_n ∈ R^{d_v×|C_n|} is the weight matrix of the linear classifier. Similarly, we define the loss for a video clip x_i sampled from x ∈ X_n with a label y as

L(x_i) = − log σ(W_n^T φ(x_i))_y.    (8.2)
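As a minimal PyTorch-style sketch of the clip-level cross-entropy objective in Equations (8.1) and (8.2): the `backbone` module (standing in for φ) and the clip data loader are assumed to be given, and all names are illustrative rather than the actual thesis code. Setting `finetune_backbone=False` corresponds to the few-shot learning stage, where only the linear classifier is updated.

```python
import torch
import torch.nn as nn

def train_clip_classifier(backbone, clip_loader, num_classes, feat_dim=512,
                          finetune_backbone=True, lr=0.001, epochs=10, device="cuda"):
    """Minimize the clip-level cross-entropy of Eq. (8.1)/(8.2).

    finetune_backbone=True:  representation learning on base classes (learns phi and W_b).
    finetune_backbone=False: few-shot stage, phi is frozen and only W_n is learned.
    `backbone` is assumed to map a clip batch to (batch, feat_dim) features.
    """
    classifier = nn.Linear(feat_dim, num_classes).to(device)
    backbone = backbone.to(device)
    backbone.requires_grad_(finetune_backbone)
    backbone.train(finetune_backbone)

    params = list(classifier.parameters())
    if finetune_backbone:
        params += list(backbone.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()   # -log softmax(W^T phi(x_i))_y

    for _ in range(epochs):
        for clips, labels in clip_loader:          # clips randomly sampled from videos
            clips, labels = clips.to(device), labels.to(device)
            feats = backbone(clips)
            if not finetune_backbone:
                feats = feats.detach()             # keep phi fixed in the few-shot stage
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```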
Testing. The spatiotemporal CNN operates on fixed-length video clips of F RGB frames, and the classifiers make clip-level predictions. At test time, the model must predict the label of a test video x ∈ R^{T×3×H×W} with arbitrary time length T. We achieve this by randomly drawing a set of L clips {x_i}_{i=1}^{L} from video x, where x_i ∈ R^{F×3×H×W}. The video-level prediction is then obtained by averaging the prediction scores after the softmax function over those L clips. For few-shot video classification (FSV), this is

(1/L) Σ_{i=1}^{L} f(x_i; W_n).    (8.3)

For generalized few-shot video classification (GFSV), both base and novel classes are taken into account and we concatenate the base class weights W_b learned in the representation learning stage with the novel class weights W_n learned in the few-shot learning stage:

(1/L) Σ_{i=1}^{L} f(x_i; [W_b; W_n]).    (8.4)

8.3.2 Retrieval-enhanced 3DFSV (R-3DFSV)

During few-shot learning, fine-tuning the network φ or learning the classifier f(·; W_n) alone is prone to overfitting. Moreover, class-labeled videos to be used for fine-tuning are scarce. Instead, our hypothesis is that leveraging a massive collection of weakly-labeled real-world videos will improve our novel-class classifier. Thus, for each novel class, we propose to retrieve a subset of weakly-labelled videos, associate pseudo-labels to these retrieved videos and use them to expand the training set of novel classes. For efficiency and to reduce the label noise, we adopt the following two-step retrieval approach.

Tag-based video retrieval. The YFCC100M dataset (Thomee et al., 2015) includes around 800K videos collected from Flickr, with a total length of over 8,000 hours. Processing such a large collection of videos has a high computational demand, and a large portion of the videos are irrelevant to our target classes. Thus, we restrict ourselves to videos with tags related to the target class names. Leveraging information orthogonal to the actual video content increases the visual diversity. Given a video with user tags {t_i}_{i=1}^{S}, where t_i ∈ T is a word or phrase and S is the number of tags, we represent it by the average tag embedding (1/S) Σ_{i=1}^{S} ϕ(t_i). The tag embedding ϕ(·) : T → R^{d_t} maps each tag to a d_t-dimensional embedding space, e.g., Fasttext (Joulin et al., 2017). Similarly, we represent each class by the text embedding of its class name; then, for each novel class c, we compute its cosine similarity to all the video tag embeddings and retrieve the N most similar videos according to this similarity.
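A short sketch of this tag-based retrieval step under the assumptions above: each candidate video is represented by the average word embedding of its tags, each novel class by the embedding of its class name, and the N most similar videos are kept per class. `embed` stands for a word-embedding lookup such as FastText and is an assumed placeholder, not a prescribed API.

```python
import numpy as np

def tag_retrieval(class_names, video_tags, embed, n_per_class=20):
    """Rank tag-labelled videos for each novel class by the cosine similarity
    between the class-name embedding and the mean tag embedding of the video."""
    def normalize(v):
        return v / (np.linalg.norm(v) + 1e-8)

    # One d_t-dimensional vector per video: the average of its tag embeddings.
    video_vecs = np.stack([
        normalize(np.mean([embed(t) for t in tags], axis=0))
        for tags in video_tags              # video_tags: one list of tags per video
    ])

    retrieved = {}
    for name in class_names:
        sims = video_vecs @ normalize(embed(name))       # cosine similarities
        retrieved[name] = np.argsort(-sims)[:n_per_class]  # indices of the top-N videos
    return retrieved
```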
Selecting best clips. The video tag retrieval selects a list of N candidate videos for each novel class. However, those videos are not yet suitable for training because the annotation may be erroneous, which can harm the performance. Besides, some weakly-labelled videos can last as long as an hour. We thus propose to select the best short clips of F frames from those candidate videos using the few-shot videos of novel classes. Given the set of few-shot videos X_n^c from novel class c, we randomly sample L video clips from each video. We then extract features from those clips with the spatiotemporal CNN φ and compute the class prototype by averaging over the clip features. Similarly, for each retrieved candidate video of novel class c, we also randomly draw L video clips and extract clip features with φ. Finally, we perform a nearest neighbour search with cosine similarity to find the M clips best matching the class prototype. This can be formulated as

max_{x_j} cos(p_c, φ(x_j)),    (8.5)

where p_c denotes the class prototype of class c and x_j is a clip belonging to the retrieved weakly-labeled videos. After repeating this process for each novel class, we obtain a collection of pseudo-labeled video clips X_p = {X_p^c}_{c=1}^{|C_n|}, where X_p^c denotes the best M video clips from YFCC100M for novel class c.

Batch denoising. The retrieved video clips contribute to learning a better novel-class classifier f(·; W_n) in the few-shot learning stage by expanding the training set of novel classes from X_n to X_n ∪ X_p. However, X_p may inevitably include noisy video clips with wrong labels. During the optimization, we adopt a simple strategy to alleviate the noise: at each iteration, we construct a mini-batch with half of the video clips drawn from X_n and the other half from X_p. The purpose is to reduce the gradient noise in each mini-batch by enforcing that half of the samples are correct.
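A minimal sketch of these two steps, assuming clip features have already been extracted with φ: best-clip selection by cosine similarity to the class prototype (Equation 8.5), and batch denoising that draws half of each mini-batch from the clean few-shot clips and half from the pseudo-labelled retrieved clips. Function and variable names are illustrative.

```python
import numpy as np

def select_best_clips(support_feats, candidate_feats, m=5):
    """Eq. (8.5): pick the M candidate clips with the highest cosine similarity
    to the class prototype, i.e. the mean of the few-shot support clip features."""
    def l2(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    prototype = l2(support_feats.mean(axis=0))   # p_c
    sims = l2(candidate_feats) @ prototype       # cos(p_c, phi(x_j)) for every candidate clip
    return np.argsort(-sims)[:m]                 # indices of the M best clips

def denoised_batch(clean_clips, pseudo_clips, batch_size, rng=np.random):
    """Batch denoising: half of every mini-batch comes from the clean few-shot
    clips X_n, the other half from the noisier pseudo-labelled clips X_p."""
    half = batch_size // 2
    clean_idx = rng.choice(len(clean_clips), half, replace=True)
    pseudo_idx = rng.choice(len(pseudo_clips), batch_size - half, replace=True)
    return [clean_clips[i] for i in clean_idx] + [pseudo_clips[i] for i in pseudo_idx]
```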
8.4 experiments

In this section, we first describe the existing experimental settings and our proposed setting for few-shot video recognition. We then present results comparing our approaches with the state-of-the-art methods in the existing setting on two datasets, the results of our approach in our proposed settings, a model analysis and qualitative results.

8.4.1 Experimental settings

Here we describe the four datasets we use, previous few-shot video classification protocols and our settings.

Datasets. Kinetics (Kay et al., 2017) is a large-scale video classification dataset which covers 400 human action classes including human-object and human-human interactions. Its videos are collected from YouTube and trimmed to include only one action class. The UCF101 (Soomro et al., 2012) dataset is also collected from YouTube videos, consisting of 101 realistic human action classes, with one action label in each video. SomethingV2 (Goyal et al., 2017) is a fine-grained human action recognition dataset, containing 174 action classes, in which each video shows a human performing a predefined basic action, such as "picking something up" and "pulling something from left to right". We use the second release of the dataset. YFCC100M (Thomee et al., 2015) is the largest publicly available multimedia collection, with about 99.2 million images and 800k videos from Flickr. Although none of these videos are annotated with a class label, half of them (400k) have at least one user tag. We use the tag-labeled videos of YFCC100M to improve few-shot video classification.

                     # classes                # videos
               train   val   test      train    val    test
Kinetics         64     12     24       6400    1200   2400+2288
UCF101           64     12     24       5891     443    971+1162
SomethingV2      64     12     24      67013    1926   2857+5243

Table 8.1: Statistics of our data splits on the Kinetics, UCF101 and SomethingV2 datasets. We follow the train, val and test class splits of (Zhu and Yang, 2018) and (Cao et al., 2019) on Kinetics and SomethingV2, respectively. In addition, we add test videos (the second number in the test column) from train classes for GFSV. We also introduce a new data split on UCF101, and for all datasets we propose 5-, 10-, 15- and 24-way (the maximum number of test classes) as well as 1- and 5-shot settings.

Prior setup. The existing protocols of (Zhu and Yang, 2018) and (Cao et al., 2019) randomly select 100 classes from the Kinetics and SomethingV2 datasets, respectively. Those 100 classes are then randomly divided into 64, 12, and 24 non-overlapping classes to construct the meta-training, meta-validation and meta-testing sets. The meta-training and meta-validation sets are used for training models and tuning hyperparameters. In the testing phase of this meta-learning setting (Zhu and Yang, 2018; Cao et al., 2019), each episode simulates an n-way, k-shot classification problem by randomly sampling a support set consisting of k samples from each of the n classes, and a query set consisting of one sample from each of the n classes. While the support set is used to adapt the model to recognize novel classes, the classification accuracy is computed at each episode on the query set, and the mean top-1 accuracy over 20,000 episodes constitutes the final accuracy.

Proposed setup. The prior experimental setup is limited to n = 5 classes in each episode, even though there are 24 novel classes in the test set. As the performance in this setting saturates quickly, we extend it to 10-way, 15-way and 24-way settings. Similarly, the previous meta-learning setup assumes that test videos all come from novel classes. On the other hand, it is important in many real-world scenarios that the classifier does not forget about previously learned classes while learning novel classes. Thus, we propose the more challenging generalized few-shot video classification (GFSV) setting where the model needs to predict both base and novel classes. To evaluate an n-way k-shot problem in GFSV, in addition to a support and a query set of novel classes, at each test episode we randomly draw an additional query set of 5 samples from each of the 64 base classes. We do not sample a support set for base classes because the base class classifiers have been learned during the representation learning phase. We report the mean top-1 accuracy of both base and novel classes over 500 episodes.

The Kinetics, UCF101 and SomethingV2 datasets are used as our few-shot video classification datasets with disjoint sets of train, validation and test classes (see Table 8.1 for details). Here we refer to base classes as train classes. Test classes include the classes we sample novel classes from in each testing episode. For Kinetics and SomethingV2, we follow the splits proposed by (Zhu and Yang, 2018) and (Cao et al., 2019) respectively for a fair comparison. It is worth noting that 3 out of 24 test classes in Kinetics appear in Sports1M, which is used for pretraining our 3D ConvNet. But the performance drop is negligible if we replace those 3 classes with 3 other random Kinetics classes that are not present in Sports1M (more details can be found in the supplementary material). Following the same convention, we randomly select 64, 12 and 24 non-overlapping classes as train, validation and test classes from the UCF101 dataset, which is widely used for video action recognition. We ensure that in our splits the novel classes do not overlap with the classes of Sports1M. For the GFSV setting, in each dataset the test set includes samples from base classes coming from the validation split of the original dataset.

Implementation details. Unless otherwise stated, our backbone is a 34-layer R(2+1)D (Tran et al., 2018) pretrained on Sports1M (Karpathy et al., 2014), which takes as input video clips consisting of F = 16 RGB frames with a spatial resolution of H × W = 112 × 112. We extract clip features from the d_v = 512 dimensional top pooling units of the R(2+1)D.
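To illustrate, a sketch of clip feature extraction roughly matching these implementation details (16-frame clips at 112×112, 512-dimensional pooled features). It uses the 18-layer R(2+1)D bundled with torchvision as a stand-in, since the 34-layer Sports1M-pretrained model used in the thesis is not distributed with torchvision; input normalization and the exact clip-sampling scheme are simplified.

```python
import torch
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

def build_extractor(device="cuda"):
    """R(2+1)D backbone with the classification head removed, mapping a clip
    of shape (3, F, H, W) to a 512-dimensional pooled feature vector."""
    model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)  # Kinetics-pretrained stand-in
    model.fc = torch.nn.Identity()          # keep the 512-d top pooling output
    return model.to(device).eval()

@torch.no_grad()
def video_features(extractor, video, num_clips=10, frames=16, device="cuda"):
    """Randomly draw `num_clips` clips of `frames` frames from `video`
    (a tensor of shape (3, T, 112, 112) with T >= frames, already normalized)
    and return their clip-level features of shape (num_clips, 512)."""
    t = video.shape[1]
    starts = torch.randint(0, t - frames + 1, (num_clips,))
    clips = torch.stack([video[:, s:s + frames] for s in starts]).to(device)
    return extractor(clips)
```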
In the representation learning stage, we fine-tune the R(2+1)D with a constant learning rate of 0.001 on all datasets and stop training when the validation accuracy on the base classes saturates. We perform standard spatial data augmentation including random cropping and horizontal flipping. We also apply temporal data augmentation by randomly drawing 8 clips from a video in one epoch. In the few-shot learning stage, the same data augmentation is applied and the novel class classifier is learned with a constant learning rate of 0.01 for 10 epochs on all datasets. At test time, we randomly draw L = 10 clips from each video and average their predictions for a video-level prediction. As for the retrieval approach, we use the 400-dimensional (d_t = 400) Fasttext (Joulin et al., 2016a) embedding trained with GoogleNews. We first retrieve N = 20 candidate videos for each class with video tag retrieval and then select the M = 5 best clips among those videos with visual similarities.

8.4.2 Comparing with the state-of-the-art

In this section, we compare our model with the state-of-the-art in the existing evaluation settings, which mainly consider 1-shot, 5-way and 5-shot, 5-way problems and evaluate only on novel classes, i.e., FSV.

Method                         Kinetics            SomethingV2
                             1-shot  5-shot       1-shot  5-shot
CMN (Zhu and Yang, 2018)      60.5    78.9          -       -
CMN++ (Cao et al., 2019)      65.4    78.8         34.4    43.8
TAM (Cao et al., 2019)        73.0    85.8         42.8    52.3
3DFSV (ours, scratch)         48.9    67.8         57.9    75.0
3DFSV (ours, pretrained)      92.5    97.8         59.1    80.1
R-3DFSV (ours, pretrained)    95.3    97.8          -       -

Table 8.2: Comparison with the state-of-the-art few-shot video classification methods. We report top-1 accuracy on the novel classes of Kinetics and SomethingV2 for 1-shot and 5-shot tasks (both 5-way). 3DFSV (ours, scratch): our R(2+1)D is trained from scratch; 3DFSV (ours, pretrained): our model is trained from the Sports1M-pretrained R(2+1)D; R-3DFSV (ours, pretrained): our model with retrieved videos, trained from the Sports1M-pretrained R(2+1)D.

The baselines CMN (Zhu and Yang, 2018) and TAM (Cao et al., 2019) are considered the state-of-the-art in few-shot video classification. CMN (Zhu and Yang, 2018) proposes a multi-saliency embedding function to extract a video descriptor, and few-shot classification is then done by the compound memory network (Kaiser et al., 2017). TAM (Cao et al., 2019) proposes to leverage the long-range temporal ordering information in video data through temporal alignment. They additionally build a stronger CMN, namely CMN++, by using the few-shot learning practices from (Chen et al., 2019). We use their reported numbers for a fair comparison.

The results are shown in Table 8.2. As the code from CMN (Zhu and Yang, 2018) and TAM (Cao et al., 2019) was not available at the time of submission, we do not include UCF101 results for them. On Kinetics, we observe that our 3DFSV (pretrain) approach, i.e. without retrieval, outperforms the previous best results by over 19% in the 1-shot case (73.0% for TAM vs 92.5% for ours) and by 12% in the 5-shot case (85.8% for TAM vs 97.8% for ours). On the SomethingV2 dataset, we would first like to highlight that our 3DFSV (scratch) significantly improves over TAM, by 15.1% in 1-shot (42.8% for TAM vs 57.9% for ours) and by a surprising 22.7% in 5-shot (52.3% for TAM vs 75.0% for ours). This is encouraging because the 2D CNN backbone of TAM is pretrained on ImageNet, while our R(2+1)D backbone is trained from random initialization.
Our 3DFSV (pretrain) yields further improvement when using the Sports1M-pretrained R(2+1)D. We observe that the effect of the Sports1M-pretrained model on SomethingV2 is not as significant as on Kinetics because there is a large domain gap between the Sports1M and SomethingV2 datasets. Those results show that a simple linear classifier on top of a pretrained 3D CNN, e.g. R(2+1)D (Tran et al., 2018), performs better than sophisticated methods with a pretrained 2D ConvNet as a backbone. Although, as shown in C3D (Tran et al., 2015), I3D (Carreira and Zisserman, 2017) and R(2+1)D (Tran et al., 2018), spatiotemporal CNNs have an edge over 2D spatial ConvNets (He et al., 2016) in fully supervised video classification with enough annotated training data, we are the first to apply R(2+1)D to few-shot video classification with limited labeled data. It is worth noting that our R(2+1)D is pretrained on Sports1M, while the 2D ResNet backbone of CMN (Zhu and Yang, 2018) and TAM (Cao et al., 2019) is pretrained on ImageNet. A direct comparison between 3D CNNs and 2D CNNs is hard because they are designed for different input data. While it is standard to use an ImageNet-pretrained 2D CNN in image domains, it is common to apply a Sports1M-pretrained 3D CNN in video domains. One of our goals is to establish a strong few-shot video classification baseline with 3D CNNs. Intuitively, the temporal cue of the video is better preserved when clips are processed directly by a spatiotemporal CNN, as opposed to processing them as images via a 2D ConvNet. Indeed, even though we train our 3DFSV from random initialization on the SomethingV2 dataset, which requires strong temporal information, our results still remain promising. This confirms the importance of 3D CNNs for few-shot video classification.

Our R-3DFSV (pretrain) approach, i.e. with retrieved weakly-labeled video clips, leads to further improvements in the 1-shot case on the Kinetics dataset (3DFSV (pretrain) 92.5% vs R-3DFSV (pretrain) 95.3%). This implies that weakly-labeled videos retrieved from the YFCC100M dataset include discriminative cues for Kinetics tasks. In 5-shot, our R-3DFSV (pretrain) approach achieves similar performance to our 3DFSV (pretrain) approach; however, at 97.8% this task is almost saturated. We do not retrieve any weakly-labeled videos for the SomethingV2 dataset because it is a fine-grained dataset of basic actions and it is unlikely that YFCC100M includes any relevant videos for that dataset. In summary, although the 5-way classification setting is still challenging for methods with a 2D ConvNet backbone, the results saturate with the stronger spatiotemporal CNN backbone.

8.4.3 Increasing the number of classes in FSV

Although prior works evaluated few-shot video classification with 5 ways, i.e. the number of novel classes at test time is 5, our 5-way results are already saturated. Hence, in this section, we go beyond 5-way classification and extensively evaluate our approach in the more challenging 10-way, 15-way and 24-way few-shot video classification (FSV) settings. Note that we use one sample per class during training, i.e. one-shot video classification. As shown in Figure 8.3, our R-3DFSV method exceeds 95% accuracy on both the Kinetics and UCF101 datasets for 5-way classification. With an increasing number of novel classes, e.g. 10, 15 and 24, the performance degrades, as expected.
Note that our R-3DFSV approach with retrieval consistently outperforms our 3DFSV approach without retrieval, and the more challenging the task becomes, e.g. from 5-way to 24-way, the larger the improvement the retrieval approach achieves on Kinetics: our retrieval-based method is better than our baseline method by 2.8% in 5-way (3DFSV 92.5% vs R-3DFSV 95.3%) and the gap grows to 4.3% in 24-way (3DFSV 82.0% vs R-3DFSV 86.3%).

Figure 8.3: Results of 3DFSV and R-3DFSV on both Kinetics and UCF101 in the one-shot video classification setting (FSV). In this experiment we go beyond the classical 5-way classification setting and use 5, 10, 15 and 24 (all) of the novel classes in each testing episode. We report the top-1 accuracy of novel classes.

The trend of decreasing accuracy when going from 5-way to 24-way indicates that the more realistic task of few-shot video classification has not yet been solved, even with a spatiotemporal CNN. We hope that these results will encourage more progress in this challenging many-way few-shot video classification setting.

8.4.4 Evaluating base and novel classes in GFSV

The FSV setting makes the strong assumption that test videos all come from novel classes. In contrast to FSV, GFSV is more realistic and requires models to predict both base and novel classes in each testing episode. In other words, the 64 base classes become distracting classes when predicting novel classes, which makes the task more challenging. Intuitively, distinguishing novel and base classes is challenging because there are severe imbalance issues between the base classes with a large number of training examples and the novel classes with only few-shot examples.
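As a concrete illustration of the GFSV prediction rule in Equation (8.4), a short sketch assuming base and novel linear classifiers trained as in Section 8.3.1: clip-level softmax scores over the concatenated base and novel classes are averaged into a single video-level prediction. Names are illustrative, not the actual thesis code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def gfsv_predict(clip_feats, w_base, w_novel):
    """clip_feats: (L, d_v) features of the L clips drawn from one test video.
    w_base: (d_v, |C_b|) and w_novel: (d_v, |C_n|) linear classifier weights.
    Returns one predicted label over base + novel classes, following Eq. (8.4)."""
    weights = torch.cat([w_base, w_novel], dim=1)     # [W_b; W_n]
    scores = F.softmax(clip_feats @ weights, dim=1)   # per-clip class probabilities
    return scores.mean(dim=0).argmax().item()         # average over the L clips
```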
In this section, we evaluate our methods in this more realistic and challenging generalized few-shot video classification (GFSV) setting. In Table 8.3, on the Kinetics dataset, we observe a large performance gap between base and novel classes in both the 1-shot and 5-shot cases, i.e., 3DFSV only achieves 7.5% on novel classes vs 88.7% on base classes. The reason is that predictions of novel classes are dominated by the base classes. Interestingly, our R-3DFSV improves over 3DFSV on novel classes in both the 1-shot and 5-shot cases, e.g., 7.5% for 3DFSV vs 13.7% for R-3DFSV in 1-shot. A similar trend can be observed on the UCF101 dataset. Those results demonstrate that our retrieval-based approach can alleviate the imbalance issues to some extent. At the same time, we find that the generalized few-shot video classification (GFSV) setting, i.e. not restricting the test-time search space only to novel classes but considering all of the classes even though the base classes are distracting, is still a challenging task, and we hope that this setting will attract the interest of a wider community for future research.

                        Kinetics            UCF101
        Method        novel   base        novel   base
1-shot  3DFSV           7.5   88.7          3.5   97.1
        R-3DFSV        13.7   88.7          4.9   97.1
5-shot  3DFSV          20.5   88.7         10.1   97.1
        R-3DFSV        22.3   88.7         10.4   97.1

Table 8.3: Generalized few-shot video classification results on Kinetics and UCF101 in 5-way tasks. We report top-1 accuracy on both base and novel classes.

PR   SS   RL   VR   BD   BC     Acc
X                               27.1
          X                     48.9
     X    X                     51.9
X         X                     92.5
X         X    X                91.4
X         X    X    X           93.2
X         X    X    X    X      95.3

Table 8.4: Ablation study on the 5-way 1-shot video classification task on the meta-test set of Kinetics. PR: pretraining R(2+1)D on Sports1M; SS: self-supervised model of AVTS (Korbar et al., 2018); RL: representation learning on base classes; VR: retrieving unlabeled videos with tags (Thomee et al., 2015); BD: batch denoising; BC: best clip selection.

8.4.5 Ablation study and retrieved clips

In this section, we perform an ablation study to understand the importance of each component of our approach. After the ablation study, we evaluate the influence of the number of retrieved clips on the few-shot video classification (FSV) performance.

Ablation study. We ablate our model on the 1-shot, 5-way video classification task on the Kinetics dataset with respect to six critical parts: pretraining R(2+1)D on Sports1M (PR), the self-supervised model of (Korbar et al., 2018) as the backbone (SS), representation learning on base classes (RL), video retrieval with tags (VR), batch denoising (BD) and best clip selection (BC). Table 8.4 shows the results. We start from a model with only a few-shot learning stage on novel classes. If the PR component is added to the model (first result row in Table 8.4), the resulting model achieves 27.1% accuracy, which is only slightly better than random guessing (20%). This demonstrates that a pretrained 3D CNN alone is not sufficient for a good performance. Besides, it also indicates that there exists a domain shift between the pretraining dataset, i.e. Sports1M, and our target Kinetics dataset. Adding the RL component to the model (the second result row) means training the representation on base classes from scratch, which results in a worse accuracy of 48.9% compared to our full model. The primary reason for the worse results is that optimizing the massive number of parameters of R(2+1)D is difficult on a training set consisting of only 6,400 videos. Interestingly, if we adopt the self-supervised pretrained 3D CNN (MC3 pretrained on Kinetics without using any labels) of (Korbar et al., 2018), i.e., SS, we immediately get a 3.0% performance gain (the third result row) over training from random initialization. Adding both the PR and RL components (the fourth row) obtains an accuracy of 92.5%, which significantly improves over adding either PR or RL alone. Next, we study the two critical components proposed in our retrieval approach. Compared to our approach without retrieval (the fourth row), directly appending retrieved videos from YFCC100M (VR) to the few-shot training set of novel classes (the fifth result row) leads to a 0.9% performance drop, while performing batch denoising (the sixth row) in addition to VR obtains a 0.7% gain. This implies that noisy labels from retrieved videos may hurt the performance, but our batch denoising technique handles the noise well. Finally, adding the best clip selection (BC, the last row) after VR and BD gives a big boost of 2.8% accuracy.
In summary, these ablation studies demonstrate the effectiveness of the six critical parts of our approach.

Figure 8.4: The effect of increasing the number of retrieved clips; left: on Kinetics, right: on UCF101. Both experiments are conducted on the one-shot, five-way classification task, reporting top-1 accuracy in the few-shot video classification (FSV) setting.

Influence of the number of retrieved clips. Intuitively, when the number of retrieved clips increases, the retrieved videos become more diverse, but at the same time the risk of obtaining negative videos becomes higher. We show the effectiveness of our R-3DFSV with an increasing number of retrieved clips in Figure 8.4. On the Kinetics dataset (left of Figure 8.4), without retrieving any videos, the performance is 92.5%. As we increase the number of retrieved video clips for each novel class, the performance keeps improving and saturates at 8 retrieved clips per class, reaching an accuracy of 95.4%. On the UCF101 dataset (right of Figure 8.4), retrieving 1 clip gives us a 1.6% gain. Retrieving more clips does not further improve the results, indicating that more negative videos are retrieved. On the other hand, our batch denoising strategy is able to tolerate the noise to some extent. We observe a slight performance drop when retrieving 10 clips because the noise level becomes too high, i.e. there are 10 times more noisy labels than clean labels.

8.4.6 Qualitative results

In Figure 8.5, we visualize the top-5 video clips we retrieve from the YFCC100M dataset with video tag retrieval followed by best clip selection. Due to space limitations, we only show 8 novel classes of the Kinetics dataset here; visualizations of other classes are in the supplementary material.

Figure 8.5: Top-5 retrieved video clips from YFCC100M for 8 novel classes on Kinetics. The left column is the class name with its one-shot query video and the right column shows the retrieved 16-frame video clips (the middle frame is visualized) together with their user tags. Negative retrievals are marked in red.

We observe that the retrieved video clips of some classes are of high quality, meaning that those videos truly reveal the target novel classes. For instance, the retrieved clips of the class "Busking" are all correct because the user tags of those videos contain words like "buskers" and "busking" that are close to the class name, and the best clip selection can effectively filter out irrelevant clips. It is intuitive that those clips can potentially help to learn better novel class classifiers by supplementing the limited training videos.
Failure cases are also common. For example, the class "Cutting watermelon" does not retrieve any positive videos. The reasons can be that there are no user tags related to cutting watermelon or that our tag embeddings are not good enough. Those negative videos might hurt the performance if we treated them equally, which is why batch denoising is critical to reduce the effect of negative videos.

8.5 conclusion

In this work, we point out that a spatiotemporal CNN trained on a large-scale video dataset saturates existing few-shot video classification benchmarks. Hence, we propose new, more challenging experimental settings, namely generalized few-shot video classification (GFSV) and few-shot video classification with more ways than the classical 5-way setting. We further improve spatiotemporal CNNs by leveraging the weakly-labelled videos from YFCC100M, using weak labels such as tags for text-supported and video-based retrieval. Our results show that generalized more-way few-shot video classification is challenging and we encourage future research in this setting.

9 conclusions and future perspectives

Contents
9.1 Discussion of contributions
9.2 Future Perspectives
  9.2.1 Zero-shot image classification
  9.2.2 Few-shot image classification
  9.2.3 Zero-shot and few-shot learning beyond image classification
  9.2.4 A broader view on the topic

Significant progress has been made across various computer vision tasks in recent years. Deep neural networks have achieved great breakthroughs in reliable object recognition for up to 1000 object categories (He et al., 2016), in widely-applicable activity recognition (Carreira and Zisserman, 2017) and in robust semantic image segmentation (Chen et al., 2018) for autonomous driving. Despite this success, training a deep neural network always requires a massive amount of labeled instances. In real-world applications, labeled instances are often expensive and difficult to obtain because annotating data requires expert knowledge. Training a standard deep neural network on a small training set will lead to overfitting. It is thus of great importance to study the problems of learning with limited labeled data. This thesis aims to push the progress of the field by exploring how to transfer knowledge from known classes with enough labeled instances to novel classes with only limited labeled instances. More specifically, we focus on the following three directions: (1) zero-shot image classification, where novel object classes have zero training examples, (2) few-shot image classification, where each novel object class has only a few training examples, and (3) zero-shot and few-shot learning for semantic image segmentation and video action recognition. In the following, after summarizing the thesis with respect to these three directions, we discuss our contributions and future perspectives.

First, we examined zero-shot image classification. The goal of the task is to recognize novel object classes without observing any image instances of them by transferring knowledge from known to novel classes.
In order to capture the complex correlation between the image and semantic embedding spaces, we propose a piece-wise linear label embedding approach called LatEm that learns multiple linear transformations from the image embedding space to the semantic embedding space. As there is no agreed-upon zero-shot image classification benchmark, we first define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets. We re-evaluate a significant number of methods on our benchmark. Our analysis shows the status of the field and advocates studying the realistic generalized zero-shot learning problem where both known and novel classes are predicted during the test phase. To tackle the extreme data imbalance issue in generalized zero-shot learning, we introduce a feature generation framework, namely f-CLSWGAN, that synthesizes visual features for novel classes. We empirically show that f-CLSWGAN is effective at balancing the base and novel class performance and that the generated features can be applied to any zero-shot learning method. Additionally, we extend f-CLSWGAN to a stronger version called f-VAEGAN-D2, which combines VAE and GANs for a better generative model and can learn from unlabeled data as well.

The second direction of this thesis is concerned with few-shot image classification. The goal of the task is to recognize novel object classes after observing only a few instances of them. While human beings naturally have such an ability, deep neural networks are difficult to train on a small training set due to the high risk of overfitting. While most few-shot learning methods rely only on images of base classes for knowledge transfer, we argue that semantic embeddings, e.g., attributes, word embeddings and class hierarchies, provide complementary information that benefits novel classes. Therefore, we extend our zero-shot learning approaches, i.e., LatEm and f-VAEGAN-D2, to work in the few-shot learning setting. To this end, we generate few-shot learning splits on public datasets that are widely used for zero-shot learning. We show that our approaches have an edge over the standard linear classifier in few-shot image classification, indicating the benefits of using semantic embeddings. In addition, it is encouraging that our f-VAEGAN-D2 outperforms the state-of-the-art few-shot approaches on challenging large-scale few-shot benchmarks as well. Our experimental results also demonstrate that f-VAEGAN-D2 is able to obtain further improvement from unlabeled data.

The third part of the thesis looks at zero-shot and few-shot learning tasks beyond image classification. More specifically, we tackle both semantic image segmentation and video action recognition with limited training examples. While most few-shot and zero-shot works tackle image classification, there is little work on other computer vision tasks. To this end, we introduce the zero-label and few-label semantic segmentation problems and new data splits on public semantic segmentation datasets, i.e., COCO-Stuff and Pascal-VOC. The task is to segment novel classes with few or zero instances. Inspired by our previous experience in zero-shot image classification, we develop a novel approach called SPNet that projects each pixel into the semantic embedding space for knowledge transfer. Our SPNet can be incorporated into any semantic segmentation network.
We empirically show that it achieves decent results in the zero-label setting and outperforms the state-of-the-art methods in the few-label setting. In addition, we study the few-shot video classification problem. We found that previous methods focus only on developing complicated few-shot methods but fail to adopt a strong video representation that captures temporal information. Our work shows that a video representation with strong temporal modeling is critical for few-shot video classification. Moreover, we propose to leverage weakly-labeled videos from a large-scale video dataset to expand the few-shot training set, leading to further improvement.

In summary, this thesis defines a new zero-shot image classification benchmark. In order to improve the benchmark performance as well as few-shot image classification, we present a multi-modal learning approach and two further methods that generate synthetic visual features. We further tackle few-shot and zero-shot learning challenges for the semantic segmentation and video action classification tasks.

9.1 discussion of contributions

The goal of this thesis is to develop efficient methods to improve the performance of learning with limited labeled data. To this end, we study zero-shot and few-shot learning problems, which aim to learn novel classes with zero or only a few training examples. In the following, we discuss the contributions and steps we made towards these goals and tasks with respect to the individual chapters.

First, in Chapter 3, we presented a novel latent variable model, Latent Embeddings (LatEm), for learning a nonlinear (piece-wise linear) compatibility function for the task of zero-shot classification. LatEm is a multi-modal method: it uses images and class-level side-information obtained either through human annotation or in an unsupervised way from a large text corpus. LatEm incorporates multiple linear compatibility units and allows each image to choose one of them, such choices being the latent variables. We proposed a ranking-based objective to learn the model using an efficient and scalable SGD-based solver. We empirically validated our model on three challenging benchmark datasets for zero-shot classification of birds, dogs and animals. We improved the state-of-the-art for zero-shot learning using unsupervised class embeddings, i.e., word embeddings, on AWA and on two fine-grained datasets (CUB and Stanford Dogs). On AWA, we also improve the accuracy obtained with supervised class embeddings, i.e., human-annotated attributes. This demonstrates quantitatively that our method learns a latent structure in the embedding space through multiple compatibility units. We also presented a qualitative analysis of our results and showed that the latent embeddings learned with our method lead to visual consistencies. We proposed a new method for selecting the number of latent variables automatically from the data by pruning. Such a pruning-based method speeds up training and leads to models with competitive space-time complexities compared to the cross-validation-based method. We further extended our application domain to the generalized zero-shot and generalized few-shot learning settings, where at training time we assume the availability of either no or a few labeled samples from unseen classes. On the other hand, both at training and test time the search space includes all the class embeddings from seen and unseen classes.
As expected, our evaluation in the generalized zero-shot learning setting showed a significant loss of accuracy compared to the standard zero-shot learning setting, which we analyzed through visualizations and quantitative results. Our evaluation in the generalized few-shot setting showed that with as few as two to ten samples from unseen classes, unsupervised class embeddings can outperform the supervised attributes. Hence, with an increasing number of additional training samples, the differences between different class embeddings are reduced.

Second, in Chapter 4, we evaluated a significant number of state-of-the-art zero-shot learning methods, i.e. (Lampert et al., 2013; Zhang and Saligrama, 2015; Xian et al., 2016; Akata et al., 2015c; Romera-Paredes et al., 2015; Changpinyo et al., 2016; Socher et al., 2013; Norouzi et al., 2014; Frome et al., 2013; Akata et al., 2015a; Kodirov et al., 2017; Verm and Rai, 2017; Ye and Guo, 2017), on several datasets, i.e. SUN, CUB, AWA1, AWA2, aPY and ImageNet, within a unified evaluation protocol in both the zero-shot and generalized zero-shot settings. Our evaluation showed that generative models and compatibility learning frameworks have an edge over learning independent object or attribute classifiers, and also over other hybrid models, in the classic zero-shot learning setting. We observed that unlabeled data of unseen classes can further improve zero-shot learning results; thus it is not fair to compare transductive learning approaches with inductive ones. We discovered that some standard zero-shot dataset splits do not keep feature learning disjoint from the training stage, as several test classes are included in the ImageNet1K dataset that is used to train the deep neural networks that act as feature extractors. Therefore, we proposed new dataset splits, making sure that none of the test classes in any of the datasets belong to ImageNet1K. Moreover, a disjoint training and validation class split is a necessary component of parameter tuning in the zero-shot learning setting. In addition, we introduced a new Animals with Attributes (AWA2) dataset. AWA2 inherits the same 50 classes and attribute annotations from the original Animals with Attributes (AWA1) dataset, but consists of 37,322 different images with a publicly available redistribution license. Our experimental results showed that the 12 methods that we evaluated perform similarly on AWA2 and AWA1. Moreover, our statistical consistency test indicated that AWA1 and AWA2 are compatible with each other. Finally, including training classes in the search space while evaluating the methods, i.e. generalized zero-shot learning, provides an interesting playground for future research. Although the generalized zero-shot learning accuracy obtained with the 13 models is significantly lower than their zero-shot learning accuracy, the relative performance comparison of the different models remains the same. In summary, our work extensively evaluated the good and bad aspects of zero-shot learning while sanitizing the ugly ones.

Third, in Chapter 5, we propose f-CLSWGAN, a learning framework for feature generation followed by classification, to tackle the generalized zero-shot learning task. Our f-CLSWGAN model adapts the conditional GAN architecture that is frequently used for generating image pixels to generate CNN features.
In f-CLSWGAN, we improve WGAN by adding a classification loss on top of the generator, forcing it to generate features that are better suited for classification. In our experiments, we have shown that generating features of unseen classes allows us to effectively use softmax classifiers for the GZSL task. Our framework is generalizable as it can be integrated into various deep CNN architectures, i.e. GoogleNet and ResNet, two of the most widely used architectures. It can also be deployed with various classifiers, e.g. ALE, SJE, DEVISE, LATEM and ESZSL, which constitute the state of the art for ZSL; but the GZSL accuracy improvements obtained with softmax are also important, as it is a simple classifier that could not be used for GZSL before this work. Moreover, our features can be generated from different sources of class embeddings, e.g. sentences, attributes and Word2vec, and applied to different datasets, i.e. CUB, FLO, SUN and AWA as fine- and coarse-grained ZSL datasets, and ImageNet as a truly large-scale dataset. Finally, based on the success of our framework, we motivated the use of GZSL tasks as an auxiliary method for evaluating the expressive power of generative models, in addition to manual inspection of generated image pixels, which is tedious and prone to errors. For instance, WGAN (Gulrajani et al., 2017) has been proposed and accepted as an improvement over GAN (Goodfellow et al., 2014). This claim is supported by evaluations based on manual inspection of the images and the inception score. Our observations in Figure 5.4 and in Figure 5.6 support this and follow the same ordering of the models, i.e. WGAN improves over GAN in ZSL and GZSL tasks. Hence, while not being the primary focus of this chapter, we strongly argue that ZSL and GZSL are well suited as a testbed for comparing generative models.

Fourth, in Chapter 6, we develop a transductive feature generating framework that synthesizes CNN image features from a class embedding. Our generated features circumvent the scarcity of labeled training data and allow us to effectively train softmax classifiers. Our framework combines conditional VAE and GAN architectures to obtain a more robust generative model. We further improve VAE-GAN by adding a non-conditional discriminator that handles unlabeled data from unseen classes. The second discriminator learns the manifold of unseen classes and backpropagates the WGAN loss to the feature generator such that it generalizes better to generating CNN image features for unseen classes. Our feature generating framework is effective across zero-shot (ZSL), generalized zero-shot (GZSL), few-shot (FSL) and generalized few-shot learning (GFSL) tasks on the CUB, FLO, SUN, AWA and large-scale ImageNet datasets. Finally, we show that our generated features are visually interpretable, i.e. the images obtained by inverting features into raw image pixels achieve an impressive level of detail. They are also explainable via language, i.e. visual explanations generated using our features are class-specific.

Fifth, in Chapter 7, we propose SPNet to semantically segment novel classes with no labeled examples or with only a few samples, within the new tasks of zero-label semantic segmentation and few-label semantic segmentation respectively. This model consists of a visual-semantic embedding module that encodes images in the word embedding space and a semantic projection layer that produces class probabilities.
Our SPNet is both conceptually and computationally simple, yet surprisingly effective and end-to-end trainable. We have shown its applicability from zero-shot image classification to the zero-label and few-label semantic segmentation tasks on various benchmark datasets.

Finally, in Chapter 8, we point out that a spatiotemporal CNN trained on a large-scale video dataset saturates existing few-shot video classification benchmarks. Hence, we propose new, more challenging experimental settings, namely generalized few-shot video classification (GFSV) and few-shot video classification with more ways than the classical 5-way setting. We further improve spatiotemporal CNNs by leveraging the weakly-labelled videos from YFCC100M, using weak labels such as tags for text-supported and video-based retrieval. Our results show that generalized more-way few-shot video classification is challenging and we encourage future research in this setting.

9.2 future perspectives

The content of this thesis mainly focuses on establishing benchmarks and tackling imbalance issues for few-shot and zero-shot learning in various computer vision applications. Despite the progress we achieved, few-shot and zero-shot learning are still far from saturated. In the following, we first discuss items of future work with respect to the different directions of the thesis. In the last section we give a broader outlook on the field.

9.2.1 Zero-shot image classification

Most zero-shot learning methods, as well as the approaches proposed in this thesis, rely on a deep representation that is pretrained or finetuned following the standard supervised learning setting. We postulate that there exists a special image representation that is more efficient for zero-shot learning. In addition, as semantic embeddings play an important role in zero-shot learning, it is promising to explore better unsupervised semantic embeddings rather than annotating attributes. We lay out the following directions for future work.

Explainable zero-shot learning. This thesis has adopted human-annotated attributes for several datasets, i.e. CUB, AWA and SUN. While decent zero-shot results have been achieved with the attributes, we still lack an explainable approach that tells us how the zero-shot prediction is made. One possible way to improve visual explainability is by localizing semantic parts, e.g., "head of a bird", "beak of a bird", etc. Previous works (e.g. Zhang et al., 2016b, 2014) directly tackle the bird part detection problem by using part annotations, which are expensive to obtain. In future work, we are interested in introducing new intermediate layers into a CNN architecture such that bird parts can be localized using only class-level attributes. We believe such a representation network will naturally have better interpretability and potentially lead to better fine-grained zero-shot learning performance due to its better locality.

Improving locality and compositionality of the image representation. Zero-shot learning aims to achieve generalization to novel tasks. However, most existing zero-shot learning works rely on standard CNNs, which have a different goal, namely generalization within the same task. In future work, we are interested in exploring a special representation learning framework for zero-shot learning.
We are inspired by (Sylvain et al., 2019), which points out that locality and compositionality are the two representation learning principles that contribute to good zero-shot learning performance. Local features have a long history in computer vision. Traditional hand-crafted features, e.g. SIFT (Lowe, 2004) and SURF (Bay et al., 2008), extract statistics within local patches of an image and aggregate them to form a global image representation. Similarly, CNNs (LeCun et al., 2015) perform convolution operations on local patches of the image followed by a non-linearity and pooling. By stacking multiple such convolutional layers, CNNs increase their receptive field and obtain more global features. Local features can be beneficial to novel-task generalization because local information is often shared by many classes. In contrast, global information is often category-specific and requires many training examples to learn the within-class variations. Another direction is to explore the compositionality of the representation. The key insight is that a representation can encode classes more efficiently if it is composed of visual primitives. The challenge lies in how to define the compositional function and how to learn the visual primitives.

Compositional zero-shot learning. Most existing zero-shot learning works rely on attribute annotations to achieve the best performance. In real-world applications, attribute annotations are often not available. Compositional zero-shot learning (Purushwalkam et al., 2019) is a special zero-shot learning problem where attribute annotations are not available, but visual concepts are assumed to be composed of an adjective and an object, e.g. "red apple" and "green apple". The goal is to predict novel visual concepts that are unseen compositions of existing adjectives and objects. Interesting research directions are to explore how our feature generation idea can be adapted to this problem and how to learn compositional representations.

Graph convolutional networks (GCN) for large-scale zero-shot image classification. Zero-shot learning performance on the large-scale ImageNet is limited by the weakness of noisy word embeddings. Recently, (Wang et al., 2018b) significantly improved large-scale zero-shot learning performance by applying a graph CNN (Kipf and Welling, 2017) to the WordNet hierarchy. However, (Wang et al., 2018b) simply takes the original class hierarchy as input, ignoring the special tree structure of WordNet and visual similarities, and the GCN it uses suffers from over-smoothing. Therefore, we are interested in exploring better graph construction methods and new graph convolutional network techniques for large-scale zero-shot learning.

Learning unsupervised semantic embeddings. It is clear from this thesis that semantic embeddings play a critical role in zero-shot learning performance. Attributes often achieve the best results, but they require expert knowledge to annotate. Unsupervised word embeddings, i.e. word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), are easier to obtain but exhibit a large performance gap compared to attributes. Recently, a new language model called BERT (Devlin et al., 2018) has set a new state of the art on a wide range of NLP tasks.
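Before turning to such contextual language models, it helps to make concrete how the simpler unsupervised class embeddings discussed above are typically used. The following minimal NumPy sketch is illustrative only: the small vocabulary, the class names and the random vectors stand in for a pretrained word2vec or GloVe model, and the compatibility matrix stands in for a learned projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained word vectors (in practice loaded from word2vec/GloVe).
word_vectors = {w: rng.standard_normal(300) for w in
                ["black", "footed", "albatross", "indigo", "bunting"]}


def class_embedding(class_name: str) -> np.ndarray:
    """Average the word vectors of the tokens in a (multi-word) class name."""
    tokens = class_name.lower().replace("_", " ").split()
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)  # L2-normalise the class embedding


# Class embedding matrix for two hypothetical unseen bird classes.
classes = ["black footed albatross", "indigo bunting"]
E = np.stack([class_embedding(c) for c in classes])   # (num_classes, 300)

# A learned linear compatibility W maps a 2048-d image feature into the
# word-embedding space; here W and the feature are random for illustration.
W = rng.standard_normal((300, 2048)) * 0.01
x = rng.standard_normal(2048)                          # one CNN image feature
scores = E @ (W @ x)                                   # compatibility per class
print("predicted class:", classes[int(np.argmax(scores))])
```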
We believe it is promising to enhance such unsupervised embeddings by incorporating BERT (Devlin et al., 2018).

9.2.2 Few-shot image classification

Zero-shot and few-shot learning share the same goal of novel-task generalization. Therefore, we believe techniques that work for zero-shot learning can potentially work well for few-shot learning too. For this reason, it is interesting to investigate image representations with better locality and compositionality for few-shot learning. In addition, we consider the following topics to be promising directions.

Cross-domain few-shot learning. Significant improvements have been made in the few-shot learning setting where both base and novel classes belong to the same dataset, e.g. Mini-ImageNet and Omniglot. However, in many real-world applications, novel classes are likely to come from a different domain. For example, if the target novel classes belong to the medical imaging domain, it is difficult to collect a sufficient amount of base class data from the same domain. Therefore, we consider learning to adapt with limited labeled data to be an important direction for future few-shot learning research. Unlabeled data from novel classes could also help domain adaptation.

Generalized few-shot learning. The majority of few-shot learning methods are evaluated in the meta-learning setup, where a new set of classes is sampled from all the novel classes in each episode and the goal is to improve the novel class accuracy over many episodes. However, such an evaluation protocol is not realistic because it ignores the base classes. In real-world applications, we are interested in generalized few-shot learning, where the model has to predict both base and novel classes. The analogous setting in zero-shot learning has attracted increasing attention, but few few-shot learning works tackle this problem. We believe it is an important direction as well.

Semi-supervised few-shot learning. While obtaining labeled data is difficult, unlabeled data is often easy to collect. Therefore, it is of great importance to study semi-supervised few-shot learning, where the training set consists of a few labeled examples and a large number of unlabeled examples. Previous approaches are limited to classical semi-supervised learning techniques such as label propagation or semi-supervised SVMs. We are interested in combining a few-shot learning objective on the labeled data with self-supervised learning objectives on the unlabeled data. Given the success of recent self-supervised learning approaches (e.g. Chen et al., 2020; He et al., 2019), we believe those techniques would benefit few-shot learning.

Meta-learning. Meta-learning, or learning to learn, is a popular subfield of few-shot learning. The key insight is to exploit training classes for the purpose of learning a "meta procedure", e.g. an initialization or an optimization algorithm, that generalizes well to novel classes. This concept sounds appealing, but we are concerned about the limitations of the evaluation setting. More specifically, most papers are evaluated only on 5 classes with 1 or 5 samples per class in each episode. Recently, (Triantafillou et al., 2019) proposed a new large-scale meta-dataset that addresses those issues. We think it is interesting to study meta-learning on this more realistic benchmark.

Bayesian few-shot learning.
Most few-shot learning approaches produce a single model after learning from only a small number of training examples. However, the small training set leaves a lot of uncertainty about the novel classes, resulting in an ambiguous description of them, and it is unlikely that a single model can achieve accurate results on those novel classes. We believe that Bayesian learning could address this ambiguity by learning a distribution over models for novel classes. Unfortunately, previous Bayesian few-shot approaches (e.g. Gordon et al., 2018; Yoon et al., 2018; Finn et al., 2018) still do not achieve state-of-the-art results on Mini-ImageNet or the recent realistic meta-learning benchmark (Triantafillou et al., 2019). It would be important to further push the performance of Bayesian approaches such that they become more appealing in practice.

9.2.3 Zero-shot and few-shot learning beyond image classification

In addition to image classification, many other computer vision applications naturally face few-shot learning problems. Here we list a few applications we are interested in.

Learning stronger temporal information for few-shot video classification. Our approach for few-shot video classification does not capture long-term temporal information, which can be critical for recognizing actions. We are currently working on a project that aims to learn long-term temporal correlations in video through self-attention (Vaswani et al., 2017). Although self-attention is well established in the standard setting, it is not trivial to extend it to the few-shot learning setting.

Few-shot learning for medical image analysis. Medical image analysis has always been an important field of computer vision research. Its tasks include image segmentation, computer-aided disease diagnosis, and image registration for scans from CT, fMRI and X-ray. CheXNet (Rajpurkar et al., 2017) achieves radiologist-level pneumonia detection performance by learning a deep CNN on a large-scale chest X-ray dataset. However, such large-scale medical image datasets are not always feasible due to the huge cost of collecting medical images. For novel diseases or other medical imaging tasks, the few-shot learning challenges remain. We are excited to extend our expertise in few-shot learning to disease diagnosis from medical images. In particular, we plan to investigate knowledge transfer techniques for novel diseases.

Improving zero-label and few-label semantic segmentation. This thesis has made a first step towards the zero-label and few-label semantic segmentation problems. While we have shown that a semantic projection layer followed by the cross-entropy loss works well, we believe that exploring better loss functions is likely to lead to big improvements in the predictions. Furthermore, we found that the performance of generalized zero-label semantic segmentation is still unsatisfactory; we believe that exploring better semantic embeddings and specialized normalization techniques are promising directions for this issue.

Few-shot 3D computer vision. 3D computer vision is a critical field for virtual reality, robotics and autonomous driving because the real world is inherently 3D. Typical 3D vision tasks include 3D reconstruction, 3D human body modeling and 3D scene understanding such as detection and tracking.
Although deep learning has achieved big breakthroughs in 2D vision, we have not seen the same progress in 3D vision because collecting and processing 3D training data is difficult. We do not have much expertise in 3D vision and it is hard to suggest concrete ideas, but we are definitely interested in studying it in the near future.

9.2.4 A broader view on the topic

Our long-term goal is to develop machine perception that can generalize well after observing only limited labeled examples of novel tasks. Few-shot and zero-shot learning are simply two directions towards this goal. From a broader view, topics in learning with limited labeled data include, but are not limited to, self-supervised learning, the long-tailed recognition problem and multi-modal learning.

Semi-supervised and self-supervised learning. Semi-supervised and self-supervised learning are two practical solutions for learning with limited labeled data. While semi-supervised learning leverages unlabeled data in addition to labeled data, self-supervised learning learns from a completely unlabeled dataset by solving proxy tasks that make use of the structure of the input data. I am interested in developing an efficient learning algorithm that combines low-shot learning, semi-supervised learning and self-supervised learning.

Long-tailed recognition problem. Real-world datasets inherently follow a long-tailed distribution, i.e. the number of samples per class decreases exponentially. A reliable visual recognition system should perform well on all classes by balancing the dataset and transferring knowledge from known classes to novel classes. This is a very challenging task because it must handle imbalanced classification and low-shot learning at the same time. I believe developing robust novelty detection algorithms, specialized sampling methods, and normalization techniques that calibrate the predictions are promising directions.

Multi-modal learning. Learning from multiple modalities of data has been shown to reduce the amount of necessary training instances because different modalities often contain complementary information. In fact, human beings learn from multiple sensory modalities, i.e. the five classic types of human perception are vision (sight), audition (hearing), tactile stimulation (touch), olfaction (smell), and gustation (taste). While there have been many studies on learning with vision and language, little research has been done on combining those five sensory modalities (or subsets of them). I feel this holds the potential to improve self-supervised learning by predicting the correspondence between two or more modalities.

LIST OF FIGURES

1.1 In almost all real-world settings, the number of samples per category follows a skewed distribution, i.e. a few categories have a large number of samples while most categories have only a small number of samples (as shown in the left figure). The scarcity of samples results in poor generalization performance of powerful deep learning methods, which often require a huge amount of labeled data to train. In this thesis, we address the challenges of learning with limited labeled data in the scenarios of image classification (e.g. He et al., 2016), semantic segmentation (e.g. Long et al., 2015) and video classification (e.g. Tran et al., 2018). . . . . . . . . . . . . . . . . . . . . 2

3.1 Compatibility learning frameworks that use a linear projection, e.g. SJE Akata et al.
(2015c) (figure on the left) may lead to a large projection error, however learning a piece-wise linear model (figure on the right) leads to more precise projections. Here, crosses represent image embeddings and their projections on the class embedding space, W are the parameters of the compatibility function, solid circles represent the ground truth class embedding. . . . . . . . . . . . . . . . . . . . . . 29 3.2 Effect of latent variable K on CUB, AWA and Dogs datasets. We measure Top-1 Accuracy (in %) with the increasing number of latent models, i.e. K, learned with unsupervised class embeddings, i.e. w2v, glo, hie. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3 Top images ranked by the matrices using word2vec, glove, hierarchy and attribute class embeddings on CUB dataset, each row corresponds to different matrix in the model. Qualitative examples support our intuition – each latent variable captures certain visual aspects of the bird. Note that, while the images may not belong to the same fine- grained class, they share common visual properties. . . . . . . . . . . . 43 3.4 Left: Confusion matrix of all the classes on AWA dataset based on the latent factors learned using LatEm in the general setting (we use glo as class embedding). 10 unseen classes are shown at the top of the confusion matrix. Right: t-SNE visualization of the confusion matrix with seen and unseen classes marked with blue and red respectively. Visually similar classes such as chimpanzee and gorilla are embedded close to each other, hence being confused by the classifier. . . . . . . . 45 3.5 Generalized zero- and few-shots learning settings evaluated on all for CUB, AWA and Dogs using att (where available), w2v, glo and hie embeddings. We show the Top-1, Top-5 and top-10 Accuracy (in%) with the increasing number of images per unseen class used during training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 155 156 list of figures 4.1 Zero-shot learning (ZSL) vs generalized zero-shot learning (GZSL): At training time, for both cases the images and attributes of the seen classes (Ytr) are available. At test time, in the ZSL setting, the learned model is evaluated only on unseen classes (Yts) whereas in GZSL setting, the search space contains both training and test classes (Ytr ∪Yts). To facilitate classification without labels, both tasks use some form of side information, e.g. attributes. The attributes are annotated per class, therefore the labeling cost is significantly reduced. 53 4.2 Comparing AWA1 (Lampert et al., 2013) and our AWA2 in terms of number of images (Left) and t-SNE embedding of the image features (the embedding is learned on AWA1 and AWA2 simultaneously, there- fore the figures are comparable). AWA2 follows a similar distribution as AWA1 and it contains more examples. . . . . . . . . . . . . . . . . . 61 4.3 Robustness of 10 methods evaluated on SUN, CUB, AWA1, aPY using 3 validation set splits (results are on the same test split). Top: original split, Bottom: proposed split (Image embeddings = ResNet). We measure top-1 accuracy in %. . . . . . . . . . . . . . . . . . . . . . . . . 69 4.4 Ranking 12 models by setting parameters on three validation splits on the standard (SS, left) and proposed (PS, right) setting. Element (i, j) indicates number of times model i ranks at jth over all 4 × 3 observations. Models are ordered by their mean rank (displayed in brackets). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
70 4.5 Zero-Shot Learning experiments on Imagenet, measuring Top-1, Top- 5 and Top-10 accuracy. 2/3 H = classes with 2/3 hops away from ImageNet1K training classes (Ytr), M500/M1K/M5K denote 500, 1K and 5K most populated classes, L500/L1K/L5K denote 500, 1K and 5K least populated classes, All = The remaining 20K categories of ImageNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.6 GZSL on Imagenet, measuring Top-1, Top-5 and Top-10 accuracy. 2/3H: classes with 2/3 hops away from ImageNet1K Ytr, M500/M1K/M5K: 500/1K/5K most populated classes, L500/L1K/L5K: 500/1K/5K least populated classes, All: Remaining 20K classes. . . . . . . . . . . . . . . 75 4.7 Ranking 13 models on the proposed split (PS) in generalized zero-shot learning setting. Top-Left: Top-1 accuracy (T1) is measured on unseen classes (ts), Top-Right: T1 is measured on seen classes (tr), Bottom: T1 is measured on Harmonic mean (H). . . . . . . . . . . . . . . . . . . . . 76 4.8 Zero-shot (left) and generalized zero-shot learning (right) results in the transductive learning setting on our Proposesd Split. . . . . . . . . 77 list of figures 157 5.1 CNN features can be extracted from: 1) real images, however in zero- shot learning we do not have access to any real images of unseen classes, 2) synthetic images, however they are not accurate enough to improve image classification performance. We tackle both of these problems and propose a novel attribute conditional feature generating adversarial network formulation, i.e. f-CLSWGAN, to generate CNN features of unseen classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.2 Our f-CLSWGAN: we propose to minimize the classification loss over the generated features and the Wasserstein distance with gradient penalty. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.3 Zero-shot learning results when comparing f-xGAN versions with f-GMMN as well as comparing multimodal embedding methods with softmax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4 Generalized zero-shot learning results when comparing f-xGAN ver- sions with f-GMMN as well as comparing multimodal embedding methods with softmax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5 Measuring the seen class accuracy of the classifier trained on generated features of seen classes w.r.t. the training epochs (with softmax). . . . 90 5.6 Increasing the number of generated f-xGAN features wrt unseen class accuracy (with softmax) in ZSL. . . . . . . . . . . . . . . . . . . . . . . . 91 5.7 ZSL and GZSL results on ImageNet (ZSL: T1 on Yu, GZSL: T1 on Yu). The splits, ResNet features and Word2Vec are provided by (Xian et al., 2017). “Ours” = feature generator: f-CLSWGAN, classifier: softmax. . 92 6.1 Our any-shot feature generating framework learns discriminative and interpretable CNN features from both labeled data of seen and unlabeled data of novel classes. . . . . . . . . . . . . . . . . . . . . . . 96 6.2 Our any-shot feature generating network (f-VAEGAN-D2) consist of a feature generating VAE (f-VAE), a feature generating WGAN (f-WGAN) with a conditional discriminator (D1) and a transductive feature generator with a non-conditional discriminator (D2) that learns from both labeled data of seen classes and unlabeled data of novel classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.3 Top-1 ZSL results on ImageNet. 
We follow the splits in (Xian et al., 2019b) and compare our results with the state-of-the-art feature gener- ating model CLSWGAN (Xian et al., 2018). . . . . . . . . . . . . . . . . 103 6.4 Few-Shot Learning (FSL) results on CUB and FLO with increasing number of training samples per novel class. We report the top-1 accuracy on novel classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.5 Generalized Few-Shot Learning (GFSL) results on CUB and FLO with increasing number of training samples per novel class. We report the top-1 accuracy on all classes. . . . . . . . . . . . . . . . . . . . . . . . . 105 6.6 Few Shot Learning results on ImageNet with increasing number of training samples per novel class (Top-5 Accuracy). Left: FSL setting, Right: GFSL setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 158 list of figures 6.7 Interpretability: visualizations by generating images and textual ex- planations from real or synthetic features. For every block, the top is the target, the middle is reconstructed from the real feature (R) of the target, the bottom is reconstructed from a synthetic feature (S) from the same class. We also generate visual explanations conditioned with the predicted class and the reconstructed real or synthetic images. Top (Middle): Features come from seen (unseen) classes. Bottom: classes with a large inter-class variation lead to poorer visualizations and explanations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.1 We propose (generalized) zero- and few-label semantic segmentation tasks, i.e. segmenting classes whose labels are not seen by the model during training or the model has a few labeled samples of those classes. To tackle these tasks, we propose a model that transfers knowledge from seen classes to unseen classes using side information, e.g. semantic word embedding trained on free text corpus. . . . . . . . 110 7.2 Our zero-label and few-label semantic segmentation model, i.e. SP- Net, consists of two steps: visual semantic embedding and semantic projection. Zero-label semantic segmentation is drawn as an instance of our model. Replacing different components of SPNet, four tasks are addressed (Solid/dashed lines show the training/test procedures respectively). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.3 mIoU of unseen classes on COCO-Stuff ordered wrt average object size (left to right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.4 GZLSS results on COCO-Stuff and PASCAL-VOC. We report mean IoU of unseen classes, seen classes and their harmonic mean (perception model is based on ResNet101 and the semantic embedding is ft + w2v). SPNet-C represents SPNet with calibration. . . . . . . . . . . . . . . . . 120 7.5 Few-label semantic segmentation (FLSS) on COCO-Stuff and PASCAL VOC with increasing number of training samples per class, i.e. n ∈ {1, 2, 5, 10, 20}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 7.6 Generalized few-label semantic segmentation (GFLSS) on COCO-Stuff and PASCAL VOC with increasing number of training samples per class, i.e. n ∈{1, 2, 5, 10, 20}. . . . . . . . . . . . . . . . . . . . . . . . . . 122 7.7 Qualitative results of our SPNet in 0-, 1- and 5-label semantic seg- mentation settings on COCO-Stuff on 15 novel classes (color coded at the top). Base classes are masked out with black color. (a) promising results (b) failure cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
123 8.1 Leveraging the lack of class-labeled videos (time-consuming to ob- tain) with tag-labeled videos, few-shot videos and text, our 3D CNN saturates existing benchmarks and enables the more challenging gen- eralized few-shot multi-way video classification task. . . . . . . . . . . 126 list of figures 159 8.2 Our approach is composed of three steps: representation learning, few-shot learning and testing. In representation learning, we train a R(2+1)D from the random initialization or Sports1M-pretrained model on the base classes of our target dataset. In few-shot learning, given few-shot support videos from novel classes, we first retrieve a list of candidate videos for each class from YFCC100M (Thomee et al., 2015) using their tags, followed by selecting the best matching short clips from the retrieved videos using visual features. Those clips serve as additional training examples to learn classifiers that generalize to novel classes at test time. . . . . . . . . . . . . . . . . . . . . . . . . . . 130 8.3 Results of 3DFSV and R-3DFSV on both Kinetics and UCF101 in the one-shot video classification setting (FSV). In this experiment we go beyond the classical 5-way classification setting. We use 5, 10, 15 and 24 (all) of the novel classes in each testing episode. We report the top-1 accuracy of novel classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 8.4 The effect of increasing the number of retrieved clips, left: on Kinetics, right: on UCF101. Both experiments are conducted on the one-shot, five-way classification task, reporting top-1 accuracy in the few-shot video classification (FSV) setting. . . . . . . . . . . . . . . . . . . . . . . 139 8.5 Top-5 retrieved video clips from YFCC100M for 8 novel classes on Ki- netics. The left column is the class name with its one-shot query video and the right column shows the retrieved 16-frame video clips (middle frame is visualized) together with their users tags. Negative retrievals are marked in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 L I S T O F T A B L E S Tab. 3.1 The statistics of CUB, AWA and Dogs datasets in zero-shot setting. CUB and Dogs are fine-grained datasets whereas AWA is a more general concept dataset. Ytr+v and Yts are seen and unseen class embeddings respectively. . . . . . . . . . . . . . . . . . . . . . . . . 36 Tab. 3.2 The statistics of CUB, AWA and Dogs datasets in the generalized zero-shot learning setting. . . . . . . . . . . . . . . . . . . . . . . . 36 Tab. 3.3 Average per-class top-1 accuracy in zero-shot setting on AWA, CUB and Dogs datasets. We compare ESZSL (Romera-Paredes et al., 2015), ESZSL* (Romera-Paredes et al., 2015), CMT (Socher et al., 2013), SSE (Zhang and Saligrama, 2015), JLSE (Zhang and Saligrama, 2016), SJE (Akata et al., 2015c) and Latent Embedding model (K is cross-validated) using the same splits, image and class embeddings as in (Akata et al., 2015c). . . . . . . . . . . . . . 37 Tab. 3.4 Number of matrices selected using pruning (PR) and using cross- validation (CV). PR is obtained by K0 = 16. . . . . . . . . . . . . 38 Tab. 3.5 Class embeddings combined as in (Akata et al., 2015c) (cnc: early fusion of class embeddings, cmb: late fusion of scores). . . . . . . 39 Tab. 3.6 Average per-class top-1 accuracy on unseen classes (the results are averaged on five folds). SJE: (Akata et al., 2015c), LatEm: Latent embedding model (K is cross-validated). . . . . . . . . . . 41 Tab. 
3.7 Average per-class top-1 accuracy on unseen classes (averaged over five zero-shot splits that we used in the stability experiments). PR: proposed model learnt with pruning using K0 = 16, CV: with cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Tab. 3.8 Average per-class top-1, 5 and 10 accuracy, i.e. T1, T5 and T10 respectively, in generalized zero-shot learning setting when we have no samples from Yts during training, however the search space during testing includes all the available labels, i.e. namely Y = Ytr ∪Yv ∪Yts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Tab. 4.1 Statistics for SUN (Patterson and Hays, 2012), CUB (Welinder et al., 2010), AWA1 (Lampert et al., 2013), proposed AWA2, aPY (Farhadi et al., 2009) in terms of size, granularity, number of attributes, number of classes in Ytr and Yts, number of images at training and test time for standard split (SS) and our proposed splits (PS). 63 Tab. 4.2 Reproducing zero-shot results with methods that have a public implementation: O = Original results, R = Reproduced using provided image features and code. We measure top-1 accuracy in %. −: image features are not provided in the original paper for this dataset. Top: ZSL, Bottom: transductive ZSL. . . . . . . . 66 161 162 list of tables Tab. 4.3 Zero-shot learning results on SUN, CUB, AWA1, AWA2 and aPY using SS = Standard Split, PS = Proposed Split with ResNet features. The results report top-1 accuracy in %. . . . . . . . . . . 68 Tab. 4.4 Cross-dataset evaluation over AWA1 and AWA2 in zero-shot learning setting on the Proposed Splits: Left of the colon indicates the training set and right of the colon indicates the test set, e.g. AWA1:AWA2 means that the model is trained on the train set of AWA1 and evaluated on the test set of AWA2. We measure top-1 accuracy in %. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Tab. 4.5 ImageNet with different splits: 2/3 H = classes with 2/3 hops away from the Ytr of ImageNet1K, 500/1K/5K most populated classes, 500/1K/5K least populated classes, All = The remaining 20K categories of ImageNet (Yts). We measure top-1 accuracy in %. 72 Tab. 4.6 Generalized Zero-Shot Learning on Proposed Split (PS) measur- ing ts = Top-1 accuracy on Yts, tr=Top-1 accuracy on Ytr, H = harmonic mean (CMT*: CMT with novelty detection). We measure top-1 accuracy in %. . . . . . . . . . . . . . . . . . . . . . 74 Tab. 5.1 CUB, SUN, FLO, AWA datasets, in terms of number of attributes per class (att), sentences (stc), number of classes in training + validation (Ys) and test classes (Yu). . . . . . . . . . . . . . . . . . 86 Tab. 5.2 ZSL measuring per-class average Top-1 accuracy (T1) on Yu and GZSL measuring u = T1 on Yu, s = T1 on Ys, H = harmonic mean (FG=feature generator, none: no access to generated CNN features, hence softmax is not applicable). f-CLSWGAN signifi- cantly boosts both the ZSL and GZSL accuracy of all classification models on all four datasets. . . . . . . . . . . . . . . . . . . . . . . 87 Tab. 5.3 GZSL results with GoogLeNet vs ResNet-101 features on CUB (CNN: Deep Feature Encoder Network, FG: Feature Generator, u = T1 on Yu, s = T1 on Ys, H = harmonic mean, “none”= no generated features). . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Tab. 5.4 GZSL results with conditioning f-xGAN with stc and att on CUB (C: Class embedding, FG: Feature Generator, u = T1 on Yu, s = T1 on Ys, H = harmonic mean, “none”= no generated features). . 91 Tab. 
5.5 Summary Table (u = T1 on Yu, s = T1 accuracy on Ys, H = harmonic mean, class embedding = stc). “none”: ALE with no generated features. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Tab. 6.1 Ablating different generative models on CUB (using attribute class embedding and image features with no fine-tuning). ZSL: top-1 accuracy on unseen classes, GZSL: harmonic mean of seen and unseen class accuracies. . . . . . . . . . . . . . . . . . . . . . 101 list of tables 163 Tab. 6.2 Comparing with the-state-of-the-art. Top: inductive methods (IND), Bottom: transductive methods (TRAN). Fine tuning is performed only on seen class images as this does not violate the zero-shot condition. We measure top-1 accuracy (T1) in ZSL setting, Top-1 accuracy on seen (s) and unseen (s) classes as well as their harmonic mean (H) in GZSL setting. . . . . . . . . . . . 102 Tab. 7.1 Statistics of data splits for COCO-Stuff and PASCAL-VOC datasets in terms of the number of classes and the number of images in the training and test splits. . . . . . . . . . . . . . . . . . . . . . . . 117 Tab. 7.2 Effect of word embeddings: Mean IoU of unseen classes in ZLSS with different word2vec, fastText and their combination on COCO-Stuff. Both HVSL and SPNet are based on ResNet101. . . 118 Tab. 7.3 Effect of CNN architectures: ZLSS with different CNN architec- tures, i.e. VGG and ResNet101 on COCO-Stuff and PASCAL- VOC. Word embedding is the ft + w2v. . . . . . . . . . . . . . . . 118 Tab. 7.4 SPNet loss on (generalized) zero-shot learning tasks. Top-1 accu- racy on unseen classes is reported for ZSL and harmonic mean of seen and unseen classes is for GZSL. . . . . . . . . . . . . . . . 120 Tab. 8.1 Statistics of our data splits on Kinetics, UCF101 and SomethingV2 datasets. We follow the train, val, and test class splits of (Zhu and Yang, 2018) and (Cao et al., 2019) on Kinetics and SomethingV2 respectively. In addition, we add test videos (the second number under the second test column) from train classes for GFSV. We also introduce a new data split on UCF101 and for all datasets we propose 5-,10-,15-,24-way (the maximum number of test classes) and 1-,5-shot setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Tab. 8.2 Comparing with the state-of-the-art few-shot video classifica- tion methods. We report top-1 accuracy on the novel classes of Kinetics and SomethingV2 for 1-shot and 5-shot tasks (both in 5-way). 3DFSV (ours, scratch): our R(2+1)D is trained from scratch; 3DFSV (ours, pretrained): our model is trained from the Sports1M-pretrained R(2+1)D. R-3DFSV (ours, pretrained): our model with retrieved videos, trained from the Sports1M- pretrained R(2+1)D. . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Tab. 8.3 Generalized few-shot video classification results on Kinetics and UCF101 in 5-way tasks. We report top-1 accuracy on both base and novel classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Tab. 8.4 Ablation study on 5-way 1-shot video classification task on the meta-test set of Kinetics. PR: pretrain R(2+1)D on Sports1M; SS: self-supervised model of AVTS (Korbar et al., 2018); RL: representation learning on base classes; VR: retrieve unlabeled videos with tags (Thomee et al., 2015); BD: batch denoising. BC: best clip selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 B I B L I O G R A P H Y Z. Akata, M. Malinowski, M. Fritz, and B. Schiele (2016). Multi-Cue Zero-Shot Learning with Strong Supervision, in CVPR 2016. Cited on page 28. Z. Akata, F. 
Perronnin, Z. Harchaoui, and C. Schmid (2013). Label embedding for attribute-based classification, in CVPR 2013. Cited on pages 4, 17, 18, 52, and 64. Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2015a). Label-Embedding for Image Classification, IEEE TPAMI. Cited on pages 27, 28, 29, 30, 31, 32, 36, 48, 55, 56, 57, 59, 60, 61, 63, 67, 68, 69, 71, 72, 75, 77, 84, 92, 115, 121, 128, and 146. Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015b). Evaluation of output em- beddings for fine-grained image classification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015. Cited on pages 4, 16, 18, 27, 110, 112, and 121. Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015c). Evaluation of Output Embeddings for Fine-Grained Image Classification, in CVPR 2015. Cited on pages 28, 29, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 41, 52, 55, 56, 63, 66, 67, 68, 70, 71, 72, 75, 77, 84, 146, 155, and 161. Z. Al-Halah, M. Tapaswi, and R. Stiefelhagen (2016). Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning, in CVPR 2016. Cited on page 18. R. Arandjelovic and A. Zisserman (2013). All about VLAD, in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition 2013. Cited on page 15. M. Arjovsky and L. Bottou (2017). Towards principled methods for training genera- tive adversarial networks, ICLR. Cited on pages 20, 81, 83, 96, and 97. M. Arjovsky, S. Chintala, and L. Bottou (2017). Wasserstein gan, ICML. Cited on pages 20, 81, 82, and 100. V. Badrinarayanan, A. Kendall, and R. Cipolla (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation, TPAMI. Cited on page 111. A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran (2018). Zero-Shot Object Detection, in ECCV 2018. Cited on pages 24 and 115. E. Bart and S. Ullman (2005). Single-example learning of novel classes using repre- sentation by similarity, in BMVC 2005. Cited on page 28. R. H. Bartels and G. Stewart (1972). Solution of the matrix equation AX+ XB= C [F4], Commun. ACM, vol. 15(9), pp. 820–826. Cited on page 56. 165 166 bibliography H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool (2008). Speeded-up robust features (SURF), Computer vision and image understanding, vol. 110(3), pp. 346–359. Cited on pages 14 and 149. A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei (2016). What’s the point: Semantic segmentation with point supervision, in ECCV 2016. Cited on pages 110 and 111. A. Bendale and T. E. Boult (2016). Towards Open Set Deep Networks, in CVPR 2016. Cited on page 54. A. Berg, J. Deng, and L. Fei-Fei (). ILSVRC 2010, http://www.image- net.org/challenges/LSVRC/2010/index. Cited on page 47. C. M. Bishop (2006). Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg. Cited on page 3. M. Bucher, S. Herbin, and F. Jurie (2016). Improving Semantic Embedding Consis- tency by Metric Learning for Zero-Shot Classiffication, in ECCV 2016. Cited on page 28. M. Bucher, S. Herbin, and F. Jurie (2017). Generating Visual Representations for Zero-Shot Classification, ICCV Workshop. Cited on pages 20, 82, 95, and 98. M. Bucher, V. Tuan-Hung, M. Cord, and P. Pérez (2019). Zero-Shot Semantic Segmen- tation, in Advances in Neural Information Processing Systems 2019. Cited on page 25. H. Caesar, J. Uijlings, and V. Ferrari (2016). Region-based semantic segmentation with end-to-end training, in ECCV 2016. Cited on page 24. H. Caesar, J. 
Uijlings, and V. Ferrari (2018). COCO-Stuff: Thing and Stuff Classes in Context, in CVPR 2018. Cited on page 116. K. Cao, J. Ji, Z. Cao, C.-Y. Chang, and J. C. Niebles (2019). Few-shot video classifica- tion via temporal alignment, arXiv preprint arXiv:1906.11415. Cited on pages 25, 126, 127, 129, 133, 134, 135, 136, and 163. J. Carreira and A. Zisserman (2017). Quo vadis, action recognition? a new model and the kinetics dataset, in CVPR 2017. Cited on pages 135 and 143. S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha (2016). Synthesized Classifiers for Zero-Shot Learning, in CVPR 2016. Cited on pages 19, 52, 58, 59, 63, 64, 66, 67, 68, 69, 70, 71, 72, 73, 75, 77, 93, 110, 121, and 146. S. Changpinyo, W.-L. Chao, and F. Sha (2017). Predicting visual exemplars of unseen classes for zero-shot learning, in Proceedings of the IEEE international conference on computer vision 2017. Cited on page 19. bibliography 167 W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha (2016). An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild, in ECCV 2016. Cited on pages 54, 80, 115, and 120. O. Chapelle, B. Scholkopf, and A. Zien (2009). Semi-supervised learning, IEEE Transactions on Neural Networks, vol. 20(3), pp. 542–542. Cited on page 59. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002). SMOTE: synthetic minority over-sampling technique, Journal of artificial intelligence research. Cited on pages 5 and 79. N. V. Chawla, N. Japkowicz, and A. Kotcz (2004). Editorial: Special Issue on Learning from Imbalanced Data Sets, SIGKDD Explor. Newsl.. Cited on page 5. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, TPAMI. Cited on pages 24, 111, 112, 116, 118, 122, and 143. Q. Chen and V. Koltun (2017). Photographic Image Synthesis with Cascaded Refine- ment Networks, in ICCV 2017. Cited on page 80. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020). A simple framework for contrastive learning of visual representations, arXiv preprint arXiv:2002.05709. Cited on pages 4 and 150. W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang (2019). A Closer Look at Few-shot Classification, in International Conference on Learning Representations 2019. Cited on pages 22, 126, 128, 130, and 135. X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016). InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, in NIPS 2016. Cited on page 81. K. Crammer and Y. Singer (2002). On the Learnability and Design of Output Codes for Multiclass Problems, ML. Cited on page 35. J. Deng, W. Dong, R., L.-J. Li, K. Li, and L. Fei-Fei (2009). ImageNet: A Large-Scale Hierarchical Image Database, in CVPR 2009. Cited on pages 7, 47, 52, 60, 62, 86, 92, and 101. J. Deng, J. Krause, and L. Fei-Fei (2013). Fine-Grained Crowdsourcing for Fine- Grained Recognition, in CVPR 2013. Cited on page 35. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805. Cited on page 150. 168 bibliography Z. Ding, M. Shao, and Y. Fu (2017). Low-Rank Embedded Ensemble Semantic Dictionary for Zero-Shot Learning, in CVPR 2017. Cited on pages 52 and 112. P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie (2005). Behavior recognition via sparse spatio-temporal features. 
Cited on page 128. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014). DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition., in ICML 2014. Cited on page 52. N. Dong and E. P. Xing (2018). Few-Shot Semantic Segmentation with Prototype Learning. Cited on pages 24, 25, 111, and 112. A. Dosovitskiy and T. Brox (2016a). Generating images with perceptual similarity metrics based on deep networks, in NIPS 2016. Cited on page 106. A. Dosovitskiy and T. Brox (2016b). Inverting visual representations with convolu- tional networks, in CVPR 2016. Cited on page 106. M. Douze, A. Szlam, B. Hariharan, and H. Jégou (2018). Low-shot learning with large-scale diffusion, in CVPR 2018. Cited on page 127. K. Duan, D. Parikh, D. J. Crandall, and K. Grauman (2012). Discovering localized attributes for fine-grained recognition, in CVPR 2012. Cited on pages 28 and 35. S. K. Dwivedi, V. Gupta, R. Mitra, S. Ahmed, and A. Jain (2019). ProtoGAN: Towards Few Shot Learning for Action Recognition, arXiv preprint arXiv:1909.07945. Cited on page 128. M. Elhoseiny, B. Saleh, and A. Elgammal (). Write a classifier: Zero-shot learning using purely textual descriptions. Cited on page 17. M. Elhoseiny, B. Saleh, and A. Elgammal (2013). Write a classifier: Zero-shot learning using purely textual descriptions, in Proceedings of the IEEE International Conference on Computer Vision 2013. Cited on page 19. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html. Cited on page 116. A. Farhadi, I. Endres, and D. Hoiem (2010). Attribute-centric recognition for cross- category generalization, in CVPR 2010. Cited on page 28. A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth (2009). Describing objects by their attributes, CVPR. Cited on pages 7, 18, 28, 30, 52, 53, 60, 63, and 161. C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019). Slowfast networks for video recognition, in Proceedings of the IEEE International Conference on Computer Vision 2019. Cited on page 129. bibliography 169 C. Feichtenhofer, A. Pinz, and R. Wildes (2016a). Spatiotemporal residual networks for video action recognition, in Advances in neural information processing systems 2016. Cited on page 128. C. Feichtenhofer, A. Pinz, and R. P. Wildes (2017). Spatiotemporal multiplier networks for video action recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition 2017. Cited on page 25. C. Feichtenhofer, A. Pinz, and A. Zisserman (2016b). Convolutional two-stream network fusion for video action recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition 2016. Cited on pages 24, 25, 128, and 129. R. Felix, V. K. B. G, I. Reid, and G. Carneiro (2018a). Multi-modal Cycle-consistent Generalized Zero-Shot Learning, in ECCV 2018. Cited on pages 95, 96, 98, and 102. R. Felix, V. B. Kumar, I. Reid, and G. Carneiro (2018b). Multi-modal cycle-consistent generalized zero-shot learning, in Proceedings of the European Conference on Computer Vision (ECCV) 2018. Cited on pages 19 and 20. P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan (2010). Object Detection with Discriminatively Trained Part Based Models, PAMI. Cited on pages 29 and 31. V. Ferrari and A. Zisserman (2007). Learning Visual Attributes, in NIPS 2007. Cited on page 28. C. Finn, P. Abbeel, and S. Levine (2017). 
Model-agnostic meta-learning for fast adaptation of deep networks, in ICML 2017. Cited on pages 22, 23, and 128. C. Finn, K. Xu, and S. Levine (2018). Probabilistic model-agnostic meta-learning, in Advances in Neural Information Processing Systems 2018. Cited on page 151. A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov (2013). Devise: A deep visual-semantic embedding model, in NIPS 2013. Cited on pages 27, 28, 29, 32, 52, 54, 55, 67, 68, 69, 70, 71, 72, 75, 77, 84, 98, 115, and 146. Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong (2014). Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation, in ECCV 2014. Cited on pages 19 and 20. Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong (2015a). Transductive Multi-view Zero-Shot Learning, TPAMI. Cited on page 98. Y. Fu and L. Sigal (2016). Semi-Supervised Vocabulary-Informed Learning, in CVPR 2016. Cited on page 28. Z. Fu, T. Xiang, E. Kodirov, and S. Gong (2015b). Zero-Shot Object Recognition by Semantic Manifold Distance, in CVPR 2015. Cited on page 28. 170 bibliography L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen (2017). Video captioning with attention-based LSTM and semantic consistency, IEEE Transactions on Multimedia, vol. 19(9), pp. 2045–2055. Cited on page 25. S. Garcia and F. Herrera (2008). An Extension on“Statistical Comparisons of Classi- fiers over Multiple Data Sets”for all Pairwise Comparisons, JLMR. Cited on page 69. A. Geiger, P. Lenz, and R. Urtasun (2012). Are we ready for autonomous driving? the kitti vision benchmark suite, in 2012 IEEE Conference on Computer Vision and Pattern Recognition 2012. Cited on page 25. D. Ghadiyaram, D. Tran, and D. Mahajan (2019). Large-scale weakly-supervised pre-training for video action recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019. Cited on page 129. R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell (2017). Actionvlad: Learning spatio-temporal aggregation for action classification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017. Cited on page 129. R. Girshick (2015). Fast r-cnn, in Proceedings of the IEEE international conference on computer vision 2015. Cited on page 24. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative Adversarial Nets, in NIPS 2014. Cited on pages 20, 80, 81, 82, 89, 94, 97, 98, and 147. J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. E. Turner (2018). Meta-learning probabilistic inference for prediction, arXiv preprint arXiv:1805.09921. Cited on page 151. R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017). The” Something Something” Video Database for Learning and Evaluating Visual Common Sense., in ICCV 2017. Cited on page 132. A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola (2007). A kernel method for the two-sample-problem, in NIPS 2007. Cited on page 98. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017). Improved training of wasserstein gans, arXiv preprint arXiv:1704.00028. Cited on pages 81, 83, 87, 94, 97, and 147. B. Hariharan and R. Girshick (2017). Low-shot Visual Recognition by Shrinking and Hallucinating Features, ICCV. Cited on pages 21, 23, 82, 104, 105, 127, and 128. T. Hastie, R. Tibshirani, and J. Friedman (2008). The Elements of Statistical Learning (2nd Ed.), Springer. Cited on page 28. 
bibliography 171 K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019). Momentum contrast for unsupervised visual representation learning, arXiv preprint arXiv:1911.05722. Cited on page 150. K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017). Mask r-cnn, in Proceedings of the IEEE international conference on computer vision 2017. Cited on page 24. K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep Residual Learning for Image Recognition, in CVPR 2016. Cited on pages 2, 25, 28, 53, 63, 86, 116, 118, 136, 143, and 155. L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell (2016). Generating visual explanations, in ECCVVision 2016. Cited on page 107. G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten, and K. Weinberger (2019). Convolu- tional Networks with Dense Connectivity, IEEE transactions on pattern analysis and machine intelligence. Cited on page 15. S. Hussain and B. Triggs (2010). Feature Sets and Dimensionality Reduction for Visual Object Detection, in BMVC 2010. Cited on pages 29 and 31. L. Jain, W. Scheirer, and T. Boult (2014). Multi-Class Open Set Recognition Using Probability of Inclusion, in ECCV 2014. Cited on page 54. D. Jayaraman and K. Grauman (2014). Zero-shot recognition with unreliable at- tributes, in NIPS 2014. Cited on page 18. J. Ji, S. Buch, A. Soto, and J. C. Niebles (2018a). End-to-End Joint Semantic Segmen- tation of Actors and Actions in Video, in ECCV 2018. Cited on page 111. Z. Ji, Y. Fu, J. Guo, Y. Pang, Z. M. Zhang, et al. (2018b). Stacked semantics-guided at- tention model for fine-grained zero-shot learning, in Advances in Neural Information Processing Systems 2018. Cited on page 18. T. Joachims (2002). Optimizing search engines using clickthrough data, in ACM SIGKDD 2002. Cited on page 55. A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016a). Fast- text. zip: Compressing text classification models, arXiv preprint arXiv:1612.03651. Cited on pages 16, 112, 116, and 134. A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017). Bag of Tricks for Efficient Text Classification, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers 2017. Cited on page 131. A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache (2016b). Learning visual features from large weakly supervised data, in European Conference on Computer Vision 2016. Cited on page 128. 172 bibliography Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio (2017). Learning to remember rare events, arXiv preprint arXiv:1703.03129. Cited on page 135. M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing (2019). Rethinking knowledge graph propagation for zero-shot learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019. Cited on page 19. B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019). Few-shot object detection via feature reweighting, in Proceedings of the IEEE International Conference on Computer Vision 2019. Cited on page 24. P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa (2012). Online incremental attribute-based zero-shot learning, in CVPR 2012. Cited on pages 28 and 35. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014). Large-scale video classification with convolutional neural networks, in CVPR 2014. Cited on pages 24, 128, 129, and 134. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017). 
Title Page
Abstract
Zusammenfassung
Acknowledgements
Contents
1 Introduction
  1.1 Challenges of learning from limited labeled data
    1.1.1 Zero-shot image classification
    1.1.2 Few-shot image classification
    1.1.3 Zero-shot and few-shot learning tasks beyond image classification
  1.2 Contributions of the thesis
    1.2.1 Contributions to zero-shot image classification
    1.2.2 Contributions to few-shot image classification
    1.2.3 Contributions to zero-shot and few-shot tasks beyond image classification
  1.3 Outline of the thesis
2 Related work
  2.1 Zero-shot image classification
    2.1.1 Problem definition
    2.1.2 Evaluation protocol
    2.1.3 A literature review of zero-shot approaches
    2.1.4 Relations to our work
  2.2 Few-shot image classification
    2.2.1 Problem definition
    2.2.2 Evaluation protocols
    2.2.3 A literature review of few-shot approaches
    2.2.4 Relations to our work
  2.3 Zero-shot and few-shot tasks beyond image classification
    2.3.1 Semantic image segmentation
    2.3.2 Video action recognition
    2.3.3 Relations to our work
3 Latent Embedding for Zero-Shot Image Classification
  3.1 Introduction
  3.2 Background: Bilinear Joint Embeddings
  3.3 Latent Embeddings Model (LatEm)
    3.3.1 Objective
    3.3.2 Optimization
    3.3.3 Model selection
    3.3.4 Discussion
  3.4 Experiments
    3.4.1 Zero-shot Learning Experiments
    3.4.2 Generalized Zero-shot Learning Setting
  3.5 Conclusions
4 Zero-Shot Learning: the Good, the Bad and the Ugly
  4.1 Introduction
  4.2 Related Work
  4.3 Evaluated Methods
    4.3.1 Learning Linear Compatibility
    4.3.2 Learning Nonlinear Compatibility
    4.3.3 Learning Intermediate Attribute Classifiers
    4.3.4 Hybrid Models
    4.3.5 Transductive Zero-Shot Learning Setting
  4.4 Datasets
    4.4.1 Attribute Datasets
    4.4.2 Large-Scale ImageNet
  4.5 Evaluation Protocol
    4.5.1 Image and Class Embedding
    4.5.2 Dataset Splits
    4.5.3 Evaluation Criteria
  4.6 Experiments
    4.6.1 Zero-Shot Learning Experiments
    4.6.2 Generalized Zero-Shot Learning Results
    4.6.3 Transductive (Generalized) Zero-Shot Learning
  4.7 Conclusion
5 Feature Generating Networks for Zero-Shot Image Classification
  5.1 Introduction
  5.2 Related work
  5.3 Feature Generation & Classification in ZSL
    5.3.1 Feature Generation
    5.3.2 Classification
  5.4 Experiments
    5.4.1 Comparing with State-of-the-Art
    5.4.2 Analyzing f-xGAN Under Different Conditions
    5.4.3 Large-Scale Experiments
    5.4.4 Feature vs Image Generation
  5.5 Conclusion
6 Enhanced Feature Generation Frameworks for Low-Shot Learning
  6.1 Introduction
  6.2 Related Work
  6.3 f-VAEGAN-D2 Model
    6.3.1 Baseline Feature Generating Models
    6.3.2 Our f-VAEGAN-D2 Model
  6.4 Experiments
    6.4.1 (Generalized) Zero-shot Learning
    6.4.2 (Generalized) Few-shot Learning
    6.4.3 Interpreting Synthesized Features
  6.5 Conclusion
7 Zero-Label and Few-Label Semantic Segmentation
  7.1 Introduction
  7.2 Related Works
  7.3 Approach
    7.3.1 Semantic Projection Network (SPNet)
    7.3.2 Baseline: Hinge Visual-Semantic Loss (HVSL)
  7.4 Experiment
    7.4.1 Zero-Label Semantic Segmentation Task
    7.4.2 Few-Label Semantic Segmentation Task
    7.4.3 Qualitative Results
  7.5 Conclusions
8 Generalized Many-Way Few-Shot Video Classification
  8.1 Introduction
  8.2 Related work
  8.3 R-3DFSV Approach
    8.3.1 3D CNN for FSV (3DFSV)
    8.3.2 Retrieval-enhanced 3DFSV (R-3DFSV)
  8.4 Experiments
    8.4.1 Experimental settings
    8.4.2 Comparing with the state-of-the-art
    8.4.3 Increasing the number of classes in FSV
    8.4.4 Evaluating base and novel classes in GFSV
    8.4.5 Ablation study and retrieved clips
    8.4.6 Qualitative results
  8.5 Conclusion
9 Conclusions and future perspectives
  9.1 Discussion of contributions
  9.2 Future Perspectives
    9.2.1 Zero-shot image classification
    9.2.2 Few-shot image classification
    9.2.3 Zero-shot and few-shot learning beyond image classification
    9.2.4 A broader view on the topic
List of Figures
List of Tables
Bibliography