Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition

Hamidreza Kasaei and Songsong Xiong
Date: 2022-05-04

Abstract: Service robots are integrating more and more into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem. In particular, we form ensemble methods based on deep representations and handcrafted 3D shape descriptors. To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly. The proposed model is suitable for open-ended learning scenarios where the number of 3D object categories is not fixed and can grow over time. We have performed extensive sets of experiments to assess the performance of the proposed approach in offline and open-ended scenarios. For evaluation purposes, in addition to real object datasets, we generate a large synthetic household objects dataset consisting of 27000 views of 90 objects. Experimental results demonstrate the effectiveness of the proposed method on 3D object recognition tasks, as well as its superior performance over state-of-the-art approaches. Additionally, we demonstrate the effectiveness of our approach in both simulated and real-robot settings, where the robot rapidly learns new categories from limited examples.

Index Terms: Few-shot learning, lifelong learning, ensemble learning, 3D object recognition, service robots

Nowadays, many countries are facing labor shortages due to the aging of the population and the COVID-19 pandemic.
Therefore, an ever-increasing amount of attention has been focused on service robots as a way to alleviate these shortages. Intelligent robots require a variety of functionalities, such as perception, manipulation, and navigation, in order to safely interact with human users and environments [1]. Among these functionalities, we believe three-dimensional (3D) object perception plays a significant role, because such robots need to recognize which objects exist in the environment and where they are in order to help users in various tasks. As an example, consider a robot-assisted packaging scenario, where a robot is working alongside a human user (Fig. 1). In such scenarios, the user may ask the robot to hand over a specific object. To accomplish this task, the robot should be able to represent the current state of the environment in terms of objects' poses and category labels, and then plan a collision-free trajectory to grasp the target object, pick it up, and deliver it to the user.

Fig. 1. An example of a robot-assisted packaging scenario: (left) the dual-arm robot perceives the environment through an RGB-D camera, plans a collision-free trajectory to grasp the target object, and delivers it to the user; (right) in order to successfully hand over a target object to a user, the robot should have a clear understanding of its current configuration (i.e., joint poses) and recognize all objects that are not part of the robot's body.

In such collaborative settings, the robot often faces new objects over time, and it is not feasible to assume that all object categories can be pre-programmed in advance. Furthermore, the human co-worker expects the robot to easily adapt to different tasks by learning a new set of object categories from a few instructions. In such lifelong robot learning scenarios, object representation is crucial because the output of this module is used for both learning and recognition [2].

Most object representation approaches encode either geometric features [3] or textural features [4] of an object; only a few works encode both geometric and textural features simultaneously [5]. Each of these approaches has its own pros and cons. To benefit from all of them, an ensemble learning method can be developed, in which multiple object recognition methods, each based on an individual representation, are trained and their predictions are combined [6], [7]. Ensemble learning not only reduces the misclassification rate but also provides predictions that are superior to those of single models. The cost of ensemble learning, however, increases with the number of integrated approaches. Therefore, a trade-off between recognition accuracy and computation time should be considered; otherwise, the approach becomes computationally untenable for robotics applications.

In this paper, we present a robust lifelong ensemble learning method that enables robots to learn new object categories over time using few-shot training instances. In particular, we consider multiple representations to encode different features of the objects. Instead of concatenating all features and training a single model, we train a model for each object representation. The category label of the target object is finally determined by the majority vote of the contributing models. Figure 2 shows an overview of our work. To assess the performance of the proposed approach, we perform extensive sets of experiments concerning recognition accuracy, scalability, and robustness to Gaussian noise and downsampling.
Experimental results show that our approach outperformed the selected state-of-the-art approaches in terms of classification accuracy and scalability. Additionally, the proposed approach demonstrated better robustness than other approaches under low- and mid-level noise and downsampling conditions. In summary, our key contributions are threefold:
• We proposed a lifelong ensemble learning method based on multiple deep and handcrafted representations for few-shot object recognition.
• We generated a large synthetic household objects dataset that consists of 27000 views of 90 objects. To the benefit of the research community, we make the dataset publicly available online.
• We extensively evaluated the performance of the proposed approach in offline, open-ended, and robotics scenarios. Furthermore, we assessed the robustness of our method against various levels of Gaussian noise and downsampling.

The development of robust 3D object representations has been a subject of research in the computer vision, machine learning, and robotics communities. In general, object descriptors fall into two main categories: local and global descriptors. Local descriptors encode a small area of an object around a specific key point, while global descriptors describe the entire 3D object [8]. Global descriptors are generally more efficient than local descriptors in terms of computation time and memory usage, which makes them more suitable for real-time robotic applications. When it comes to robustness to occlusion and clutter, however, local descriptors perform better than global descriptors [9].

Classical studies focused on handcrafted 3D object descriptors that capture the geometric essence of objects and generate a compact and uniform representation for a given 3D object [10]. For instance, Wohlkinger and Vincze [11] proposed the Ensemble of Shape Functions (ESF) to encode the geometrical characteristics of an object using an ensemble of ten 64-bin histograms of angles, point distances, and area shape functions. In [12], the Viewpoint Feature Histogram (VFH) was presented for 3D object classification; it encodes geometrical information and viewpoint. Kasaei et al. [9] introduced an object descriptor called the Global Orthographic Object Descriptor (GOOD), built to be robust, descriptive, and efficient to compute and use. For a detailed review of handcrafted object descriptors, we refer the reader to the comprehensive review by Carvalho and Wangenheim [10].

In addition to handcrafted descriptors, there are several (Bayesian or deep) learning-based approaches for encoding different properties of objects. In the case of Bayesian learning-based features, Kasaei et al. [13] extended Latent Dirichlet Allocation (LDA) and introduced Local-LDA, which is able to learn structural-semantic features for each category independently. Ayoubi et al. [14] extended Local-LDA and introduced Local-HDP, a non-parametric hierarchical Bayesian approach for 3D object categorization. In particular, such LDA-based approaches combine the advantages of handcrafted features with those of learning-based structural-semantic features. Deep learning based object representation methods can learn a descriptive representation of objects in either a supervised [15] or unsupervised [16] fashion. These approaches can be categorized into three groups according to their input: volume-based, point-based, and view-based approaches.
Several studies have shown that, among these approaches, view-based methods are the most effective for 3D object recognition [17]. In particular, view-based methods render different 2D images of a 3D object by projecting the object's points onto 2D planes [18]. The obtained views are then used to train the network [19]. These methods often require a large amount of data and a long training time to achieve satisfactory results. To overcome these limitations, transfer learning approaches have been suggested [20], [21]. In this paper, we consider both handcrafted descriptors and multi-view deep transfer-learning-based representations in the context of lifelong ensemble learning for 3D object recognition.

The performance of deep 3D object recognition approaches is heavily dependent on the quantity and quality of training data, and the process of annotating data is labor-intensive and expensive. Catastrophic interference/forgetting is another important limitation of deep approaches [22], [23]. Furthermore, the scarcity of data in some categories severely limits the application of deep learning methods. Several active, few-shot, and ensemble learning approaches have been proposed to overcome these limitations.

Most active learning (AL) methods aim to update the known model and achieve a certain recognition accuracy by seeking a minimal set of training examples [24]-[28]. In particular, a subset of training examples is first selected from the unlabeled pool of data using an acquisition function based on either uncertainty measures or density/geometric similarity measures in feature space. The samples are then labeled by an oracle. Lastly, the new information is incrementally incorporated into the model, or the model is re-trained from scratch, without causing catastrophic forgetting [29], [30]. These approaches are incremental rather than lifelong, since the number of categories is predefined in advance.

In the case of few-shot learning, the agent initially has access to a large training dataset, D_train, to learn a representation function, f. During the testing phase, the agent is given a new dataset, D_few, consisting of samples from a new out-of-the-box distribution, to learn and recognize a set of new object categories. Note that D_train has no known relationship to D_few (i.e., cross-domain learning). Several studies have shown that the best approach to few-shot learning is to use a high-quality feature extractor rather than complicated meta-learning (i.e., learning to learn) algorithms [31]-[33]. While we agree with this point, we believe that a single representation is insufficient to cover a variety of distributions, and that performance can be improved with an ensemble of multiple (deep and handcrafted) representations. Some works address ensembles of deep networks for few-shot classification [34]-[36]. These approaches are computationally very expensive in both the training and testing phases; hence, it is challenging to use them in real-time robotic applications. One way to reduce overhead at test time is to train another network to imitate the behavior of the ensemble (i.e., distillation of an ensemble into a single model [37]), but applying distillation in lifelong learning scenarios is very challenging. Unlike these approaches, we propose lifelong ensemble learning based on multiple handcrafted and deep representations to tackle real-time 3D object recognition tasks.
As shown in Fig. 2, our approach receives the point cloud of an object and produces multiple handcrafted and deep representations for the given object. The obtained representations are then used to train different classifiers in a lifelong learning setting. As opposed to sampling and labeling the training data in advance, we propose to iteratively and adaptively determine which training instance should be labeled next. In particular, we follow an active learning scenario, in which a human user is involved in the learning loop to teach new categories or to provide feedback in an online manner. In our approach, all classifiers are equivalent, and a majority voting scheme is used as the aggregation function (each classifier votes for one hypothesis). In the following subsections, we discuss our approach in detail.

A point cloud of an object consists of a set of points, p_i, i ∈ {1, ..., n}, where each point is represented by its 3D coordinates [x, y, z] and RGB information [38]. We intend to encode both the textural and geometrical features of objects using handcrafted shape descriptors and deep transfer learning approaches. Among all possible handcrafted 3D shape descriptors, we consider the GOOD [9], ESF [11], VFH [12], and GRSD [39] descriptors to encode the geometrical properties of the object. All of these descriptors have shown very good descriptive power, scalability, and efficiency (concerning both memory footprint and computation time) for real-time robotics applications.

To encode the texture of the object, we first render an orthographic view of the object. Towards this goal, we construct a local reference frame for the given object. In particular, the geometric center of the object is first computed by arithmetically averaging all points of the object. Then, we apply principal component analysis (PCA) on the point cloud of the object to find three eigenvectors, [v_1, v_2, v_3], sorted by their eigenvalues, λ_1 ≥ λ_2 ≥ λ_3. The eigenvectors v_1, v_2, v_3 are taken as the X, Y, Z axes of the object, respectively. The object is finally transformed into the obtained reference frame. To render 2D RGB-D images from the 3D object, we set a "virtual" camera such that the Z axis of the camera points towards the centroid of the object; the Z axis of the camera is perpendicular to the first two axes of the object and parallel to the third axis. Finally, a ray-tracing technique is used to render depth and RGB images of the object, regardless of how accurate/complete the point cloud of the object is. The obtained images are then fed into a deep Convolutional Neural Network (CNN) to extract a deep representation of the object. We first remove the topmost classification layers of the network, and then pass both the RGB and depth views of the object through the network to obtain an embedding for each image. The obtained representations are then concatenated to form a deep representation of the object and used to train a classifier. An illustrative example of this procedure is shown in Fig. 2, where we generated orthographic views for the scissors.
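The reference-frame construction described above is straightforward to implement. The following sketch is our own illustration (not the authors' code); it assumes the point cloud is a NumPy array of shape N×3 (coordinates only) and omits the axis-sign disambiguation that descriptors such as GOOD additionally perform. The orthographic RGB-D rendering from the virtual camera would then operate on the transformed cloud.

```python
import numpy as np

def object_reference_frame(points: np.ndarray):
    """Compute the geometric center and PCA axes of an N x 3 point cloud.

    The eigenvectors of the covariance matrix, sorted by decreasing
    eigenvalue, serve as the object's X, Y, Z axes.
    """
    center = points.mean(axis=0)            # geometric center (mean of all points)
    cov = np.cov((points - center).T)       # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder so lambda_1 >= lambda_2 >= lambda_3
    axes = eigvecs[:, order]                # columns are v1, v2, v3
    # Note: the eigenvector sign ambiguity is not resolved here.
    return center, axes

def transform_to_reference_frame(points: np.ndarray) -> np.ndarray:
    """Express the object in its own reference frame (translate, then rotate)."""
    center, axes = object_reference_frame(points)
    return (points - center) @ axes         # coordinates along v1, v2, v3
```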
For many robotics applications, it is crucial that the robot be able to learn and recognize objects from a few training examples [40]. The problem becomes even more complicated when the robot must learn new categories online, while still maintaining its recognition performance on previously learned categories. Consider an example where a robot faces a new object while working in the environment. In such situations, the robot should be able to learn the new category on-site by observing a few examples of the object, without accessing the old training data or re-training from scratch. This point can be addressed through lifelong learning, where the number of categories increases over time based on the robot's observations and interactions with humans. We have tackled this problem by training a set of independent Instance-Based Learning (IBL) models based on multiple representations. In particular, we form a category as a set of known instances, {o_1, ..., o_n}, where o_i is the representation of an object view. The robot can learn a new category or update the model of an existing category by interacting with a user. Although the training phase is very fast, the robot should avoid storing redundant instances, as this increases memory usage and slows down the recognition process. Similar to [28], we follow a user-centered labeling strategy, where the teach and correct actions of the user trigger the robot to store a new instance of a specific category in memory. Note that the robot could also learn object categories in a self-supervised manner: if the dissimilarity of the given object to all known categories is above a certain threshold, the robot infers that the object does not belong to the m known categories and thus initiates a new object category called "category m + 1". The problem is that such a category label is not meaningful for a human user, and the user cannot naturally instruct the robot to perform a task, such as "bring me a cup of tea, please!".

For recognition purposes, we use a K-nearest-neighbors classifier [41]. Therefore, each IBL approach can be considered as a combination of a category model and a dissimilarity measure. Since objects are represented as global features (histograms), the dissimilarity between two objects can be computed by various distance functions. In this paper, we investigate the effect of 10 different distance functions on object recognition performance. A majority voting scheme is then used to combine the predictions of the ensemble members into the final prediction. In the event of a tie vote, the class that has the minimum distance to the query object is selected as the winner (see Fig. 2).
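To make this concrete, the sketch below is a minimal illustration of one IBL ensemble member and the voting rule, under our own simplifying assumptions: feature extraction happens elsewhere, each member receives its own representation of the same object, and `distance_fn` is any of the distance functions discussed above (e.g., Euclidean). The class and function names are ours, not the paper's.

```python
import numpy as np
from collections import defaultdict

class IBLLearner:
    """One ensemble member: an instance-based learner over a single
    object representation, classifying queries with a k-NN rule."""

    def __init__(self, distance_fn, k=3):
        self.distance_fn = distance_fn
        self.k = k
        self.instances = defaultdict(list)  # category label -> stored feature vectors

    def teach(self, category, feature):
        # Triggered by the user's teach/correct actions (user-centered labeling).
        self.instances[category].append(np.asarray(feature, dtype=float))

    def classify(self, query):
        """Return (predicted category, distance to its nearest stored instance)."""
        query = np.asarray(query, dtype=float)
        scored = [(self.distance_fn(query, f), cat)
                  for cat, feats in self.instances.items() for f in feats]
        scored.sort(key=lambda t: t[0])
        top = scored[:self.k]
        counts = defaultdict(int)
        for _, cat in top:
            counts[cat] += 1
        label = max(counts, key=counts.get)
        nearest = min(d for d, cat in top if cat == label)
        return label, nearest

def ensemble_predict(learners, features):
    """Majority vote over ensemble members (one feature per member).
    A tie is broken by the smallest nearest-neighbor distance, as in the
    paper's tie-breaking rule (distances are compared as given)."""
    votes = defaultdict(list)
    for learner, feature in zip(learners, features):
        label, dist = learner.classify(feature)
        votes[label].append(dist)
    return max(votes.items(), key=lambda kv: (len(kv[1]), -min(kv[1])))[0]
```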
We performed several experiments to assess the performance of the proposed approach. In this section, we first present a detailed description of the datasets used for evaluation purposes, followed by analyses of various rounds of experiments in offline and open-ended settings. Additionally, we demonstrate the real-time performance of our approach through a series of robotic demonstrations.

As shown in Fig. 3 (top row), we developed a simulation environment in Gazebo to record a large synthetic object dataset. Towards this goal, we considered 90 simulated household objects, imported from different resources (e.g., the YCB dataset [24], the Gazebo repository, etc.). It should be noted that this is a very challenging dataset for object recognition tasks, since we include both basic-level object categories (i.e., objects that are not similar to each other, such as apple vs. book) and fine-grained ones (objects that are very similar to each other, such as spoon vs. fork). Furthermore, there are several objects with the same geometry but different textures, and vice versa. To extract partially visible point clouds of an object, we rotate and move the object along a rose trajectory in front of the camera and record 300 views of the object (Fig. 3, lower row). The obtained 27000 partial views are then organized into 90 object categories. The synthetic object dataset is used not only to select the optimal parameters for the proposed method, but also to assess the performance of other state-of-the-art approaches for comparison purposes. Furthermore, we used two real object datasets, one large-scale and one small-scale: the Washington RGB-D object dataset [42] and the Restaurant object dataset [43]. The former contains 250000 views of 300 common household objects, organized into 51 categories, whereas the latter has only 350 views of objects, categorized into 10 categories with significant intra-class variations. This makes the latter a good choice for conducting an extensive set of offline experiments.

Several experiments were carried out to evaluate the performance of the proposed approach concerning descriptiveness, scalability, robustness to Gaussian noise and downsampling, and computational efficiency. In this round of experiments, we used 10-fold cross-validation as the evaluation protocol [44]. In each iteration, one fold is used as test data and the remaining nine folds are used for training. The cross-validation process is repeated 10 times, so that each fold is used exactly once as test data.

1) Evaluation of individual representations: We evaluated the recognition accuracy of several handcrafted descriptors and deep learning representations on the Restaurant object dataset to find a set of good IBL methods with which to construct our ensemble approach. For the handcrafted approaches, we considered the ESF [11], VFH [12], GOOD [9], and GRSD [39] descriptors; for the deep learning representations, we evaluated ResNet50 [45], Inception [46], DenseNet [47], Inception-ResNet (Incept-Res) [48], and VGG19 [49]. Each of these descriptors has its own parameters that should be optimized to provide a good balance among recognition accuracy, memory usage, and computation time. Furthermore, as mentioned earlier, the distance function plays a significant role in IBL methods. Therefore, we evaluated the impact of 10 different distance functions, including Cosine, Gower, Motyka, Euclidean, Dice, Sorensen, Pearson, Neyman, Bhattacharyya (Bhatta.), and KL-divergence (KL div.), on recognition accuracy. We use the Average Class Accuracy (ACA) to decide which configuration performed better. In particular, ACA = (1/K) Σ_{k=1}^{K} acc_k, where K is the number of classes in the dataset and the accuracy of each class is calculated as (# true predictions) / (# test samples).

We performed a grid search to determine the optimal hyper-parameters for each approach. The GOOD descriptor has a parameter called the number of bins, which affects both accuracy and efficiency: as the number of bins increases, within a certain range, the accuracy of GOOD increases, but so does the computation time [9]. The best results were achieved by setting the number of bins to 9. The ESF descriptor does not have any parameters, while VFH and GRSD have a radius parameter that is used to estimate a normal vector for constructing a reference frame. We observed that setting the radius parameter to 0.006 produced the best results.
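For concreteness, the evaluation measure and one representative distance function can be stated in code. This is our own sketch: the function names are ours, and the Motyka distance follows the definition commonly used in distance-measure surveys for non-negative histograms (it ranges from 0.5 for identical histograms to 1).

```python
import numpy as np

def average_class_accuracy(y_true, y_pred):
    """ACA = (1/K) * sum over classes of (# correct predictions / # test samples)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

def motyka(p, q):
    """Motyka distance between two non-negative histograms:
    sum(max(p_i, q_i)) / sum(p_i + q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.maximum(p, q).sum() / (p + q).sum()
```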
For the deep learning approaches, the best results were obtained by setting the resolution of the input image to 100 × 100 pixels. The k parameter of the KNN classifiers was set to 3 for all individual approaches, as we observed that larger values of k generally yielded no performance improvement for most of the approaches. The remaining parameters of all descriptors were left at their default values. The results of the best configuration of each method with the various distance functions are summarized in Fig. 4 (left and center). A comparison of the results showed that the majority of the object representation approaches achieved their best recognition accuracy with the Motyka distance function. Additionally, we measured the experiment time for each approach to determine which approaches are computationally expensive (see Fig. 4, right). Among the handcrafted approaches, the GRSD descriptor performed considerably worse in terms of recognition accuracy, while the ESF, GOOD, and VFH representations demonstrated a good trade-off between recognition accuracy and experiment time. Particularly for real-time applications with limited resources, handcrafted approaches provide a favorable computation time. According to these experiments, Inception-ResNet and VGG19 were too computationally expensive to use in ensemble learning methods for robotics applications. In contrast, ResNet50, Inception, and DenseNet offered an acceptable trade-off between recognition accuracy and experiment time (see Fig. 4, center and right). Out of all the deep learning approaches, ResNet50 demonstrated the best computation time, followed by Inception and DenseNet, respectively. We also observed that the computation time of the deep learning approaches was significantly higher than that of the handcrafted approaches. From these results, we formed three ensemble methods based on (i) only handcrafted representations (ESF + GOOD + VFH), (ii) only deep representations (ResNet50 + Inception + DenseNet), and (iii) a mixture of handcrafted and deep representations (ESF + GOOD + ResNet50). In the following subsections, we extensively evaluate all ensemble methods in different settings.

To evaluate the scalability of the ensemble approaches, the synthetic RGB-D dataset was randomly divided into 10, 30, 50, 70, and 90 categories. The performance of all methods was then measured in terms of average class accuracy and computation time (experiment time) achieved in a 10-fold cross-validation procedure. In particular, we assessed scalability as a function of object recognition accuracy versus the number of categories. The results are reported in Fig. 5, Fig. 6, and Table I. Comparing all results, it is clear that the ensemble learning approaches outperformed the individual methods in terms of object recognition accuracy; every ensemble performed better than any of the individual approaches. On closer inspection, we can see that, as more categories are introduced, the accuracy of all approaches decreases. This is expected: as the number of categories known by the system increases, the classification task becomes more challenging. Among all approaches, DenseNet and the ensemble with deep-only representations always maintained accuracy above 90% at all levels of scalability, with the deep-only ensemble marginally more accurate than DenseNet. We also observed that the computation time of all ensembles was substantially higher than that of the individual base learners.
Experimental results indicated that, among all ensemble methods, the handcrafted one achieved the best computation time, followed by the mixture of handcrafted and deep representations; the ensemble of deep-only representations was computationally very expensive. To form the ensemble of mixed representations, we considered GOOD, ESF, and ResNet50, since these representations demonstrated a good balance between accuracy and computation time. Table I indicates that the ensemble method based on a mixture of representations performed better than the ensemble with only handcrafted representations at all levels of scalability. According to these results (accuracy and computation time), ensemble learning based on multiple representations is a good choice for robotics applications, where the resource budget is a valid consideration.

In real-world settings, the test data differs from the training data because of several factors, e.g., sensor noise, domain shift, and the gap between simulation and real-world data in the case of sim-to-real transfer. We therefore evaluated and compared the robustness of the ensemble and base-learner approaches with respect to different levels of Gaussian noise [50] and varying point cloud resolutions [9]. As in the previous round of experiments, we followed 10-fold cross-validation and used the Restaurant object dataset for these tests.

1) Gaussian noise: To test robustness against Gaussian noise, we performed 10 rounds of evaluation, in which ten levels of Gaussian noise with standard deviations ranging from 1 mm to 10 mm were added to the test data. In these experiments, Gaussian noise was independently added to each of the X, Y, and Z coordinates of the test object. To make the process clearer, we visualize a bottle object with three levels of Gaussian noise (i.e., σ = 3 mm, σ = 6 mm, σ = 9 mm) in Fig. 7. The obtained results are plotted in Fig. 9. One important observation is that the GOOD base learner showed stable performance under all levels of noise and outperformed all ensembles and base learners by a large margin. In contrast, the performance of all other base learners decreased drastically as the level of noise increased. The ensemble methods achieved the second-best performance, while ResNet, ESF, and VFH were very sensitive to Gaussian noise. Comparing the results, it is visible that the ensembles with handcrafted, and with mixed handcrafted and deep, representations showed exceptionally better accuracy than all base learners except the GOOD learner. The underlying reason was that tie votes occurred frequently, and in most cases the GOOD learner (a member of the ensemble) won, since it produced the representation closest to the query object. Such results are explained by the fact that the GOOD descriptor uses a stable, unique, and unambiguous object reference frame that is not affected by noise. We also observed that VFH is highly sensitive to noise, as it relies on surface normal estimation to produce the representation of the object. Since ESF uses distances and angles between randomly sampled points to produce a shape description, it is also sensitive to noise. In the case of ResNet50, applying Gaussian noise to the point cloud of the object changes the rendered images of the object; the deep representation therefore differs significantly from the original representation, resulting in misclassification and poor performance. Overall, these results showed that ensemble learning using a mixture of deep and handcrafted representations yields better results.
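For reproducibility, the two test-time perturbations used in this robustness study are easy to state precisely. The sketch below is our own illustration (function names are ours); it assumes point coordinates are stored in meters, with noise levels and voxel sizes given in millimeters as in the experiments. The second function implements the centroid-based voxel downsampling used in the next subsection.

```python
import numpy as np

def add_gaussian_noise(points, sigma_mm, rng=None):
    """Independently perturb the X, Y, Z coordinates of each point with
    zero-mean Gaussian noise (sigma in millimeters, points in meters)."""
    rng = np.random.default_rng() if rng is None else rng
    return points + rng.normal(0.0, sigma_mm / 1000.0, size=points.shape)

def voxel_downsample(points, voxel_mm):
    """Keep one point (the centroid) per occupied voxel of the given size."""
    voxel = voxel_mm / 1000.0
    keys = np.floor(points / voxel).astype(np.int64)   # voxel index of each point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    counts = np.bincount(inverse)
    out = np.zeros((len(counts), 3))
    for dim in range(3):                               # centroid per voxel
        out[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return out
```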
2) Varying point cloud density: To evaluate robustness against varying point cloud density, two sets of experiments were carried out on the Restaurant object dataset. In these experiments, the density of the training data was preserved as original, while the test data was downsampled using ten different voxel sizes, ranging from 1 mm to 10 mm at 1 mm intervals. An illustrative example of a pitcher object with three levels of downsampling is shown in Fig. 8. The obtained results are summarized in Fig. 10. A comparison of all the results reveals that, as the downsampling voxel size increases, the performance of all methods decreases significantly. Furthermore, all ensemble methods performed better than the individual base learners at low and mid levels of downsampling (i.e., DS ≤ 6 mm), while at high levels of downsampling (i.e., DS > 6 mm), the GOOD descriptor and ResNet50 showed slightly better performance than the ensemble learning methods. Among the handcrafted base learners, GOOD showed stable performance at all levels of downsampling, while the performance of the base learners with the VFH and ESF descriptors dropped significantly under heavy downsampling (see Fig. 10, left). It can be concluded from this observation that the GOOD descriptor is robust to downsampling, mainly because it uses a unique and stable reference frame and a normalized orthographic projection, whereas VFH relies on surface normal estimation and ESF computes several statistical features (i.e., distances and angles between randomly sampled points) to generate a shape description for a given point cloud. In the case of the deep representations, all base learners performed very well under low and mid levels of downsampling; as the downsampling voxel size increased, the performance of all deep learning-based methods decreased (see Fig. 10, center). We also observed that the ensemble based on a mixture of handcrafted and deep representations (ResNet50 + ESF + GOOD) outperformed the other ensemble methods at all downsampling levels (see Fig. 10, right).

We also examined the performance of all approaches on three different 3D object datasets: the synthetic dataset, the Washington RGB-D object dataset [42], and the Restaurant object dataset [51]. Since instance accuracy is susceptible to class imbalance, we report both instance and average class accuracies. The results are shown in Fig. 11. In all experiments, the ensemble method outperformed all base learners. On closer inspection, we found that on the Washington RGB-D dataset, both the instance and average class accuracies of the ensemble method were about 4% higher than those of the best approach among GOOD, ESF, and ResNet50. Even though the ensemble and ESF methods achieved similar average class accuracy on the Restaurant object dataset (i.e., 94%), the ensemble method showed higher instance accuracy on the same dataset. On the synthetic object dataset, the ensemble approaches achieved the best results, followed by ResNet50, ESF, and GOOD. Furthermore, since several object categories are geometrically very similar, base learners using shape-only descriptors (ESF and GOOD) could not achieve good accuracy on the synthetic object dataset.
In this round of evaluations, we assessed the proposed approaches using an open-ended evaluation protocol [51]-[54]. The main idea is to imitate the interaction of a robotic agent with its surroundings over a long period of time. In particular, a simulated user introduces new object categories on demand using three randomly selected object views, in the same streaming fashion as would occur in any open-ended environment. The simulated user then asks the agent to classify a set of never-before-seen instances of the known categories to determine whether the agent is accurate or not. The simulated user provides corrective feedback when the agent misclassifies an object; this way, the agent can improve the model of that specific category using the misclassified object (test-then-train). The simulated user keeps training and evaluating the known category models until a certain protocol threshold τ has been reached, after which a new category is introduced. We set τ to 0.67, meaning that the object recognition accuracy of the agent must be at least twice as high as its error rate. As soon as a new category is taught, the simulated user tests the agent's performance on all previously learned categories to ensure that no catastrophic interference has occurred. This way, the agent can gradually increase its knowledge within a specific context. In case the model is unable to reach the protocol threshold after a certain number of question/correction iterations (QCI), e.g., 100, the simulated user stops the experiment, as the agent is unable to learn more categories. An alternative stopping condition is "lack of data", where the agent learns all categories before reaching the point where no more categories can be learned (see Fig. 12). Although a human user could also follow this protocol, having a simulated user follow it allows us to perform consistent and reproducible experiments in a fraction of the time a human would take to do the same experiment.
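The following sketch is a deliberately simplified rendering of this teach/ask/correct loop, under our own assumptions: `agent` exposes teach(category, view) and classify(view), `dataset` maps each category to a list of views, and the accuracy window, patience, and sampling details are simplifications rather than the exact protocol of [51]-[54].

```python
import random
from collections import deque

def open_ended_evaluation(agent, dataset, tau=0.67, window=50, patience=100):
    """Simulated-user protocol (test-then-train), returning (ALC, QCI)."""
    categories = list(dataset.keys())
    random.shuffle(categories)                          # random teaching order
    known, results, qci, idle = [], deque(maxlen=window), 0, 0

    for category in categories:
        for view in random.sample(dataset[category], 3):  # teach with 3 views
            agent.teach(category, view)
        known.append(category)
        while True:                                     # ask/correct until tau reached
            target = random.choice(known)
            view = random.choice(dataset[target])       # ideally a never-seen instance
            qci += 1
            correct = (agent.classify(view) == target)
            results.append(correct)
            if not correct:
                agent.teach(target, view)               # corrective feedback
            accuracy = sum(results) / len(results)
            if accuracy > tau and len(results) >= len(known):
                idle = 0
                break                                   # introduce the next category
            idle += 1
            if idle >= patience:                        # agent stopped learning
                return len(known) - 1, qci
    return len(known), qci                              # stopped by lack of data
```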
1) Dataset and evaluation metrics: We used the synthetic object dataset (90 categories) and the Washington RGB-D object dataset [42] (51 categories).

Fig. 12. Abstract architecture of the interaction between the simulated user and the learning agent: the simulated user is connected to a large 3D object dataset and can interact with the learning agent using three actions: teach, to introduce a new category; ask, to assess the performance of the agent; and correct, to provide feedback in case of misclassification.

We assess the performance of the agent using five metrics [55]:
• Average number of question/correction iterations (QCI) required to learn the categories.
• Average number of stored instances per category (AIC). Together, the QCI and AIC metrics indicate the time and memory usage required to learn ALC categories.
• Average number of learned categories (ALC).
• Global classification accuracy (GCA), i.e., the accuracy computed over all predictions in a complete experiment.
• Average protocol accuracy (APA), i.e., the average of the accuracy values successively computed to control the teaching protocol.

To make a fair comparison, the order in which new instances and categories are introduced is the same for all methods. In each round of experiments, the agent begins with no prior knowledge, and the order in which instances and categories are introduced is randomly generated. The experiment is repeated ten times, and the average value of each metric is reported.

2) Results: To gain a better understanding of the open-ended learning process, we plotted the performance of the agent over the initial 300 question/correction iterations in Fig. 13. Initially, the simulated user taught two object categories (spoon and toy-ambulance) to the agent, and progressively evaluated the classification accuracy of the agent after introducing each new object category. Whenever the accuracy of the agent exceeded the protocol threshold (marked by the horizontal green line), a new category was introduced (gelatin-box). The accuracy of the agent remained 100% for the first five categories. After the coke-can category was taught, some misclassifications happened and the simulated user provided corrective feedback accordingly. The agent then improved the model of the category and consequently improved its performance. This pattern repeated for three consecutive categories. Upon the introduction of the juice-box and master-chef-can categories, the agent recognized all known categories without any misclassification, whereas after the bowl category was taught, some misclassifications occurred. Again, the agent improved its knowledge after receiving feedback. Furthermore, it can be seen that this model could learn many more categories after the first 300 question/correction iterations, given that the protocol accuracy consistently stayed above the protocol threshold τ. We observed that the agent could learn all 90 synthetic object categories after around 4120 QCI iterations.

Fig. 13. Evolution of protocol accuracy over the first 300 iterations: in this experiment, the simulated user interactively teaches a new object category to the agent and evaluates its performance on all previously learned categories. In particular, if the performance of the agent exceeds the protocol threshold (set to 0.67, shown by the green horizontal dashed line), the simulated user introduces a new category (shown by a red dashed line and a category label) and examines the recognition accuracy of the agent on all previously learned categories using never-before-seen instances.

We performed 10 rounds of experiments to evaluate the performance of the proposed ensemble approaches. A detailed summary of all experiments is reported in Table II. Figure 14 shows the recognition performance of the agent versus the number of categories learned. It can be concluded from this plot that the global accuracy of all approaches decreases as the number of categories increases. This is expected, since the object recognition task becomes harder when there are more categories. We also observed that all approaches showed good performance, and the global classification accuracy of the agent remained above the protocol threshold. Overall, the ensemble approach with mixed representations achieved better global classification accuracy than the handcrafted-only and deep-only approaches. Figure 15 shows the average number of instances stored per category on the (top row) Washington and (lower row) synthetic object datasets. In these plots, each bar represents the accumulation of the three instances provided at the time the category was introduced and the instances corrected by the simulated user during the experiment. Comparing the results, it is visible that, on average, the ensemble with deep-only representations stored more instances than the ensembles with mixed and handcrafted-only representations (see Table II). It can be concluded that the ensemble approaches with handcrafted-only and mixed representations outperform the ensemble with deep-only representations in terms of the QCI (time) and AIC (memory) metrics. On closer inspection, we observed that as the number of categories increased, misclassifications happened more frequently; consequently, the agent received more corrective feedback to update the category models.
It is worth mentioning that the synthetic dataset contains fine-grained objects, such as Pringles-onion vs. Pringles-hot. Such similarities undoubtedly make the classification task more challenging; therefore, on average, experiments on the synthetic dataset required more QCI. We also compared the open-ended performance of our methods with five state-of-the-art approaches using the Washington dataset. The results are summarized in Table II. We observed that RACE [56], BoW [57], open-ended LDA [58], and Local-LDA [25] were not able to learn all categories. Furthermore, the obtained results indicated that all of our ensemble methods and OrthographicNet [54] were able to learn all 51 categories and achieved a GCA significantly higher than the protocol threshold (0.67). According to these results, these methods are able to learn many more categories. Regarding QCI, the ensemble with mixed representations was able to learn all categories slightly faster than the other methods (i.e., it required fewer iterations to learn all the categories). In terms of AIC, the ensemble methods outperformed the other approaches: the agents with the ensembles of handcrafted-only and mixed representations, on average, required 7.70 and 7.26 instances per category to learn all 51 categories, while the agents with OrthographicNet and the ensemble of deep-only representations required 8.97 and 8.88 instances per category, respectively.

3) Effect of protocol threshold on performance: The protocol threshold (τ) defines how accurately the agent should learn the known categories before a new category is introduced.

To show the real-time performance of the proposed ensemble of mixed representations, we integrated our approach into a robotic system [59], [60]. We performed two sets of robotic demonstrations: one set in simulation and the other on a real robot. Our simulated and real-robot experimental setups are shown in Fig. 17 and Fig. 18, respectively. In these experiments, the robot initially did not know any objects and hence recognized all objects as "unknown" (see Fig. 16, left). A human user then introduced the objects to the robot using a graphical menu. The robot conceptualized all objects and could subsequently recognize them correctly (see Fig. 16, right).

Fig. 16. The recognition results of the robot during the "set-the-table" task: (left) initially, the robot recognized all objects as unknown, since it did not have any information about the objects; (right) the user then taught the objects to the robot using the graphical menu, and the robot conceptualized and recognized all objects correctly.

To complete a task successfully, the robot should grasp and manipulate the target object to the desired location. For the simulation experiments, we considered the "set-the-table" task. In particular, we randomly placed three objects (e.g., spray, power drill, scissors) on top of the table and then instructed the robot to perform the "set-the-table" task by putting all objects in predefined locations (see Fig. 17). We repeated this experiment 20 times with different sets of objects. We observed that the robot was able to learn and recognize all objects precisely and place them in the desired locations.

Fig. 17. A sequence of snapshots showing the performance of our dual-arm robot in the "set-the-table" task: initially, the robot does not have any information about the objects. We randomly place three tool objects (e.g., spray, power drill, scissors) on top of the table, and then teach the robot about the objects using a graphical user interface. The robot learns and recognizes all object categories correctly. Afterward, we instruct the robot to perform the "set-the-table" task. The robot then iteratively picks the object nearest to it and places it in a predefined position.
Fig. 18. A series of snapshots demonstrating the performance of our dual-arm robot in a robot-assisted packaging scenario: we used three objects that are geometrically similar (i.e., cylindrical) but have different textures. The objects are not reachable by the human user, and therefore the robot should hand over the requested object to the user. The robot must detect the pose and label of every object correctly to perform the experiment successfully.

In the real-robot experiments, we evaluated the proposed approach in the context of a "robot-assisted-packaging" scenario, where our dual-arm robot hands over a target object to a human co-worker for packaging purposes (see Fig. 18). In this round of experiments, we randomly placed three objects on top of the table, initially not reachable by the human user. The selected objects were geometrically very similar (e.g., spray, milk bottle, and glue bottle), making the few-shot learning and classification task more challenging (see Fig. 19). In these experiments, the user first taught the labels of the objects to the robot, and then iteratively asked the robot to hand over a specific object. To accomplish the task successfully, the robot must first detect and recognize all objects correctly, then pick the target object and deliver it to the user; the user finally puts the object into the box. Note that the robot used its right arm to manipulate the target object if it was on the right side of the robot; otherwise, it used its left arm. We repeated this experiment with various sets of objects 10 times. We observed that the robot was able to learn and recognize all objects correctly and deliver them to the user successfully. A video of these experiments is available online at: https://youtu.be/nxVrQCuYGdI

In this paper, we presented lifelong ensemble learning approaches based on multiple representations to handle few-shot object recognition. More specifically, we formulated several ensemble methods based on deep-only, handcrafted-only, and a mixture of handcrafted and deep representations. To facilitate open-ended learning, each of the base learners has a memory unit for storing and retrieving object category information instantly. To assess the proposed approach in offline and open-ended settings, we generated a large synthetic object dataset consisting of 27000 views of 90 household object categories; furthermore, we used two real 3D object datasets. Experimental results showed the effectiveness of the ensemble approaches on (few-shot) 3D object recognition tasks, as well as their superior lifelong learning performance over the state-of-the-art approaches. Among the ensemble learning methods, although the ensemble with deep-only representations achieved slightly better recognition accuracy, it was computationally very expensive for robotics applications. We observed that the ensemble method with a mixture of handcrafted and deep representations offered a better trade-off between accuracy and computation time. We also demonstrated the real-time performance of the proposed approach in a set of real and simulated robotics experiments. In the simulation, we assessed the performance of the robot in the context of the "set-the-table" task, while on the real robot, we considered a "robot-assisted-packaging" scenario.
In both sets of experiments, the robot could learn and recognize all objects correctly and accomplish the tasks successfully. In the continuation of this work, we would like to investigate the possibility of improving performance by considering contextual information in ensemble methods. Visual grounding and reasoning would be another interesting avenue for future research, where the robot segments an object from a crowded scene given a natural language description.

Fig. 19. The perception of the robot during the "robot-assisted-packaging" scenario: the robot's workspace is shown by the green convex hull. The pose of each object is shown by its bounding box and reference frame. The recognition results are visualized above each object.

Acknowledgment
We thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high-performance computing cluster. Songsong Xiong is funded by the China Scholarship Council.

References
Human-aware robot navigation: A survey
A survey on deep learning-based fine-grained object classification and semantic segmentation
Scale-dependent 3D geometric features
A texture approach to leukocyte recognition
Geometric and textural blending for 3D model stylization
Ensemble learning: A survey
A survey on ensemble learning
The state of lifelong learning in service robots
GOOD: A global orthographic object descriptor for 3D object recognition and manipulation
3D object recognition and classification: a systematic literature review
Ensemble of shape functions for 3D object classification
Fast 3D recognition and pose using the viewpoint feature histogram
Local-LDA: Open-ended learning of latent topics for 3D object recognition
Local-HDP: Interactive open-ended 3D object category recognition in real-time robotic scenarios
Supervised learning: classification
View inter-prediction GAN: Unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions
Review of multi-view 3D object recognition methods based on deep learning
Multi-view convolutional neural networks for 3D shape recognition
RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints
Transfer learning for image classification
A comprehensive survey on transfer learning
A systematic literature review on transfer learning for 3D-CNNs
An empirical investigation of catastrophic forgetting in gradient-based neural networks
Active learning for convolutional neural networks: A core-set approach
Hierarchical object representation for open-ended object category learning and recognition
Active learning for imbalanced datasets
ViewAL: Active learning with viewpoint entropy for semantic segmentation
Simultaneous multi-view object recognition and grasping in open-ended domains
Riemannian walk for incremental learning: Understanding forgetting and intransigence
iCaRL: Incremental classifier and representation learning
Big Transfer (BiT): General visual representation learning
Few-shot image classification: Just use a library of pre-trained feature extractors and a simple classifier
TADAM: Task dependent adaptive metric for improved few-shot learning
Diversity with cooperation: Ensemble methods for few-shot classification
3D object recognition with ensemble learning: a study of point cloud-based deep learning models
Deep ensembles for low-data transfer learning
Ensemble distribution distillation
RGBD datasets: Past, present and future
Hierarchical object geometric categorization and appearance classification for mobile manipulation
Generalizing from a few examples: A survey on few-shot learning
K-nearest neighbors
A large-scale hierarchical multi-view RGB-D object dataset
Interactive open-ended learning for 3D object recognition: An approach and experiments
Cross-validation
Deep residual learning for image recognition
Rethinking the Inception architecture for computer vision
DenseNet: Implementing efficient ConvNet descriptor pyramids
Inception-v4, Inception-ResNet and the impact of residual connections on learning
Very deep convolutional networks for large-scale image recognition
A method for estimation and filtering of Gaussian noise in images
Interactive open-ended learning for 3D object recognition: An approach and experiments
Using spoken words to guide open-ended category formation
Coping with context change in open-ended object recognition without explicit context information
OrthographicNet: A deep transfer learning approach for 3-D object recognition in open-ended domains
Concurrent learning of visual codebooks and object categories in open-ended domains
3D object perception and perceptual learning in the RACE project
Towards lifelong assistive robotics: A tight coupling between object perception and manipulation
Online learning for latent Dirichlet allocation
Towards lifelong assistive robotics: A tight coupling between object perception and manipulation
MVGrasp: Real-time multi-view 3D object grasping in highly cluttered environments