title: Semantics for Robotic Mapping, Perception and Interaction: A Survey
authors: Garg, Sourav; Sunderhauf, Niko; Dayoub, Feras; Morrison, Douglas; Cosgun, Akansel; Carneiro, Gustavo; Wu, Qi; Chin, Tat-Jun; Reid, Ian; Gould, Stephen; Corke, Peter; Milford, Michael
date: 2021-01-02
DOI: 10.1561/2300000059

For robots to navigate and interact more richly with the world around them, they will likely require a deeper understanding of the world in which they operate. In robotics and related research fields, the study of understanding is often referred to as semantics, which concerns what the world "means" to a robot and is strongly tied to the question of how to represent that meaning. With humans and robots increasingly operating in the same world, the prospects of human-robot interaction also bring the semantics and ontology of natural language into the picture. Driven by need, as well as by enablers like the increasing availability of training data and computational resources, semantics is a rapidly growing research area in robotics. The field has received significant attention in the research literature to date, but most reviews and surveys have focused on particular aspects of the topic: the technical research issues regarding its use in specific robotic topics like mapping or segmentation, or its relevance to one particular application domain like autonomous driving. A new treatment is therefore required, and is also timely because so much relevant research has occurred since many of the key surveys were published. This survey therefore provides an overarching snapshot of where semantics in robotics stands today. We establish a taxonomy for semantics research in or relevant to robotics, split into four broad categories of activity, in which semantics are extracted, used, or both. Within these broad categories we survey dozens of major topics including fundamentals from the computer vision field and key robotics research areas utilizing semantics, including mapping, navigation and interaction with the world. The survey also covers key practical considerations, including enablers like increased data availability and improved computational hardware, and major application areas where...

For robots to move beyond the niche environments of fulfilment warehouses, underground mines and manufacturing plants into widespread deployment in industry and society, they will need to understand the world around them. Most mobile robot and drone systems deployed today make relatively little use of explicit higher-level "meaning" and typically only consider geometric maps of environments, or three-dimensional models of objects in manufacturing and logistics contexts. Despite considerable success and uptake to date, there is a large range of domains with few, if any, commercial robotic deployments; for example: aged care and assisted living facilities, autonomous on-road vehicles, and drones operating in close proximity to cluttered, human-filled environments. Many challenges remain to be solved, but we argue one of the most significant is simply that robots will need to better understand the world in which they operate, in order for them to move into useful and safe deployments in more diverse environments. This need for understanding is where semantics meets robotics. Semantics is a widely used term, not just in robotics but across fields ranging from linguistics to philosophy.
In the robotics domain, despite widespread usage of semantics, there is relatively little formal definition of what the term means. In this survey, we aim to provide a taxonomy rather than a specific definition of semantics, and note that the surveyed research exists along a spectrum from traditional, non-semantic approaches to those which are primarily semantically-based. Broadly speaking, we can consider semantics in a robotics context to be about the meaning of things: the meaning of places, objects, other entities occupying the environment, or even the language used to describe them. Prior reviews have already examined several of these threads in depth. One review of deep learning, for example, sought to understand the level of depth required for a given Neural Network (NN) application; with a vision of creating general-purpose learning algorithms, its authors highlighted the need for a brain-like learning system following the rules of fractional neural activation and sparse neural connectivity. [12] presented a more focused review of deep learning, surveying generic object detection. They highlighted the key elements involved in the task, such as the accuracy-efficiency trade-off of detection frameworks, the choice and evolution of backbone networks, the robustness of object representation, and reasoning based on additionally available context. A significant body of work has focused on extracting more meaningful abstractions of the raw data typically obtained in robotics, such as 3D point clouds. Towards this end, a number of surveys have been conducted in recent years for point cloud filtering [13] and description [14], 3D shape/object classification [15], [16], 3D object detection [15]-[19], 3D object tracking [16] and 3D semantic segmentation [15]-[17], [20]-[24]. With only a couple of exceptions, all of these surveys have particularly reviewed the use of deep learning on 3D point clouds for the respective tasks. Segmentation has also long been a fundamental component of many robotic and autonomous vehicle systems, with semantic segmentation focusing on labeling areas or pixels in an image by class type. In particular, the overall goal is to label by class, not by instance. For example, in an autonomous vehicle context this goal constitutes labeling pixels as belonging to a vehicle, rather than as a specific instance of a vehicle (although that is also an important capability). The topic has been the focus of a large quantity of research, with resulting survey papers that focus primarily on semantic segmentation, such as [22], [25]-[30]. Beyond these flagship domains, semantics have also been investigated in a range of other subdomains. [31] reviewed the use of semantics in the context of understanding human actions and activities, to enable a robot to execute a task. They classified semantics-based methods for recognition into four categories: syntactic methods based on symbols and rules, affordance-based understanding of objects in the environment, graph-based encoding of complex variable relations, and knowledge-based methods. In conjunction with recognition, different methods to learn and execute various tasks were also reviewed, including learning by demonstration, learning by observation, and execution based on structured plans. Likewise for the service robotics field, [32] presented a survey of vision-based semantic mapping, particularly focusing on its need for an effective human-robot interface for service robots, beyond pure navigation capabilities. [33] also surveyed knowledge representations in service robotics.
Many robots are likely to require image retrieval capabilities where semantics may play a key role, including in scenarios where humans are interacting with the robotic systems. [34] and [35] surveyed the "semantic gap" in current content-based image retrieval systems, highlighting the discrepancy between the limited descriptive power of low-level image features and the typical richness of (human) user semantics. Bridging this gap is likely to be important for both improved robot capabilities and better interfaces with humans. Some of the reviewed approaches to reducing the semantic gap, as discussed in [35] , include the use of object ontology, learning meaningful associations between image features and query concepts and learning the user's intention by relevance feedback. This semantic gap concept has gained significant attention and is reviewed in a range of other papers including [36] [37] [38] [39] [40] [41] [42] . Acting upon the enriched understanding of the scene, robots are also likely to require sophisticated grasping capabilities, as reviewed in [43] , covering vision-based robotic grasping in the context of object localization, pose estimation, grasp detection and motion planning. Enriched interaction with the environment based on an understanding of what can be done with an object -its "affordances" -is also important, as reviewed in [44] . Enriched interaction with humans is also likely to require an understanding of language, as reviewed recently by [45] . This review covers some of the key elements of language usage by robots: collaboration via dialogue with a person, language as a means to drive learning and understanding natural language requests, and deployment, as shown in application examples. The majority of semantics coverage in the literature to date has occurred with respect to a specific research topic, such as SLAM or segmentation, or targeted to specific application areas, such as autonomous vehicles. As can be seen in the previous section, there has been both extensive research across these fields as well as a number of key survey and review papers summarizing progress to date. These deep dives into specific sub-areas in robotics can provide readers with a deep understanding of technical considerations regarding semantics in that context. As the field continues to grow however there is increasing need for an overview that more broadly covers semantics across all of robotics, whilst still providing sufficient technical coverage to be of use to practitioners working in these fields. For example, while [1] extensively considers the use of semantics primarily within SLAM research, there is a need to more broadly cover the role of semantics in various robotics tasks and competencies which are closely related to each other. The task, "bring a cup of coffee", likely requires semantic understanding borne out of both the underlying SLAM system and the affordance-grasping pipeline. This survey therefore goes beyond specific application domains or methodologies to provide an overarching survey of semantics across all of robotics, as well as the semantics-enabling research that occurs in related fields like computer vision and machine learning. 
To encompass such a broad range of topics in this survey, we have divided our coverage of research relating to semantics into a) the fundamentals underlying the current and potential use of semantics in robotics, b) the widespread use of semantics in robotic mapping and navigation systems, and c) the use of semantics to enhance the range of interactions robots have with the world, with humans, and with other robots. This survey is also motivated by timeliness: the use of semantics is a rapidly evolving area, due to both significant current interest in this field, as well as technological advances in local and cloud compute, and the increasing availability of data that is critical to developing or training these semantic systems. Consequently, with many of the key papers now half a decade old or more, it is useful to capture a snapshot of the field as it stands now, and to update the treatment of various topic areas based on recently proposed paradigms. For example, this paper discusses recent semantic mapping paradigms that mostly post-date key papers by [1] , [7] , such as combining single-and multi-view point clouds with semantic segmentation to directly obtain a local semantic map [46] [47] [48] [49] [50] . Whilst contributing a new overview of the use of semantics across robotics in general, we are also careful to adhere where possible to recent proposed taxonomies in specific research areas. For example, in the area of 3D point clouds and their usage for semantics, within Section III-C, with the help of key representative papers, we briefly describe the recent research evolution of using 3D point cloud representations for learning object-or pixel-level semantic labeling, in line with the taxonomy proposed by existing comprehensive surveys [15] [16] [17] , [22] , [23] . Finally, beyond covering new high level conceptual developments, there is also the need to simply update the paper-level coverage of what has been an incredibly large volume of research in these fields even over the past five years. The survey refers to well over a hundred research works from the past year alone, representative of a much larger total number of research works. This breadth of coverage would normally come at the cost of some depth of coverage: here we have attempted to cover the individual topics in as much detail as possible, with over 900 referenced works covered in total. Where appropriate we also make reference to prior survey and review papers where further detailed coverage may be of interest, such as for the topic of 3D point clouds and their usage for semantics. Moving beyond single application domains, we also provide an overview of how the use of semantics is becoming an increasingly integral part of many trial (and in some cases full scale commercial) deployments including in autonomous vehicles, service robotics and drones. A richer understanding of the world will open up opportunities for robotic deployments in contexts traditionally too difficult for safe robotic deployment: nowhere is this more apparent perhaps than for on-road autonomous vehicles, where a subtle, nuanced understanding of all aspects of the driving task is likely required before robot cars become comparable to, or ideally superior to, human drivers. 
Compute and data availability has also enabled many of the advancements in semantics-based robotics research; likewise these technological advances have also facilitated investigation of their deployment in robotic applications that previously would have been unthinkable -such as enabling sufficient on-board computation for deploying semantic techniques on power-and weight-limited drones. We cover the current and likely future advancements in computational technology relevant to semantics, both local and online versions, as well as the burgeoning availability of rich, informative datasets that can be used for training semantically-informed systems. In summary, this survey paper aims to provide a unifying overview of the development and use of semantics across the entire robotics field, covering as much detailed work as feasible whilst referencing the reader to further details where appropriate. Beyond its breadth, the paper represents a substantial update to the semantics topics covered in survey and review papers published even only a few years ago. By surveying the technical research, the application domains and the technology enablers in a single treatment of the field, we can provide a unified snapshot of what is possible now and what is likely to be possible in the near future. Existing literature covering the role of semantics in robotics is fragmented and is usually discussed in a variety of task-and application-specific contexts. In this survey, we consolidate the disconnected semantics research in robotics; draw links with the fundamental computer vision capabilities of extracting semantic information; cover a range of potential applications that typically require high-level decision making; and discuss critical upcoming enhancers for improving the scope and use of semantics. To aid in navigating this rapidly growing and already sizable field, here we propose a taxonomy of semantics as it pertains to robotics (see Figure 1 ). We find the relevant literature can be divided into four broad categories: 1) Static and Un-embodied Scene Understanding, where the focus of research is typically on developing intrinsic capability to extract semantic information from images, for example, object recognition and image classification. The majority of research in this direction uses single image-based 2D input to infer the underlying semantic or 3D content of that image. However, image acquisition and processing in this case is primarily static in nature (including videos shot by a static camera), separating it conceptually from a mobile embodied agent's dynamic perception of the environment due to motion of the agent. Because RGB cameras are widely used in robotics, and the tasks being performed, such as object recognition, are also performed by robots, advances in this area are relevant to robotics research. In Section II, we introduce the fundamental components of semantics that relate to or enable robotics, focusing on topics that have been primarily or initially investigated in non-robotics but related research fields, such as computer vision. We cover the key components of semantics as regards object detection, segmentation, scene representations and image retrieval, all highly relevant capabilities for robotics, even if not all the work has yet been demonstrated on robotic platforms. 2) Dynamic Environment Understanding and Mapping, where the research is typically motivated by the mobile or dynamic nature of robots and their surroundings. 
The research literature in this category includes the task of semantic mapping, which could be topological, or a dense and precise 3D reconstruction. These mapping tasks can often leverage advances in static scene understanding research, for example, place categorization (image classification) forming the basis of semantic topological mapping, or pixel-wise semantic segmentation being used as part of a semantic 3D reconstruction pipeline. Semantic maps provide a representation of information and understanding at an environment or spatial level. With the increasing use of 3D sensing devices, along with the maturity of visual SLAM, research on semantic understanding of 3D point clouds is also growing, aimed at enabling a richer semantic representation of the 3D world. In Section III, we cover the use of semantics for developing representations and understanding at an environment level. This includes the use of places, objects and scene graphs for semantic mapping, and 3D scene understanding through Simultaneous Localization And Mapping (SLAM) and point clouds processing. 3) Interacting with Humans and the World, where the existing research "connects the dots" between the ability to perceive and the ability to act. The literature in this space can be further divided into the "perception of interaction" and "perception for interaction". The former includes the basic abilities of understanding actions and activities of humans and other dynamic agents, and enabling robots to learn from demonstration. The latter encompasses research related to the use of the perceived information to act or perform a task, for example, developing a manipulation strategy for a detected object. In the context of robotics, detecting an object's affordances can be as important as recognizing that object, enabling semantic reasoning relevant to the task and affordances (e.g. 'cut' and 'contain') rather than to the specific object category (e.g. 'knife' and 'jar'). While object grasping and manipulation relate to a robot's interaction with the environment, research on interaction with other humans and robots includes the use of natural language to generate inverse semantics, or to follow navigation instructions. Section IV addresses the use of semantics to facilitate robot interaction with the world, as well as with the humans and robots that inhabit that world. It looks at key issues around affordances, grasping, manipulation, higher-level goals and decision making, human-robot interaction and vision-and-language navigation. 4) Improving Task Capability, where researchers have focused on utilizing semantic representations to improve the capability of other tasks. This includes for example the use of semantics for high-level reasoning to improve localization and visual place recognition techniques. Furthermore, semantic information can be used to solve more challenging problems such as dealing with challenging environmental conditions. Robotics researchers have also focused on techniques that unlock the full potential of semantics in robotics, since existing research has not always been motivated by or had to deal with the challenges of real world robotic applications, by addressing challenges like noise, clutter, cost, uncertainty and efficiency. 
In Section V, we discuss various ways in which researchers extract or employ semantic representations for localization and visual place recognition, dealing with challenging environmental conditions, and generally enabling semantics in a robotics context through addressing additional challenges. The four broad categories presented above encompass the relevant literature on how semantics are defined or used in various contexts in robotics and related fields. This is also reflected in Figure 1 through 'extract semantics' and 'use semantics' labels associated with different sections of the taxonomy. Extracting semantics from images, videos, 3D point clouds, or by actively traversing an environment are all methods of creating semantic representations. Such semantic representations can be input into high-level reasoning and decisionmaking processes, enabling execution of complex tasks such as path planning in a crowded environment, pedestrian intention prediction, and vehicle trajectory prediction. Moreover, the use of semantics is often finetuned to particular applications like agricultural robotics, autonomous driving, augmented reality and UAVs. Rather than simply being exploited, the semantic representations themselves can be jointly developed and defined in consideration of how they are then used. Hence, in Figure 1 , the sections associated with 'use semantics' are also associated with 'extract semantics'. These high-level tasks can benefit from advances in fundamental and applied research related to semantics. But this research alone is not enough: advances in other areas are critical, such as better cloud infrastructure, advanced hardware architectures and compute capability, and the availability of large datasets and knowledge repositories. Section VI reviews the influx of semantics-based approaches for robotic deployments across a wide range of domains, as well as the critical technology enablers underpinning much of this current and future progress. Finally, Section VII discusses some of the key remaining challenges in the field and opportunities for addressing them through future research, concluding coverage of what is likely to remain an exciting and highly active research area into the future. Many of the advances in the development and use of semantics have occurred in the computer vision domain, with a heavy focus on identifying and utilizing semantics in terms of images or scenes -for example in 2D semantic segmentation, object recognition, scene representations, depth estimation, image classification and image retrieval. For the majority of these research tasks, image processing is primarily static in nature, conceptually distinct from a mobile embodied agent's dynamic perception of the environment. Nevertheless, these processes can all play a role in robotics, whether it be for a robot perceiving and understanding the scene in front of it, or searching through its map or database to recognize a place it has visited before. In this section we focus on the fundamental computer vision research involving semantics for static scene understanding which is directly relevant to robotics. The following subsections discuss semantic representations explored in the literature, firstly in the context of image classification and retrieval, followed by extracting and using semantic information in the form of objects and dense segmentation, and finally, scene representation including scene graphs and semantic generalized representations. 
Image classification is the process of determining the content shown within an image, typically at a whole image level, although the process may involve detecting and recognizing specific objects or regions within the overall image. Complementing classification, image retrieval refers to the process of parsing a database of images and picking out images that meet a certain content criteria, which can be specified in a number of ways. For example, images can be retrieved based on a "query" image (find images with similar content), or based on some description of the desired content within the image (find images of trees). Image retrieval in particular has a strong overlap with visual mapping and place recognition techniques, and hence much of the mapping or spatially-specific retrieval research is covered in that section. Here we touch on classification and retrieval techniques with a focus on developments in the semantic space. 1) Image-based Classification: A typical goal of this task is to semantically tag images based on their visual content, for example, classifying an image as belonging to a category like "peacock" or "train station". The former represents the presence of an "object" within an image, whereas the latter represents a "place". In a broader sense however, image classification or recognition can be based on a hierarchy of concepts that might exist in the visual observation. [51] presented a hierarchical generative model that classifies the overall scene, recognizes and segments each object component, and annotates the image with a list of tags. It is claimed to be the first model that performs all three tasks in one coherent framework. The framework was able to learn robust scene models from noisy web data such as images and user tags from Flickr.com. [52] proposed a high-level image representation, Object Bank, based on a scale-invariant response map of a large number of pre-trained generic object detectors, for the task of object recognition and scene classification. [53] designed an approach to jointly reason about regions, location, class and spatial extent of objects, the presence of a class in the image, and the scene type. The novel reasoning scheme at the segment level enabled efficient learning and inference. [54] proposed the Fisher Vector -a state-of-the-art patch encoding technique -as an alternative to Bag of Visual Words (BoVW), where patches are described by their derivation from a universal generative Gaussian Mixture Model (GMM). [55] developed a Bag of Semantic Words (BoSW) model based on automatic dense image segmentation and semantic annotation, achieved by graph cut and (Support Vector Machine) SVM respectively. While the SVM learns one-versus-all classification for different semantic categories, the reduced vocabulary size due to BoSW decreases computation time and increases accuracy. While effective, the methods above rely on hand-designed image features that are not optimally designed for image classification. Consequently alternative approaches that can automatically learn not only the classifier, but also the optimal features, have been investigated, with methods based on deep learning techniques generally being superior, as detailed below. Like many other research areas, deep learning has played an increasing role in image classification over the past decade. 
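To make the classical pipeline concrete, the following is a minimal sketch of a Bag of Visual Words representation of the kind underpinning several of the methods above: local descriptors (e.g., SIFT) are quantized against a vocabulary learned by k-means, and each image is summarized as a histogram of visual words that can then be classified or matched. The descriptors are assumed to be precomputed, and the function names and vocabulary size are illustrative rather than drawn from any cited work.

```python
# A minimal sketch of the classical Bag of Visual Words (BoVW) pipeline: local descriptors
# are quantized against a learned visual vocabulary and each image is represented by a
# histogram of visual words. Descriptors (e.g., SIFT) are assumed to be precomputed.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, num_words=256, seed=0):
    """descriptor_sets: list of (n_i, d) arrays, one per training image."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=num_words, random_state=seed, n_init=10).fit(all_desc)

def bovw_histogram(descriptors, vocabulary):
    """Quantize an image's descriptors and return an L1-normalized word histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

def most_similar(query_hist, database_hists):
    """Retrieve the index of the most similar database image under cosine similarity."""
    db = np.asarray(database_hists)
    sims = db @ query_hist / (np.linalg.norm(db, axis=1) * np.linalg.norm(query_hist) + 1e-12)
    return int(np.argmax(sims))
```

Replacing the hard quantization step with a soft or Fisher Vector style encoding, as in [54], changes only the aggregation stage of this pipeline; the overall region-extract-encode-compare structure remains the same.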
With the availability of large-scale image dataset like ImageNet [56] and the availability of Graphics Processing Units (GPUs), learning via deep Convolutional Neural Network (CNN) such as AlexNet [57] opened up enormous opportunities for enhanced scene understanding. [58] introduced the Places dataset, consisting of 7 million labeled scenes, and demonstrated state-of-the-art performance on scene-centric datasets. Through this dataset, complementing ImageNet, [58] highlighted the differences in internal representations of object-centric and scene-centric networks. [59] extended this line of work, discussing the released Places database of 10 million images and its corresponding Places-CNNs baselines for scene recognition problems. [60] showed that object detectors naturally emerge from training CNNs for scene classification. They argue that since scenes are typically composed of objects, CNNs trained on them automatically discover meaningful object detectors, representative of the learned scene category. This work then demonstrated that the same network can perform both scene recognition and object localization in a single forward-pass, without an explicitly taught notion of objects. The methods above are generally referred to as weakly supervised detectors, because they are trained with scene-level annotations but are able to localise objects within a scene. One of their main limitations was that their object detection accuracy was significantly inferior to fully supervised detection methods that relied on object localisation labels -further research in the area has focused on reducing this performance gap. A CNN visualisation technique was presented in [61] where the Class Activation Map (CAM) was based on Global Average Pooling (GAP) in CNNs. This work demonstrated that GAP enables accurate localization ability (i.e. activation of scene category-specific local image regions), despite being trained on image-level labels for scene classification. Meng et al. [62] designed a first-of-its-kind semantic place categorization system based on text-image pairs extracted from social media. They concatenate and learn features from two CNNs -one each for the two modalities -to predict the place category. The proposed system uses a newly curated dataset with 8 semantic categories: home, school, work, restaurant, shopping, cinema, sports, and travel. [63] explored aggregation of CNN features through two different modules: a "Content Estimator" and a "Context Estimator", where the latter is comprised of three sub-modules based on text, object and scene context. The authors claim it is the first work that leverages text information from a scene text detector for extracting context information. The papers above were successful at introducing approaches that combined different modalities, but did not include geometrical information, limiting their ability to "understand" the scene. When depth information is available, including when explicitly available from range-based sensors, RGB-D-based scene recognition and classification techniques classify scene images using aligned color and depth information. Wang et al. [64] extracted and combined deep discriminative features from different modalities in a component-aware fusion manner. Gupta et al. [65] transferred the RGB model to depth net using unlabeled paired data according to their mid-level representations. More recently, Du et al. [66] presented a unified framework to integrate the tasks of cross-modal translation and modality specific recognition. 
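As a concrete illustration of the GAP-based class activation mapping idea of [61] discussed above, the sketch below computes a class activation map as a classifier-weighted sum of the final convolutional feature maps. It assumes a recent torchvision (for the pretrained ResNet-18 weights) and uses an ImageNet-trained backbone purely as a stand-in for the scene-classification networks used in the original work; the function name is illustrative.

```python
# A minimal sketch of GAP-based class activation mapping (CAM) in the spirit of [61].
# Assumes a torchvision ResNet-18 with ImageNet weights as a stand-in backbone.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def class_activation_map(image, class_idx=None):
    """image: (1, 3, H, W) tensor, normalized as the backbone expects."""
    with torch.no_grad():
        # Run the backbone up to the last convolutional block: (1, 512, h, w) features.
        feats = torch.nn.Sequential(*list(model.children())[:-2])(image)
        # Global average pooling followed by the final linear layer gives class scores.
        logits = model.fc(feats.mean(dim=(2, 3)))
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        # CAM = weighted sum of feature maps, with weights taken from the classifier
        # row of the predicted (or requested) class.
        weights = model.fc.weight[class_idx]                 # (512,)
        cam = F.relu(torch.einsum("c,chw->hw", weights, feats[0]))
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Upsample to the input resolution so the map can be overlaid on the image.
    cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return cam, class_idx
```

In the weakly supervised setting described above, any localization ability comes entirely from image-level training labels; no box- or pixel-level supervision is involved.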
While additional cues in the form of depth information or text descriptions improve image classification, the essence of the task is to enable semantic scene understanding by providing a "compressed" but meaningful representation of the visual observation, which a robot can then use to intelligently reason about the environment. 2) Semantics for Image Retrieval and Classification: Except in specific circumstances or in the case of instance-level retrieval (finding a specific image), broader content-based image retrieval generally requires some semantics-based knowledge of the content of images in order to be feasible. Much of the research in this area has focused on what form that knowledge should take, and how similarity should be calculated. [67] explored concept-relationship based feature extraction and learning using textual and visual data for content-based image retrieval. This was achieved by using five pre-defined specific concept relationships: complete similarity (Beijing, Peking), type similarity (Husky, Bulldog), hypernym-hyponym (Husky, Domestic Dog), parallel relationship (cat, dog), and unknown relationship; these conceptual semantic relationships were shown to improve retrieval performance. [68] proposed fusing low-level and high-level features from sketches and images, along with clustering-based reranking optimization for image retrieval. [69] developed a novel joint binary codes learning method that combines image features with latent semantic features (labels/tags of images); the intent was to encode samples sharing the same semantic concepts into similar codes, rather than only preserving geometric structure via hashing. [70] explored a method to retrieve images based on semantic information like "a boy jumping" or "throw", as opposed to traditional instance-level retrieval. They proposed a similarity function based on annotators' captions as a computable surrogate of true semantic similarity for learning a semantic visual representation. Learning a joint embedding for visual and textual representations improved accuracy, and enabled combined text- and image-based retrieval. To reduce the semantic gap in existing visual BoW approaches, [71] extracted SIFT [72], Local Binary Patterns (LBP) [73], and color histogram features separately for the foreground and background in an image, segmented using the SaliencyCut method [74]. [75] presented Situate - an architecture to learn models that capture visual features of objects and their spatial configuration to retrieve instances of visual situations like "walking a dog" or "a game of ping-pong". The method actively searches for each expected component of the situation in the image to calculate a matching score. [76] explored hierarchy-based semantic embeddings for image retrieval by incorporating prior knowledge about semantic relationships between classes obtained from the class hierarchy of WordNet [77]. [78] used an end-to-end trainable CNN to learn "semantic correspondences" as dense flow between images depicting different instances of the same semantic object category, such as a bicycle. To train the CNN, they used binary segmented images with synthetic geometric transformations and a new differentiable argmax function. [79] proposed the Visual Semantic Reasoning Network (VSRN), based on connections between image regions (ROIs from Faster R-CNN [80]). This approach learns features with semantic relationships from the pairwise affinity between regions using a Graph Convolutional Network.
The research included both image-to-text (caption) and text-to-image (image) retrieval capabilities; the learnt representation (also visualized in 2D) captures key semantic objects (bounding boxes) and semantic concepts (within a caption) of a scene as in the corresponding text caption. [81] designed a new two-path neural network architecture that maps images and text while enabling spatial localization in the proposed semantic-visual embedding. Using this system, text queries like "a burger", "some beans" and "a tomato", can be localized in an image comprising all of these. [82] presented a framework for event detection that uses "semantic interaction" based on pairwise affinity between semantic representations of multiple source videos. This approach enables the detection of events such as "birthday party", "grooming an animal", "town hall meeting" or "marriage proposal". Semantic image retrieval and semantic image classification are closely related to each other, as improved methods for the latter inherently enable higher accuracy for the former. This is in part due to the high level of abstraction and reasoning capability that is attained through semantically representing images. Such an approach also bridges the gap between natural language semantics of humans and the user's intended retrieval outcome, thus reducing the semantic gap. In a world where robots and humans co-exist, reducing the semantic gap will likely lead to more seamless human-robot interaction. While semantic reasoning at the whole image level can enable high-level decision making for a robot such as that required for path planning, semantically understanding the content within an observed image is necessary for a robot to perform a task such as manipulating an object. A key task for many robots operating in real world environments therefore is object detection and recognition. While there is no absolute definition of what constitutes an object, general definitions revolve around the concept of objects being distinct things or entities in the environment, with the ability to be seen and touched. Before moving into more sophisticated uses of objects and object understanding in the context of robotic tasks like mapping and environmental interaction, we first cover some key background. Detection is exactly that: detecting what objects are present in a particular environment, observed through the robot's sensors. Recognition involves calculating what types of objects are present, and can be performed at both a class level -all the mugs in the room -and at instance level -a specific, single mug. Classical approaches to object recognition include template-based approaches, which match potential objects in the scene to some form of 2D or 3D exemplar of the object class; hand-crafted feature-based methods, which build object structures on top of simple edge, SIFT [83] or SURF [84] features, and direct approaches that perform matching based on pixel intensities or intensity gradients in an image. Because of the inherent difficulty of the problem in all but the simplest of circumstances, decades of research has developed enhancements to these processes including efficient tree-based search methods, various geometric matching tests and a range of other mechanisms. With the advent of modern deep learning techniques, a number of key advances were made in CNN-based object detection approaches, including Fast R-CNN [85] , SPPNet [86] and Faster R-CNN [80] . 
In particular, approaches have started to investigate the use of semantics as part of the detection pipeline. [87] presented a modular scene understanding system that used semantic segmentation to improve object detection performance for out-of-thebox specific object detectors. In the 3D object detection area, [88] explored a detection and tracking pipeline for autonomous vehicles using RGB, LiDAR, and dense semantic segmentation to form semantic point clouds (voxel grids), which are then fed into Complex-YOLO to obtain 3D detections. They use Labeled Multi-Bernoulli-Filter for multi-target tracking and claim to be the first to fuse visual semantics with 3D object detection. [89] presented a generic object tracking framework that uses a semantic segmentation network for object proposals by ranking and selecting relevant feature maps from the final layer based on activations. Temporal information and template models were used to improve prediction. [90] designed a Detection with Enriched Semantics (DES) network for object detection that uses semantic segmentation as an input to the network (in a weakly supervised manner) to enrich the learnt feature maps. They also use a global activation module for extracting high-level information from higher-order layers of the network. [91] proposed an end-to-end learning framework for pedestrian attribute recognition using a Graph Convolution Network comprising sub-networks: one to capture spatial relations between image regions and another to learn semantic relationships between attributes. The state-of-the-art 2D object detectors have also been shown to be useful in facilitating automated annotation and recovery of 3D objects when used in conjunction with sparse LiDAR data, as demonstrated recently in [92] . Pursuing this capability is partly motivated by the current requirement for enhanced 3D scene understanding to enable practical applications like autonomous driving. Given this requirement, there has been significant activity in 3D object detection research, which is covered in detail in Section III-C, especially in the context of using 3D point clouds. In robotics, object detection and recognition is fundamental to the core competencies of a robot, that is, perception and interaction. Spatially-intelligent robots would require understanding both the object of interest and everything else that surrounds it in order to successfully perform many tasks, which can only be achieved by accurate 2D and 3D object recognition and dense semantic segmentation. Object-only semantic understanding is typically motivated by applications that define an object or region of interest; it is detected, recognized, often tracked, and in the case of robotics, grasped and manipulated. While dealing with objects, the "background" information is typically discarded. In an indoor environment where the task might be counting coffee mugs of different colors, this background information exists in the form of walls, floors, and ceilings. Similarly, for an outdoor environment, the task of pedestrian detection and tracking might consider roads and buildings as background. Beyond the notion of background, the lack of a single unique way of representing object shapes paves the way for dense image segmentation. Like object detection, segmentation is a research topic with great relevance to robotics (especially autonomous vehicles in recent years) where much of the pioneering work was performed in the computer vision domain. 
The core segmentation task is straightforward to describe: partitioning an image up into segments, but the ways in which this can be achieved, and its subsequent use cases for robotics vary hugely. Segmentation is tightly coupled to semantics because the most common segmentation approaches revolve around dividing up the image into semantically meaningful areas, with objects being a common example -in an autonomous vehicle context this could be pedestrians, vehicles, and traffic signs. Dense semantic segmentation is not only often more informative than specific object detection, it also discloses various ways of leveraging additional context for performing a task, for example, utilizing spatial co-existence or physics based priors. Semantic segmentation is increasingly becoming a core part of systems in the robotics field and its development and use in mapping or interaction contexts is covered in the appropriate sections later in this survey. Here we overview the key foundational work which many of these robotic systems then build upon. 1) 2D Semantic Segmentation: The world is a three-dimensional environment but much of the key work has occurred in the 2D image co-ordinate space. Already a significant research topic before modern deep learning techniques became dominant, earlier works involved elevating the capability of coherent region-based segmentation [93] , including watershed [94] and mean-shift [95] like techniques, to a more meaningful parts-based semantic segmentation [96] [97] [98] . [96] induced semantic labeling using an example image from a common image domain based on a non-parametric model. [97] presented semantic segmentation and object labeling based on a novel region-growing algorithm, combined with context obtained from ontological knowledge representation. [98] explored a hierarchical tree based, rather than flat, partitioning of an image [99] , [100] within the scope of using graph cut [101] for image segmentation. Given that these approaches were mostly based on prior information about particular segmentation tasks, it was challenging for them to produce highly accurate segmentation results. With the advent of modern deep learning came a proliferation of new semantic segmentation methods. Long et al. [102] presented the Fully Convolutional Network (FCN) that processes arbitrary size input images and produces correspondingly-sized output with efficient inference and learning. They also defined a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow fine layer to produce accurate and detailed segmentations. Yu et al. [103] designed a CNN specifically for dense prediction, with their proposed model based on "dilated convolutions" systematically aggregating multi-scale contextual information without losing resolution. Badrinarayanan et al. [104] presented SegNet, an "encoderdecoder" architecture for semantic segmentation. The main novelty of SegNet was in the up-sampling strategy, where the decoder used the pooling indices from the encoder's max pooling step to perform non-linear upsampling, thereby eliminating the need for learning to up-sample. The system was relatively efficient in terms of memory and computation time. Another relevant model was the Unet [105] , that consisted of an encoder-decoder architecture with skip connections from the contracting path (i.e., encoder) to the expanding path (i.e., decoder), with the goal of addressing the issue of vanishing gradients. 
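The encoder-decoder pattern shared by these early architectures can be summarized in a short sketch: an encoder progressively downsamples the image into coarse semantic features, a decoder upsamples back to full resolution, and skip connections reintroduce fine spatial detail from the encoder, in the spirit of FCN, SegNet and U-Net. The network below is deliberately tiny and illustrative; its depth, channel widths and class count (19, as in Cityscapes-style benchmarks) are assumptions, not a reproduction of any published model.

```python
# A minimal PyTorch sketch of the encoder-decoder segmentation pattern described above.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinySegNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)      # 128 = 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)       # 64 = 32 (upsampled) + 32 (skip)
        self.head = nn.Conv2d(32, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution, low-level features
        e2 = self.enc2(self.pool(e1))        # half resolution
        b = self.bottleneck(self.pool(e2))   # quarter resolution, coarse semantics
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                 # (B, num_classes, H, W) logits

# Usage: per-pixel labels are the argmax over the class dimension.
logits = TinySegNet()(torch.randn(1, 3, 128, 128))
labels = logits.argmax(dim=1)                # (1, 128, 128) semantic map
```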
These approaches represent some of the first attempts to use deep learning for semantic segmentation, and the impressive results at that time significantly motivated the development of new deep learning approaches. The Efficient neural Network (ENet) was proposed by Paszke et al. [106] for tasks requiring low latency operations; their network was up to 18x faster, required 75x less FLOPs, and had 79x less parameters than comparable methods at the time, while achieving similar or better accuracy. Chen et al. [107] designed a semantic segmentation network based on: 1) atrous (dilated) convolutions that enlarged the field of view of filters using the same number of parameters, 2) Atrous Spatial Pyramid Pooling (ASPP) that aided in segmenting objects at multiple scales, and 3) fully connected CRFs that improved the localization boundaries. More recently, Takikawa et al. presented the Gated-SCNN [108] : a two-branch CNN model that analyses the RGB image and an image gradient (representing the shape information of the image). These branches are connected by a new type of gating mechanism that joins the higher-level activations of the RGB image to the lower-level activations of the gradient image. Gated-SCNN [108] held the state-of-the-art segmentation results in several public computer vision datasets. As the application of segmentation methods continued to expand into challenging scenarios such as autonomous vehicles, it became clear that improvements were needed to make them perform well in challenging or adverse environmental conditions. Sakaridis et al. [109] developed a pipeline to add synthetic fog to real clear-weather images using incomplete depth information, to address the problem of semantic scene understanding in foggy conditions. They used supervised and semi-supervised learning techniques to improve performance, introducing the Foggy Driving dataset and Foggy Cityscapes dataset. In order to deal with variations in weather conditions, [110] explored unsupervised domain adaptation to learn accurate semantic segmentation in a challenging target weather condition using an ideal source weather condition. Integrating information over time -in an autonomous vehicle context for example over multiple camera frames -can also improve performance in challenging circumstances. Kundu et al. [111] proposed a method for longrange spatio-temporal regularization in semantic video segmentation, in contrast to a naive regularization over the video volume, which does not take into account camera and object motion. They used optical flow and dense CRF over points optimized in Euclidean feature space to improve accuracy, and demonstrated the effectiveness of the method on outdoor Cityscapes data [112] . [113] presented a Bayesian filtering approach to semantic segmentation of a video sequence, where each pixel is considered a random variable with a discrete probability distribution function. He et al. [114] presented a superpixel-based multi-view CNN for semantic segmentation that leveraged information from additional views of the scene, by first computing region correspondences through optical flow and superpixels, and then using a novel spatio-temporal pooling layer to aggregate information over space and time. The training process in this case benefited from unlabeled frames that led to improved prediction performance. 
More recently, in order to address the problem of data labeling for supervised semantic segmentation, [115] employed gradient and log-based active learning for semantic segmentation to distinguish between crop and weed plants in an agricultural field. To foster research in the context of temporally-coherent semantic segmentation, [116] presented a video segmentation benchmark using the proposed DAVIS (Densely Annotated VIdeo Segmentation) dataset. The benchmark helped increase the community's interest in developing novel techniques to address this task. [117] proposed a one-shot video object segmentation method that used only a single labeled training example, that is, the first frame. In order to reduce the reliance on large-scale data, [118] leveraged existing annotated image datasets to augment video training data, enabling them to learn diverse saliency information while also preventing overfitting due to limited original data. In an extended work [119] , the authors considered the task of unsupervised video segmentation, leveraging object saliency cues to achieve temporally-consistent pixel labeling. [120] presented a learning strategy based on single image segmentation and incremental frame-by-frame processing to refine video object segmentation. While spatio-temporal cues have been demonstrated to be useful for segmentation in videos, [121] presented a learning framework to correctly group pixels based on motion, to improve object segmentation in static images. Beyond semantic segmentation based on classes, instance-level segmentation -recognizing specific singular areas of interest -is also highly relevant for many robotic applications. Held et al. [122] developed a realtime probabilistic 3D segmentation method that combined spatial, temporal, and semantic information to help in the decision between splitting and merging of the initial coarse segmentation, significantly reducing underand over-segmentation results on the KITTI dataset. The Faster R-CNN [80] approach was also extended by He et al. [123] for instance segmentation, by adding a branch for predicting object masks in parallel with existing bounding box recognition. The proposed system demonstrated state-of-the-art results on the following COCO challenges: instance segmentation, bounding box object detection, and person keypoint detection. Bai et al. [124] presented a CNN combining classical watershed transform and modern deep learning to produce an energy map of an image, where object instances are unambiguously represented as energy basins, enabling direct extraction of high quality object instances. Methods based on metric learning [125] , [126] have been explored for instance-level segmentation, with the aim of learning a pixel-based embedding where pixels of the same instance have a similar embedding. One of the current state-of-the-art methods is the masking score RCNN (MS-RCNN) [127] , based on a network block which learns the quality of the predicted instance masks. Unifying the tasks of semantic segmentation and instance segmentation, [128] proposed a joint scene understanding task, termed "panoptic segmentation", while also defining a novel Panoptic Quality (PQ) metric to evaluate performance on this task. Panoptic segmentation has become the focus of research in many applications, in particular autonomous driving, with a number of novel techniques developed only recently [129] [130] [131] [132] [133] [134] . 
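Since the Panoptic Quality (PQ) metric of [128] recurs throughout the panoptic segmentation literature, a short sketch of its computation for a single class is given below. Segments are matched when their IoU exceeds 0.5 (which makes matches unique), and PQ factors into a segmentation quality and a recognition quality term. Segment extraction and IoU computation are assumed to be done elsewhere; the function name is illustrative.

```python
# A minimal sketch of the Panoptic Quality (PQ) metric proposed in [128], for one class.
# Predicted and ground-truth segments are considered matched when their IoU exceeds 0.5.

def panoptic_quality(match_ious, num_pred, num_gt):
    """match_ious: IoU values (> 0.5) of matched prediction/ground-truth pairs."""
    tp = len(match_ious)
    fp = num_pred - tp            # unmatched predicted segments
    fn = num_gt - tp              # unmatched ground-truth segments
    if tp + fp + fn == 0:
        return float("nan")       # class absent from both prediction and ground truth
    # PQ factors into segmentation quality (SQ, mean IoU of matches)
    # and recognition quality (RQ, an F1-style detection score): PQ = SQ * RQ.
    sq = sum(match_ious) / tp if tp else 0.0
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)
    return sq * rq

# Example: three matched segments, one false positive, two false negatives -> PQ = 0.5.
print(panoptic_quality([0.9, 0.75, 0.6], num_pred=4, num_gt=5))
```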
It is likely that future 2D segmentation approaches will target several of the goals above in a joint manner. 2) Semantic Segmentation using Depth and Other Sensing Modalities: While visual segmentation using RGB imagery or video has been a significant focus of the research community, other sensing modalities can aid in segmentation, which in turn can then help the sensing pipeline. These other modalities are not just depth-related (in the case of commonly used LiDAR sensors on autonomous vehicles for example), but can include other sensors including infrared and multispectral cameras. Silberman et al. [135] proposed a method to parse "messy" indoor scenes into floors, walls, supporting surfaces, and object regions, while also recovering support relationships. They classified objects into "structural classes" based on their role: a) ground, b) permanent structures like walls and ceilings, c) large furniture, and d) props that are movable. For reasoning about support, they used physical constraints and statistical priors. Gupta et al. [136] designed a set of algorithmic tools for perceptual organization and recognition in indoor scenes from RGB-D data. Their system produced contour detection, hierarchical segmentation, grouping by amodal completion, object detection and semantic labeling of objects and scene surfaces. [137] explored the use of a Multi-modal stacked Auto-Encoder (MAE) [138] to jointly estimate per-pixel depth and semantic labels. The authors demonstrated that learning a shared latent representation aids in semantic scene understanding even when input data has imperfect or missing information. [139] presented UpNet, a fusion architecture that can learn from RGB, Near-InfraRed, and depth data to perform semantic segmentation. They also introduced a first-of-its-kind multispectral segmentation benchmark of unstructured forest environments. [140] explored the use of easy-to-obtain synthetic data for semantic segmentation, particularly making use of geometry in the form of synthetic depth maps. They used geometric information to first adapt the input to the segmentation network via image translation, and then adapt the output of the network via simultaneous depth and semantic label prediction. More recently, [141] addressed the problem of open-set instance segmentation by projecting point clouds into a category-agnostic embedding space, where clustering is used to perform 3D segmentation independent of semantics. Semantic segmentation can also be used to improve the results of processing some of the sensing modalities. [142] proposed semantically selecting dynamic object instances (and removing the background) from stereo pairs of images, in order to estimate optical flow for each instance individually. The estimated flow for the background and foreground could then be merged to obtain the final, improved result. [143] explored the use of semantics to improve per-pixel disparity estimation by simultaneously predicting per-pixel semantic labels and using intermediate CNN feature embeddings from the segmentation network. [144] presented RTS2Net -a single, compact, and lightweight architecture for real-time "semantic stereo matching" (the term coined to refer to the combination of two tasks: depth estimation and semantic segmentation). The two tasks are accomplished by corresponding sub-networks, along with a disparity refinement network that uses the output from the predicted semantics. 
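Several of the joint depth-and-semantics systems above share a common design: a single backbone produces a shared feature representation, and separate lightweight heads decode it into per-pixel class scores and a per-pixel depth or disparity map. The sketch below illustrates that general pattern only; the layer sizes, upsampling choices and class count are assumptions, and it does not reproduce any of the cited architectures.

```python
# A minimal sketch of the shared-encoder, multi-head design used by several joint
# depth-and-semantics (and, more broadly, multi-task) networks discussed in this section.
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, num_classes=19, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(             # shared representation
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.seg_head = nn.Sequential(             # semantic segmentation branch
            nn.Conv2d(feat, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False))
        self.depth_head = nn.Sequential(           # depth / disparity branch
            nn.Conv2d(feat, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False))

    def forward(self, x):
        f = self.encoder(x)
        return self.seg_head(f), self.depth_head(f)

# Training would typically sum a cross-entropy loss on the segmentation logits and a
# regression loss on the depth map, with task weighting as a key design decision.
seg_logits, depth = SharedEncoderMultiTask()(torch.randn(1, 3, 256, 256))
```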
[145] explored the use of pyramid cost volumes, defined through multiple scales of feature maps, to better capture the disparity details in stereo matching, while also incorporating semantic segmentation to rectify disparity values along object boundaries. The latter task is achieved through the use of a single semantic cost volume (cost volumes are the set of correspondence costs per pixel used to infer the optimal disparity map). More recently, [146] proposed a novel architecture that leverages a pre-trained semantic segmentation network to self-supervise monocular depth estimation via pixel-adaptive convolutions. 3) Jointly Learning Semantic Segmentation with Other Tasks: The semantic segmentation process closely relates to other tasks required for robots and autonomous vehicles like depth estimation, and researchers have investigated jointly learning it with these other tasks. [147] developed a real-time implementation of ENet [106] to simultaneously address three autonomous driving related tasks: semantic segmentation, instance segmentation, and depth estimation. They proposed a shared encoder but with different decoder branches for the three tasks. [148] presented a multi-task CNN to jointly handle a large number of functional tasks: boundary detection, normal estimation, saliency estimation, semantic segmentation, human part segmentation, semantic boundary detection, region proposal generation and object detection. They demonstrated effective scaling up of diverse tasks with limited memory and incoherently annotated datasets. [149] explored joint learning of two tasks, depth estimation and semantic segmentation, using a single model with an uneven number of annotations per modality. The latter was achieved using hard knowledge distillation through a teacher network. Through computational efficiencyrelated modifications of an existing semantic segmentation network, they achieved real time operation. They demonstrated their system's utility through 3D reconstruction based on SemanticFusion [150] for both indoor and outdoor data. [151] proposed a CNN to predict polarization information (per-pixel polarization difference) along with semantic segmentation using monocular RGB data, as opposed to using a micro-grid array polarization camera or polarized stereo camera. The work was motivated by the challenges of specular scene semantics like water hazards, transparent glass, and metallic surfaces, where polarization imaging often complements RGB semantic segmentation. Finally, as an extension of the Stixel representation [152] , [153] combined both depth and semantics per pixel for compactly representing an environment, benchmarking pixel-wise semantic segmentation and depth estimation tasks. For robots carrying a suite of sensors, joint learning based on input from these different sensing modalities can be leveraged to address increasingly complex and challenging tasks. Semantic representation of an observed scene can be defined at various levels of detail and is typically motivated by the application scenario. While image classification produces a single semantic label for an image, object recognition and semantic segmentation provide labeling at region-and pixel-level respectively. However, a scene can be represented in distinct ways that may or may not explicitly take into account the pixel-or region-level semantics, but still have a high-level representation which is semantically meaningful. 
We explore these additional scene representation approaches in this section, including scene graphs and zero-shot learning of generalized semantic scene representations. The representation of a scene involves the extraction of visual information that can summarise the contents of the image captured from a particular scene [154]. For typical computer vision and robotics problems, imagery constitutes a broad domain where the variability is unlimited and unpredictable, even for images with the same semantic meaning [154]. Historically, the main goal of an effective image representation has been the reduction of the semantic gap [155], defined as the difference between the visual information extracted from an image and its high-level semantic interpretation. Traditional image representations transformed the original image pixels into feature spaces of color, shape or texture, with the goal of amplifying particular characteristics of the image that are relevant for representing the semantic information of the scene, while at the same time suppressing irrelevant information. Such a goal can be partially achieved by promoting feature invariance properties that make the representation robust to (generally) non-relevant distortions to the image, such as geometric transformations, illumination changes, and weather variations. One important trade-off encountered in doing so was between discriminative power and robustness to distortions. Much of the classical work on the design of feature representations was therefore significantly influenced by the goal of finding an effective operating point that balanced discriminative and invariance properties. Color invariance is typically achieved by transforming the original RGB (Red Green Blue) space into a more robust HSV (Hue Saturation Value) space (other similar spaces have also been proposed), where the hue channel is invariant to illumination and to the camera direction with respect to an object's orientation [154]. Invariance to shape information is commonly obtained by computing image derivatives using scale- and rotationally-invariant operators [156]. Texture invariance is generally obtained by representing an image in the frequency domain [154]. The classical representation of images involves the partitioning of the image into regions, and the computation of features from each region. Image partitioning can be done in a dense manner [154], where the whole image is broken into regions of fixed [157] or variable [158] size, or in a sparse manner [72], with the selection of salient image patches. Features (including color, shape and/or texture) are then extracted from each region and summarised into a fixed-dimension representation using, for example, a particular spatial distribution [159] or a histogram [160]. Images can then be represented by a collection of these features [154]. Some of the most prominent image representations in computer vision and robotics, before deep learning representations largely replaced them [57], were based on Bag of Visual Words (BoVW) [161]. In brief, these methods were based on a sparse partitioning of the image, where each region is labeled as a particular visual word, with visual words being learned from a large collection of image regions. The image is then represented by a histogram of the visual words.
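The following minimal sketch illustrates the BoVW pipeline just described: a visual vocabulary is learned by clustering local descriptors, and each image becomes a normalized histogram of visual-word assignments (the random descriptors below are stand-ins for real local features such as SIFT, and the vocabulary size is an arbitrary choice):

```python
# A minimal sketch of the Bag of Visual Words pipeline: cluster pooled local
# descriptors into a visual vocabulary, then represent each image as a
# normalized histogram of visual-word counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 128))      # pooled local descriptors
vocab = KMeans(n_clusters=64, n_init=10, random_state=0).fit(train_descriptors)

def bovw_histogram(image_descriptors: np.ndarray) -> np.ndarray:
    """Assign each local descriptor to its nearest visual word and count."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()                          # normalize for comparison

query = bovw_histogram(rng.normal(size=(300, 128)))
reference = bovw_histogram(rng.normal(size=(280, 128)))
similarity = 1.0 - 0.5 * np.abs(query - reference).sum()   # histogram overlap
print(f"visual-word histogram similarity: {similarity:.3f}")
```

Comparing such histograms (here with a simple histogram-intersection-style score) is what underpins BoVW-based image retrieval and place recognition.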
For example, in the place recognition domain (again, prior to the use of deep learning representation approaches), Williams et al. [162] presented a comparison of different image representations, concluding that methods based on BoVW were generally superior. In the following subsections, we discuss two distinct ways of representing a scene: Scene Graphs and Generalized Semantic Representations, both of which are critical to semantic scene understanding for robotics. 1) Scene Graphs: Understanding a visual scene fully requires knowledge beyond just what is present in the scene. Much of the meaningful information can be extracted by examining the relationships between objects and other components of the scene, which is one of the main motivations of the scene graph research area. Lin et al. [163] developed a CRF-based method to address indoor scene understanding from RGB-D data. The method integrates information from 2D segmentation, 3D geometry, and contextual relations between scenes and objects to classify the volume cuboids extracted from the scene. With this formulation, scene classification and 3D object recognition tasks are coupled and can be jointly solved through probabilistic inference. Johnson et al. [164] proposed scene graphs that represent objects, attributes of objects, and the relationships between them in order to retrieve semantically related images. Their method used CRFs to ground the scene graphs to local regions in the images. As an extension of [164], Schuster et al. [165] explored the scope of scene graphs created automatically from a natural language description of a scene, using rule-based and classifier-based parsing to improve image retrieval. Following the work by [164], Xu et al. [166] developed a method to generate scene graphs by end-to-end learning, using an RNN and iterative message passing. The main focus of this work was on improving reasoning about spatial relationships based on surrounding contextual cues within an image. Li et al. [167] presented a Multi-level Scene Description Network (MSDN) that jointly learns to leverage mutual connections across three semantic levels of scene understanding: object detection, scene graph generation, and region captioning. Zellers et al. [168] explored motif-based scene parsing, looking for repeated patterns or regularly appearing substructures, in this case within graph representations of the scene. In particular, they proposed Stacked Motif Networks to capture higher order motifs in scene graphs, focusing on global contextual information to inform local predictors of objects and relations. Herzig et al. [169] developed a method that constrains the neural network architecture to be invariant to permutations of structurally identical inputs, based on the conditions of Graph-Permutation Invariance (GPI). Johnson et al. [170] presented a method to generate images from complex sentences using scene graphs, which enables explicit reasoning about objects and their relationships. Their method uses graph convolutions to compute a scene layout, which is then converted into an image using a cascaded refinement network trained adversarially against a pair of discriminators. More recently, Ashual and Wolf [171] introduced a method that generated multiple diverse output images per scene graph using scene layout and appearance embedding information. Chen et al. [172] explored a semi-supervised method that automatically estimated probabilistic relationship labels for unlabeled images using a small number of labeled examples. The goal of this work was to address a recurring issue that affects much of the research in this field: that models have typically been trained on small sets of visual relationships, with each relationship requiring thousands of labeled samples.
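As a concrete illustration of the representation itself (our own toy example, not the formulation of any specific cited work), a scene graph can be held in a very simple data structure: object nodes carrying attributes and, optionally, grounded image regions, plus a list of subject-predicate-object relations:

```python
# A minimal sketch of a scene graph data structure: nodes are detected objects
# with attributes, edges are subject-predicate-object relationships. The scene
# and relationship vocabulary are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str
    attributes: list = field(default_factory=list)
    bbox: tuple = (0, 0, 0, 0)          # (x, y, w, h) image region, if grounded

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)      # id -> ObjectNode
    relations: list = field(default_factory=list)    # (subj_id, predicate, obj_id)

    def add_object(self, oid, node):
        self.objects[oid] = node

    def relate(self, subj, predicate, obj):
        self.relations.append((subj, predicate, obj))

    def describe(self):
        for s, p, o in self.relations:
            yield f"{self.objects[s].name} {p} {self.objects[o].name}"

g = SceneGraph()
g.add_object("cup1", ObjectNode("cup", ["red"], (120, 80, 40, 50)))
g.add_object("table1", ObjectNode("table", ["wooden"], (60, 100, 300, 180)))
g.relate("cup1", "on top of", "table1")
print(list(g.describe()))   # ['cup on top of table']
```

Scene graph generation methods differ mainly in how they populate such a structure from images; the structure itself is what enables relational reasoning, retrieval and captioning.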
Addressing the issue of scene graph generation dataset biases, Gu et al. [173] introduced a scene graph generation method that uses external knowledge and an image reconstruction loss to reduce this bias. To further address these biases, zero-shot learning techniques have been proposed that aim to achieve a generalized scene representation; these are discussed in the next subsection. 2) Semantic Generalized Representations via Zero-Shot Learning: A common practical issue in robotics is that the system can only be exposed to a subset of the classes it is expected to encounter during actual deployment. Zero-Shot Learning (ZSL) addresses this problem: it focuses on the scenario where the training process has access to the visual and semantic representations of only a subset of the classes. This observed subset is labeled as the seen classes; the challenge during deployment and testing involves the classification of unseen visual classes that do not belong to the set of seen classes [174]. Zero-shot learning generally depends on learning a mapping from the visual to the semantic space using the seen classes [174] [175] [176] [177], where the hope is that this learnt mapping can be reliably used to classify the unseen classes. [178] proposed SP-AEN to tackle the semantic loss problem in ZSL. Acknowledging that classification and reconstruction are contradictory objectives, an independent visual-to-semantic embedding was introduced for each of the two tasks. Both of these semantic embeddings were used for adversarial learning, and to transfer information between the two tasks. An alternative to mapping from the visual to the semantic space is to assume the availability of the semantic representations of the unseen classes during training. This assumption allows the implementation of a conditional generative model that is trained to generate visual representations from their semantic representations using the seen classes. Once the generator is stable, it is then possible to generate visual representations of the unseen classes, which can be used to train a visual classifier on real visual representations of the seen classes together with generated representations of the unseen classes [179]. This method can work better than others for the problem of Generalised Zero-Shot Learning (GZSL), where the testing process involves the classification of both seen and unseen classes. GZSL is more challenging than vanilla ZSL because the classifier tends to be severely biased towards the classification of seen classes. Much of the current research in the field is therefore focused on achieving a balance between the classification of seen and unseen classes. The general theme of this research is to combine the semantic and visual representations, with modulation of the classification of seen and unseen classes based on the input test visual sample. For example, Felix et al. [180] proposed a cycle consistency loss to generate visual representations from semantic ones, and then in reverse to regenerate the semantic representation. The modulation of seen and unseen classes has also been addressed by Atzmon and Chechik [181]. One can also learn a joint semantic and visual space that mitigates the need for learning a mapping between these two spaces [182]; in this case, the classifier is learnt in conjunction with a domain classifier that differentiates between seen and unseen domains.
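The basic visual-to-semantic mapping formulation that much of this literature builds on can be sketched in a few lines (our own synthetic example with made-up attribute vectors, not any specific cited method): fit a linear map from visual features to class attribute vectors on seen classes only, then label a test sample with the nearest attribute vector:

```python
# A minimal sketch of visual-to-semantic mapping for zero-shot classification.
# Class attributes, feature dimensions and the ridge regulariser are
# illustrative assumptions; real systems use learned embeddings and deep maps.
import numpy as np

rng = np.random.default_rng(1)
attr = {"horse": [1, 0, 1], "tiger": [1, 1, 0], "zebra": [1, 1, 1]}  # class attributes
seen, unseen = ["horse", "tiger"], ["zebra"]

def sample(cls, n=50):
    """Synthetic 'visual features' clustered around each class's attributes."""
    return np.array(attr[cls]) + 0.1 * rng.normal(size=(n, 3))

X = np.vstack([sample(c) for c in seen])                     # seen visual features
S = np.vstack([[attr[c]] * 50 for c in seen]).astype(float)  # matching attributes

# Ridge-regularised least-squares map W from visual space to attribute space.
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(3), X.T @ S)

test = sample("zebra", n=1)                 # an instance of an unseen class
projected = test @ W                        # predicted attribute vector
candidates = {c: np.array(attr[c], float) for c in seen + unseen}
pred = min(candidates, key=lambda c: np.linalg.norm(projected - candidates[c]))
print("predicted class:", pred)             # ideally "zebra"
```

Note that the candidate set includes both seen and unseen classes, which is exactly where the GZSL bias problem appears: with real data, the learned map tends to pull unseen samples towards seen-class attributes.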
Niu et al. [183] proposed to reduce the projection domain shift, defined as the lack of generalization of a visual-semantic mapping learned on seen categories to unseen categories. Their approach involved learning an adaptive mapping for each unseen category, followed by progressive label refinement using unlabeled test instances. More recently, [184] proposed LsrGAN, which explicitly transfers the knowledge of semantic relationships between seen and unseen classes by generating corresponding mirrored visual features. The overarching goal of object detection, semantic segmentation and scene representation is to enable effective scene understanding. A common observation across all of these fundamental research areas is that additional sources of information, in the form of sensor data, knowledge repositories and textual descriptions, can significantly enhance task performance. This is highly relevant to robotics, as a robot typically carries multiple sensors, can actively explore its environment and has access to additional resources from the "cloud" online or on a local network. These tasks also have high functional overlap with those required of robots in many applications. In the following sections, we discuss how semantic understanding approaches like those discussed so far can be used in robotics, and the associated further research advances required to do so. Mapping, localization, and navigation are among the longest running research areas in mobile robotics, as key capabilities for many autonomous systems operating in the sky, in water, or on land. In neuroscience and biology, there has been a long running debate about whether animals explicitly construct maps for navigation and to what extent they do so, versus achieving navigation through behavioural and reactive techniques [185]. Likewise in robotics, there have been multiple research streams in the robot navigation space; some involving reactive-style techniques like behaviour-based robotics [186] [187] [188], but many involving the explicit construction of a map and localization of a robot within that map, in order to plan and execute navigation tasks. Although the field has been an active one for many decades, much of the pivotal work in modern mapping systems occurred in the late 90s and early 2000s with the advent of Simultaneous Localization And Mapping (SLAM), which is interchangeably referred to as both a problem field and a desired capability. Accompanying the development of modern SLAM techniques was the widespread utilization of sonar and, later, laser range sensors [2] [3] [4], which facilitated the production of highly accurate, geometric maps of the environment such as occupancy grid maps. SLAM and geometric maps enabled a significant range of navigational capabilities for robots actually deployed in the real world, especially in domains like mining and logistics [189] [190] [191] [192] [193]. But as demand grew for more sophisticated autonomous systems that could understand and interact in richer ways with their environments [194] [195] [196], and with the robots and people occupying those environments, researchers have been focusing on enriching mapping representations, which is where semantics has played a key role.
In recent years semantic SLAM [197], [198], semantic mapping [150], [199] and semantically-informed localization [200], [201] have all emerged as major new areas of focus in this research field. Initial forays into semantic approaches were largely based on adding semantic labeling or segmentation on top of existing traditional map representations [198], [199], [202], but as the field has progressed, semantics has become an increasingly integral component of these techniques [197], [203]. Before we review these systems, it is appropriate to first briefly revisit some of the key background in robotic mapping and navigation research. A practical SLAM system typically employs a combination of interoceptive sensors (e.g., rotary encoders, accelerometers, inertial measurement devices) and exteroceptive sensors (e.g., LiDAR, sonar, radar, cameras). The mapping functionality in SLAM derives largely from the capability of the exteroceptive sensors to measure (directly or indirectly) structural or visual elements in the scene. While LiDAR and sonar were dominant sensing modalities in the early days of SLAM [2] [3] [4], the use of cameras (e.g., standard RGB cameras, depth cameras, and event cameras) as the primary sensors for SLAM (thus leading to visual SLAM) is currently a major area of interest in robotics and computer vision. A key reason behind the popularity of visual SLAM is the flexibility and relatively low cost of optical sensing devices (e.g. consumer RGB and depth cameras). Moreover, compared to other exteroceptive sensors, cameras provide a richer source of information and thus the prospect of extracting higher-level understanding (e.g. semantics) of the scene. 1) Classical 3D Maps: Classical visual SLAM methods have largely focused on extracting the geometric 3D structure of the environment. Feature-based methods [204] [205] [206] [207] [208] [209] (both monocular and stereo) construct maps in the form of sparse 3D point clouds, where the 3D points are typically reconstructions of the salient local features detected in the images. To facilitate the generation of 2D-3D correspondences, descriptors of the local features associated with the reconstructed 3D points are also often embedded in the map. In contrast to feature-based methods, direct methods [210] [211] [212] [213] utilize a photometric error formulation to estimate structure and motion. Maps produced by direct SLAM methods are typically semi-dense or dense 3D point clouds, where each reconstructed 3D point is associated with corresponding pixels (that are not necessarily locally salient) observed across multiple frames in the input sequence. For visualisation, the point clouds are often texture-mapped with pixel RGB values. In contrast to conventional cameras, depth cameras [214] such as the Microsoft Kinect are able to directly acquire depth information at frame rate (such cameras are often called RGB-D cameras since they also record the RGB channels, in addition to depth). Accordingly, RGB-D SLAM algorithms [215] [216] [217] [218] [219] are able to construct dense volumetric 3D maps of the environment. Since depth cameras can acquire metrically consistent depths, the 3D maps generated via RGB-D SLAM algorithms do not suffer from the global scale ambiguity problem which affects monocular visual SLAM systems.
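The distinction between feature-based and direct formulations discussed above can be summarized in two cost functions (written here in our own generic notation rather than that of any specific cited system): feature-based methods minimize a geometric reprojection error, while direct methods minimize a photometric error,
\[
E_{\text{reproj}} = \sum_{i} \left\| \mathbf{u}_i - \pi\!\left(\mathbf{T}\,\mathbf{X}_i\right) \right\|^2, \qquad
E_{\text{photo}} = \sum_{\mathbf{u} \in \Omega} \left\| I_{\text{ref}}(\mathbf{u}) - I_{\text{cur}}\!\left(\pi\!\left(\mathbf{T}\,\pi^{-1}(\mathbf{u}, d_{\mathbf{u}})\right)\right) \right\|^2,
\]
where \(\mathbf{X}_i\) are 3D map points, \(\mathbf{u}_i\) their detected image locations, \(\pi\) the camera projection, \(\mathbf{T}\) the estimated camera pose, \(\Omega\) the set of pixels considered, and \(d_{\mathbf{u}}\) the depth of pixel \(\mathbf{u}\) in the reference frame.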
Classical visual SLAM methods (as outlined above) have reached a level of maturity where accurate 3D maps (sparse or dense point clouds) and localisation of the observer within the map can be efficiently computed (though some require hardware accelerators such as GPUs). However, most practical robotics applications require more than just 3D maps; semantic SLAM is a cogent and timely extension to classical visual SLAM. 2) Topological Maps: Topological maps represent an environment as an abstract graph [220], where nodes represent distinct places a robot has visited and edges between the nodes represent topological relations like proximity and order [221]. One of the limitations of classic metric maps has been the accumulation of error (drift) in global coordinates, despite the use of multiple sensors, beacons and elaborate error-tracking systems [222]. Topological maps do not suffer as directly from the accumulation of movement errors, as the robot has only to navigate locally between adjacent place nodes, although drift is still an issue for loop closure. Furthermore, topological maps are highly scalable when compared to classical metric maps that use detailed a priori models of the world [223], [224]. Topological SLAM systems have been widely explored in the past, with the use of multisensory approaches [225] gradually being replaced by visual similarity-based metric error correction via loop closures [226] [227] [228]. One of the key enablers for relatively modern topological mapping systems has been the development of Bag-of-Visual-Words (BoVW)-like methods [161], leading to robust, large-scale appearance-based topological SLAM systems like FAB-MAP [229]. Existing surveys on topological maps [230] and visual place recognition [231] provide further details about developments in this field in recent years. 3) Hybrid Approaches: Hybrid map representations have been demonstrated to achieve an ideal balance between classic metric maps and topological maps [232], where a map is represented as a hierarchy. In such representations, metric information from the local geometry of the scene is incrementally fused into a global "topometric" map, which is defined at a large scale using topological relations between distinct places (the hierarchy is almost always globally topological and locally metric rather than the reverse). Thrun [233] proposed an integration of grid-based and topological maps, where the latter partitions the former into coherent regions. Simhon and Dudek [234] explored a hybrid map representation where local metrically accurate maps, dubbed "islands of reliability", form nodes of a topological model of the world, thus avoiding the need to perform large-scale error integration. The Atlas framework developed by [235] comprised an interconnected set of local coordinate "frames". Each frame is a local metric map of the environment, connected to other frames via transformations represented as edges in the global graph of coordinate frames. [236] proposed a hybrid map representation, based on extracting "corners" and "openings" that represented topology, and "lines" that represented local geometric structure, using a 360° laser scanner. [237] explored a landmark-based co-visibility graph representation of the environment, where co-visibility corresponds to the connectivity of a topological map and inter-frame motion is used to encode metric transforms between landmarks.
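A minimal sketch of the topometric idea described above is shown below (our own toy example, not any cited framework): each node stores a small local submap in its own frame plus an appearance descriptor, edges store relative transforms between adjacent nodes, coarse localization is appearance matching over nodes, and metric reasoning stays local.

```python
# A minimal topometric map sketch: globally topological, locally metric.
# Descriptors, submaps and transforms below are synthetic placeholders.
import numpy as np

class TopometricMap:
    def __init__(self):
        self.local_maps = {}   # node_id -> (descriptor, Nx2 points in local frame)
        self.edges = {}        # (from_id, to_id) -> (2x2 rotation, translation)

    def add_node(self, nid, descriptor, local_points):
        self.local_maps[nid] = (np.asarray(descriptor, float),
                                np.asarray(local_points, float))

    def connect(self, a, b, theta, t):
        c, s = np.cos(theta), np.sin(theta)
        self.edges[(a, b)] = (np.array([[c, -s], [s, c]]), np.asarray(t, float))

    def coarse_localize(self, query_descriptor):
        """Topological step: pick the node whose descriptor best matches."""
        q = np.asarray(query_descriptor, float)
        return min(self.local_maps,
                   key=lambda nid: np.linalg.norm(self.local_maps[nid][0] - q))

    def express_in(self, a, b, points_in_b):
        """Metric step: map points from node b's local frame into node a's frame."""
        R, t = self.edges[(a, b)]
        return points_in_b @ R.T + t

tmap = TopometricMap()
tmap.add_node("room", [1.0, 0.0], [[0.0, 0.0], [1.0, 0.0]])
tmap.add_node("corridor", [0.0, 1.0], [[0.0, 0.0], [0.0, 2.0]])
tmap.connect("room", "corridor", theta=np.pi / 2, t=[2.0, 0.0])
node = tmap.coarse_localize([0.1, 0.9])                       # matches "corridor"
print(node, tmap.express_in("room", "corridor", tmap.local_maps["corridor"][1]))
```

Because global consistency is only maintained through the chain of relative transforms, drift is confined to the edges rather than accumulating in a single world frame.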
More recently, hybrid maps have been explored beyond the bounds of two-layer hierarchies, and have started to introduce semantic concepts. [202] developed a probabilistic framework based on chain graphs to create a hierarchical hybrid map comprising four layers: sensory (accurate metric map), place (places and paths information), categorical (geometry and appearance of objects and landmarks), and conceptual (instances of spatial concepts, for example relating a cereal box to a kitchen). The proposed system used laser and camera sensors in indoor environments and attempted to relate conceptual knowledge with object and place semantics. [238] presented a cloud service-based semantic mapping system comprising an ontology to encode concepts and relations in maps and objects (CAD models), built on top of an RGB-D metric map using both keyframe-based and 3D occupancy grid-based map representations. The proposed system enabled semantic mapping of novel environments and searching for novel objects within a semantic map. In [239], the authors used stereo frames to obtain depth, scene flow, visual odometry and semantic segmentation, all of which formed the input for semantic mapping based on 3D occupancy. This enabled reasoning about objects, and led to object instance discovery based on temporally consistent shape, appearance, motion, and semantic cues in the map, while also being able to handle dynamic objects. More recently, [240] presented a hierarchical framework for probabilistic semantic mapping using multiple cooperative robots in a distributed setting. Although hybrid maps, particularly those involving a semantic layer, are more amenable to human-robot interaction, they are still in a phase where certain implementation choices might limit their universal applicability. It is still unknown whether, given the complexity of robotic tasks and applications, a generalized solution is viable, or whether specialized solutions will be required for each class of applications. Semantic mapping has been explored both in the context of place-level and object-level representation of the environment, where a more detailed semantic representation typically combines place and object labels in a hierarchical manner to create a hybrid map, as also discussed in the previous subsection. In this subsection, we first cover the use of scene classification and place categorization, where researchers have focused on abstracting the changing appearance of the scene into meaningful place labels as the robot explores an environment. We then discuss one of the clearest opportunities to enrich robot mapping systems: the use of objects. Since objects can be categorized and are meaningful by themselves, all of this information can be used in a variety of ways by semantic mapping and localization approaches. In our discussion below, we further split the object-based approaches into two categories: those that use prior knowledge of the expected objects, often including 3D models, and those that have limited or no knowledge available beforehand, instead learning how to use objects at deployment time. 1) Scene Classification and Place Categorization: Semantic maps of an environment can be constructed by categorizing places with semantic labels which are typically pre-defined. Such semantic categories for different places can be defined by only considering the functionality of a place, for example, "kitchen", "printer area", and "seminar room" [241].
Alternatively, a more general hierarchical approach can also consider structural properties, for example, a broad classification into "room" and "corridor" followed by more specialized room labels like office and classroom [242]. [241] combined vision and range information to extract objects (e.g. monitor and coffee machine) and geometric features, respectively, which were fed to AdaBoost [243] to classify places and perform efficient global localization. [244] used semantics as "background knowledge" to explicitly represent environments with corridors and other indoor structures. Such a semantic distinction led to efficient multi-robot exploration of an environment, through learning a behaviour which rewarded robots for preferentially exploring corridors, as these lead to unexplored branches of connecting rooms. [245] developed a CNN-based place classification system trained using black and white images of occupancy grid maps (black uncarved space vs white carved free space) obtained from 2D laser scan data. The semantic classes used for the task included corridor, doorway, and room. [246] presented a CNN-based semantic mapping system that overcame the closed-set limitations of supervised classification by complementing the system with one-vs-all classifiers, in order to recognize new semantic classes online. The proposed system used Bayesian filtering to incorporate prior knowledge and ensure temporal coherence. Furthermore, they also demonstrated the effective use of semantics for improving object recognition and modulating a robot's behavior during navigation tasks. [247] presented a scene classification CNN that incorporates object-level information by regularization of semantic segmentation, demonstrated on indoor RGB-D data. [248] proposed a Dynamic Bayesian Mixture Model (DBMM), a mixture of heterogeneous base classifiers, that incorporates time-based inferences from previous class-conditional probabilities and priors. Their system used 2D laser scans and indoor data for experiments, and extended their prior work [249]. [250] explored the use of a fully-convolutional CNN for learning better feature representations for the task of semantic place categorization. This system used a Naive Bayes Nearest Neighbor (NBNN) method within the learning framework for end-to-end training. [251] developed a novel method of domain generalization for place classification to deal with unknown or unseen deployment scenarios, where the test data might not be similar to the data used for training. In this case, domain generalization was achieved by automatically computing a model for the unknown domain through combining models of the known domains. [252] presented TopoNets for semantic mapping based on the topological structure of the environment, using a Sum-Product Network (SPN) as the backbone for learning and inference. The proposed system was demonstrated on various tasks: place classification, inferring semantics of unexplored space, and novelty detection using the COLD dataset, which includes semantic classes like doorway, kitchen, office, bathroom and laboratory. Most of the aforementioned place categorization systems are based on supervised learning, consider only a finite set of pre-defined place labels, and are therefore not suitable for recognizing newly encountered place categories.
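Before turning to open-set approaches, the following minimal sketch illustrates the kind of recursive Bayesian filtering used by several of the systems above (for example [246]) to impose temporal coherence on per-frame place predictions; the transition "stickiness" and per-frame likelihoods are illustrative values only, not taken from any surveyed system:

```python
# A minimal sketch of recursive Bayesian filtering over discrete place
# categories to smooth noisy per-frame classifier outputs.
import numpy as np

classes = ["corridor", "doorway", "room"]
stay = 0.9                                    # probability of remaining in a place
T = np.full((3, 3), (1 - stay) / 2) + np.eye(3) * (stay - (1 - stay) / 2)
belief = np.ones(3) / 3                       # uniform prior over place categories

# Noisy per-frame class likelihoods from a place classifier (one row per frame).
frame_likelihoods = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],   # a single spurious "room" detection
    [0.7, 0.2, 0.1],
])

for like in frame_likelihoods:
    belief = T.T @ belief          # predict: places change slowly over time
    belief *= like                 # update with the current frame's evidence
    belief /= belief.sum()         # normalize to a proper distribution
    print(classes[int(np.argmax(belief))], np.round(belief, 2))
```

The single inconsistent frame is absorbed by the filter rather than flipping the map label, which is the temporal coherence these systems aim for.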
[253] developed an online Bayesian change-point detection framework, PLISS (Place Labeling through Image Sequence Segmentation), enabling discovery of novel place categories, along with uncertainty estimation through consideration of spatio-temporal characteristics. The semantic place labels obtained in this manner were further combined with high-level information like adjacency and place boundaries using a Conditional Random Field (CRF), in order to obtain a semantic map of the environment [254]. [255] proposed HOUP (Histogram of Oriented Uniform Patterns), an image descriptor, as an alternative approach to addressing open-set place categorization. Although the framework does not create new semantic classes, the proposed representation was demonstrated to exhibit a balance between strong discriminative power for (specific) place recognition, and generalization capability for place categorization. 2) Using Pre-defined 3D Object Models: The ability of a robot to track objects in its working environment is critical for performing many tasks. Approaches based on scan registration were among the earliest to emerge in this domain, such as the work presented in [256], which proposed to learn and track 3D object models from RGB-D indoor data by aligning multiple views of these objects within multi-resolution "surfel" maps. Real-time alignment on a CPU was achieved using a probabilistic optimization framework and an efficient variant of ICP. Other approaches have built a point-cloud based 3D map, and then identified objects within it, as in [257]. The authors developed a shape-based method to build a 3D semantic map using point cloud data from an RGB-D camera. Data was segmented based on planes (horizontal/vertical), and an a-priori model library was used to identify objects and extract object features; this information was then used for wheelchair navigation. Another example can be found in [258], where the authors developed a mapping-only framework based on indoor RGB-D data that created a triangle mesh for extracting and classifying planar regions as different furniture objects. The recognized objects were then replaced by their corresponding 3D CAD models following ICP alignment. Their system used OWL-DL (Web Ontology Language, Description Logic) and SWRL (Semantic Web Rule Language) as the ontology for defining object-property relations. [259] presented a hierarchical mapping approach based on detecting and clustering different objects while also considering their spatial relationships using a Bayesian classifier. For example, an office is considered to be constituted by "work-space", "meeting-space", and "storage-space", each of which is further composed of several objects. Many researchers have focused on the ability to build a semantic map while also being able to track objects in the map. In [260], the authors demonstrated the use of object-specific knowledge to obtain accurate maps within a dense SLAM system. They also highlighted how 3D object tracking and 3D reconstruction could benefit each other, thus improving reconstruction of unseen parts and enabling accurate estimation of the scale of the map. [198] presented SLAM++, performing 3D object recognition and tracking to produce an explicit graph of objects (with 6-DoF poses), which is then used in a pose-graph optimization framework for instance-level object-oriented 3D SLAM.
The authors used a database of 3D object models and performed relocalization and loop closures in large cluttered environments, while also enabling interaction with objects. Researchers have also explored the possibility of performing 3D object detection and recognition using a prebuilt 3D map. [261] proposed incremental real-time segmentation of a 3D scene, reconstructed by SLAM, in order to perform 3D object recognition and pose estimation. They also highlighted the advantages of using multiple views of the objects, as opposed to single-view based (2.5D) object recognition. Their system used 3D models and was demonstrated through an AR application. [262] developed a semantic representation of an environment based on sparse point clouds, provided by a SLAM process, and semantic object detections, for example, cars detected through YOLO [263]. They used vision and inertial sensors (accelerometer and gyroscope, now ubiquitous in phones and drones) to obtain semantic and syntactic attributes respectively. These representations were fed to a localization-and-mapping Bayesian filter to enable persistent object representation and re-detection of temporarily occluded objects. 3) Without Pre-defined Object Models: Without explicit knowledge about some or all of the objects that are encountered in an environment, robots must be equipped with a means by which to deal with novel objects. [264] presented a CRF model to jointly classify objects and room labels on a mobile robot, by incorporating information from both recognition of trained objects and classification of novel objects. They created a map with topologically connected rooms and metrically connected object poses, using SURF features for object recognition. [265] proposed an unsupervised geometry-based approach for segmentation of 3D point clouds into objects and meaningful scene structures, which form a high-level representation of a 3D geometric map. They also developed a novel global plane extraction algorithm that enforced planes to be mutually orthogonal or parallel, conforming with man-made indoor environments. [266] presented Fusion++, an online object-level SLAM system with a 3D graph map of arbitrary reconstructed objects, where objects are incrementally refined via depth fusion, and are used for tracking, relocalization and loop closure detection, without intra-object warping. The proposed pipeline uses Mask-RCNN for instance segmentation to initialize per-object TSDF reconstruction, and was demonstrated on RGB-D indoor sequences. The open-set recognition research field is also relevant here [267] [268] [269] [270] [271] [272], and some existing mapping work has focused on dealing with open-set conditions. [199] built a semantic map with both object-level and low-level (point and mesh based) geometric representations that functions under open-set conditions and handles unseen instances. Their pipeline involves feature-based RGB-D SLAM, deep-learnt object detection, and 3D unsupervised segmentation [265]. [273] incrementally builds a database of object models from a traverse of a mobile agent, requiring no prior knowledge of shapes or objects present in the scene. The presented pipeline includes: a Global Segmentation Map (GSM) built from RGB-D images, object-like segment extraction, intra-segment matching and merging with previous instances in the database, and reconstruction of unobserved parts of the scene from merged models.
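Several of the systems above rely on registering observed geometry against stored object models or previously mapped segments. The following is a minimal sketch of the basic point-to-point ICP loop such pipelines build upon (not the specific efficient or probabilistic variants cited); the synthetic model, fixed iteration count and lack of outlier rejection are illustrative simplifications:

```python
# A minimal point-to-point ICP sketch: alternate nearest-neighbor association
# with a closed-form SVD (Kabsch) pose update until the clouds align.
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Closed-form least-squares rotation and translation mapping src to dst."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(source, model, iters=20):
    tree = cKDTree(model)
    src = source.copy()
    for _ in range(iters):
        _, idx = tree.query(src)                      # associate nearest model points
        R, t = best_rigid_transform(src, model[idx])
        src = src @ R.T + t                           # apply the incremental update
    return src

rng = np.random.default_rng(3)
model = rng.random((200, 3))
angle = np.deg2rad(10)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
observed = model @ R_true.T + np.array([0.05, -0.02, 0.1])   # rotated + shifted view
aligned = icp(observed, model)
print("mean residual after ICP:", np.linalg.norm(aligned - model, axis=1).mean())
```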
[274] incrementally builds volumetric object-centric maps using an RGB-D camera, while also reasoning jointly over geometric and semantic cues using a frame-wise segmentation approach. Their system infers high-level category information about detected and recognized elements, and discovers novel objects in the scene without requiring prior information about the objects. The proposed method also enables a distinction between unobserved and free space for enhanced human-robot interaction. Semantic mapping pipelines are typically not computationally cheap. While continual improvements in compute hardware help with this issue, as covered in Section VI, efficiency is always beneficial with respect to cost, power consumption and deployment versatility. Research has therefore focused on improving absolute efficiency and scalability to larger environments. [275] developed a real-time incremental segmentation method for 3D point clouds obtained through SLAM, yielding segmentation in real time with complexity independent of the size of the global model. The proposed method is generally applicable to any frame-wise segmentation and any SLAM algorithm, and was demonstrated in indoor environments. [276] presented dense, large-scale, outdoor semantic reconstruction of a scene in (near) real time that was also capable of handling dynamic objects through semantic fusion. They used hash-based techniques for large-scale fusion and efficient mean-field inference with dense CRFs, claiming it to be the first of its kind. [277] presented highly accurate object-oriented scene reconstruction in real time by using fast and scalable object detection for semantics and geometric incremental segmentation. They reduced computational cost and memory footprint by only labeling segmented regions and not individual elements in the 3D map. [278] performed on-the-fly dense reconstruction and semantic segmentation of 3D indoor scenes using efficient super-voxel clustering and a CRF based on higher-order constraints derived from structural and object cues. A reliance on objects brings with it new challenges, one of the largest being the inconvenient property of objects being movable. [279] presented a dense RGB-D SLAM system that segments the scene into different objects using either motion or semantic cues, while tracking and reconstructing their 3D shapes in real time. It allows objects to move freely by fusing each object's shape over time using only the pixels associated with that object's label. Consequently it is able to deal with dynamic scenes without treating moving objects as outliers, as was the approach in much prior research. [280] presented MaskFusion, a real-time, object-aware RGB-D SLAM system that recognizes, segments, and assigns semantic labels to different objects, even if multiple objects are moving. The proposed system uses image-based, instance-level semantic segmentation to create an object-level semantic map, unlike the voxel-level representations used in prior work. More recently, [281] developed a novel framework for dense piece-wise semantic reconstruction of dynamic scenes using motion and spatial relations, where moving objects are handled by imposing constraints based on the spatial locations of neighboring superpixels. Shape is an important property when dealing with objects, resulting in research focusing on using parameterized geometric primitives.
[282] represented indoor environment objects as cuboids for semantic mapping; the cuboid detection method is based on image segmentation and plane fitting, and the cuboid matching is based on features like emptiness, orientation, surface coverage, and distance from edges. [283] presented QuadricSLAM, a factor graph based SLAM system that uses dual quadrics to represent 3D landmarks, derived from 2D object detections obtained over multiple views. They proposed a new geometric error formulation while also addressing the challenges of object occlusions. Building on this work, [284] integrated additional planar and point constraints that help stabilise the SLAM estimate. Later, [285] explored how single-view point cloud reconstructions of objects (via a CNN) can effectively constrain the shape of dual quadric landmarks. [286] developed a framework to detect and compactly represent changes in the environment. This is achieved through multi-scale sampling of point cloud data, change detection using Gaussian Mixture Models, a superquadrics-based representation of the objects that caused the change, and final refinement and optimization. Other related work includes cuboid-based representations from [163] as well as basic geometric primitives from [287]. 4) Scene Graphs at Environment Level: A dense grid representation of an image or an environment can be thought of as a specific case of a graph structure. We discussed the use of scene graphs previously at the image level, based on spatial relationships between various objects or regions observed in a single image. In a robotics context, the concept of a scene graph typically comprises spatial elements at an environment level, which the robot explores over time. In this vein, a number of researchers have explored the use of scene graphs for better representation of the environment [200], [203], [288] [289] [290] [291] [292], leading to improved spatial reasoning. Using the concept of a Directed Acyclic Graph (DAG) [293], [288] proposed a 3D scene representation, the Robot Scene Graph (RSG), which defines the organization of topological and spatial relations between objects, the semantics of such relations, time-based handling, computational assets, and resource sharing. [294] extended this work with a Domain Specific Language (DSL) with four levels of abstraction for the RSG [288], used for model-driven engineering tool chains in robotics. RSG-DSL is capable of expressing (a) application-specific scene configurations, (b) semantic scene structures and (c) inputs and outputs for the computational entities that are loaded into an instance of a world model. [291] presented a 3D scene graph constructed using an RGB-D data processing pipeline: keyframe extraction, spurious detection rejection (object detection based), local 3D scene graph construction, and finally, graph merging and updating for a global 3D scene graph. The proposed method was demonstrated on two tasks: Task Planning and Visual Question Answering. [295] developed a Boltzmann Machines-based generative scene model for representing objects and their spatial relations and affordances while also considering co-occurrences. In order to solve the cross-view localization problem, [200] used a graph of semantic blobs defined over a sequence of images, with a random walks strategy used to match a query sequence. [296] demonstrated a video captioning system based on a spatio-temporal scene graph that explicitly captures object interactions using directed temporal edges and undirected spatial edges.
Recently, [203] proposed 3D Dynamic Scene Graphs (DSGs), which define an environment with multiple layers of abstraction, starting from a metric-semantic mesh, to objects, places and rooms, and eventually to the whole building. While the aforementioned entities form the nodes in the directed graph, the edges encode pairwise spatio-temporal relations and explicitly model dynamic entities in the scene like humans and robots. DSGs represent the current state of the art in terms of a high-level actionable representation of the environment, in which robots can semantically reason about the space they are operating in and interact with humans. C. Semantic Representations for SLAM and 3D Scene Understanding 1) SLAM with Semantic Segmentation: [297] proposed a semantically meaningful indoor mapping system based on the Manhattan World assumption, where photometric cues were combined with pose information and sparse point cloud data obtained from an underlying metric SLAM system. In a subsequent work [298], the authors developed a more efficient approach based on dynamic programming to label the indoor environment as floor, wall or ceiling. Further extending the use of the Manhattan world assumption, [299] presented a joint inference procedure based on a Bayesian framework that combines photometric, stereo and 3D data to reason about floor and ceiling planes, and thus enable effective semantic scene understanding. Beyond reasoning at the surface level, researchers have also explored 3D structure-based semantic representations. [300] proposed a 3D occupancy map based on joint inference of 3D scene structure and semantic labels for outdoor monocular video data. They used a CRF model defined in 3D space and class-specific semantic cues to constrain the 3D structure in areas where multi-view constraints are weak. [301] described a process for transferring labels from 2D to 3D based on Bayesian updates and dense pairwise 3D CRFs for indoor RGB-D data, combined with a fast 2D semantic segmentation approach based on Randomized Decision Forests. [302] integrated multi-view image segmentations within an octree-based 3D map by modelling geometry, appearance, and semantic labeling of surfaces for indoor RGB-D video. For robotics applications, real-time operation is typically necessary; hence, some research has focused particularly on efficient processing. The pipeline proposed in [302] is based on random decision forests and probabilistic labeling using a Bayesian framework, and performs in real time or better on CPUs and GPUs. [303] developed a scale-drift-aware, monocular semi-dense mapping system that can seamlessly switch between indoor and outdoor scenes, where previous methods struggled. The system also saves computation time by enabling frame skipping for 2D segmentation, and by only considering keyframe connectivity and spatial consistency. [150] presented SemanticFusion, a dense semantic 3D mapping system for indoor RGB-D data where semantic predictions from a CNN are probabilistically fused from multiple viewpoints using ElasticFusion (Dense SLAM) [304]. The fusion technique also improves 2D semantic labeling performance over single-frame predictions. The proposed system works in real time at 25 Hz. [305] combined CNN-based 2D semantic labeling with RNN-based data association (found using ElasticFusion [304]) for dense semantic mapping on RGB-D indoor data.
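A minimal sketch of the multi-view fusion idea used by systems such as SemanticFusion is shown below: each map element maintains a class distribution that is multiplicatively updated by the per-frame network outputs observing it. The class set and per-view probabilities here are synthetic, and real systems additionally handle data association and re-normalization policies differently:

```python
# A minimal sketch of per-element probabilistic semantic label fusion across
# multiple viewpoints.
import numpy as np

num_classes = 4                                  # e.g. floor, wall, chair, table
element_belief = np.ones(num_classes) / num_classes

# Softmax outputs for the same surfel/voxel from three different viewpoints;
# the second view is confused, the others agree on class index 2.
views = np.array([
    [0.10, 0.15, 0.60, 0.15],
    [0.30, 0.35, 0.20, 0.15],
    [0.05, 0.10, 0.75, 0.10],
])

for p in views:
    element_belief *= p                          # accumulate per-class evidence
    element_belief /= element_belief.sum()       # renormalize after each view

print("fused distribution:", np.round(element_belief, 3))
print("fused label index :", int(np.argmax(element_belief)))   # expect 2
```

The fused distribution is sharper than any single view, which is why such fusion also improves 2D labeling performance when projected back into individual frames.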
[306] presented a 3D scrolling occupancy grid map, achieving near real-time performance with relatively low memory and computational requirements that remain bounded as the environment size scales. To achieve this, they used a novel hierarchical CRF model with CNN-based 2D segmentation to optimize 3D grid labels on top of a stereo ORB-SLAM based grid map, and used superpixels to enforce smoothness. [307] proposed semantic labeling of indoor RGB-D data based on an encoder-decoder type network with two branches, one each for RGB and depth, which improved performance when combined. They found that the "HHA" representation [308] of depth images, encoding Horizontal disparity, Height above ground, and the Angle the pixel's local surface normal makes with the inferred gravity direction, improved performance. Both sparse and dense fusion also improved performance. [309] used a CNN to predict semantic segmentation for RGB-D sequences. They trained the CNN to predict multi-view consistent semantics in a self-supervised way, enabling improved fusion at test time, enhancing [307] with multi-scale loss minimization. Their system performed better than single-view baselines. Kimera [197] is a metric-semantic SLAM system that works in real time and integrates 2D semantic segmentation, IMU measurements, and optional depth measurements into a dense, semantically annotated mesh of the environment. 2) Semantic Scene Understanding using 3D Point Clouds: While 3D scene information can (somewhat laboriously) be derived from scale-unaware monocular visual SLAM algorithms for online mapping [209], [210], [310], [311], a common avenue for estimating pixel depth in robotics research is to use range sensors, for example, laser scanners [312], [313], stereo camera pairs [314], LiDAR, radar, sonar or RGB-D sensors (like the Microsoft Kinect [315], Intel RealSense, Apple PrimeSense and Google Tango) [49], [316]. The point cloud data from these sensors can be used to obtain a semantic map of the environment either directly [312], [313], [317] or in conjunction with color cameras [316], [318]. In this section, we discuss various semantic mapping techniques based on 3D information obtained either directly from range sensors [313], [316] or indirectly inferred using SfM (Structure from Motion) offline [318] or other commercial solutions [319], like Matterport [320]. [313] proposed a method for creating a 3D semantic model from cluttered point cloud data using a CRF model to discover and exploit contextual information for classifying planar patches, without relying on predefined rules [312]. The results suggested that using co-planar context improved semantic classification results, while the other tested context types did not help. Researchers have also focused on efficiently processing point cloud data by either avoiding redundant processing [318] or using approximation techniques [316]. [318] proposed exploiting the geometry of a 3D mesh model obtained from multi-view SfM reconstruction to avoid the redundant labeling of visually-overlapping individual 2D images. Instead of clustering similar views, their method searched for the ideal view that best supported the correct semantic labeling of each face of the underlying 3D mesh. Their proposed single-image approach performed better than fusing labels from multiple images, while also being more efficient.
[316] presented an efficient semantic segmentation framework for indoor RGB-D point clouds combining a Random Forest classifier and dense CRF to learn common spatial relations via pairwise potentials. The use of parallelization and mean-field approximation for CRF inference enabled a halving of computation time. Many conventional semantic segmentation based techniques consider small-scale point cloud data. Focusing on larger scalability, [319] presented a detection-based semantic parsing method for large-scale indoor point clouds. The proposed pipeline was based on a hierarchical approach that first created semantically meaningful spaces (e.g. rooms) and then parsed them into their structural and building elements (e.g. walls and columns) -all of it in a global 3D space, where the first step injected strong 3D priors into the second. Demonstrated at scale in an area covering over 6000 square meters and 215 million points, the study highlighted a unique set of challenges and opportunities associated with parsing such large point clouds, including the richness of recurrent geometric information and the introduction of additional semantic classes. Recent trends include the advent of modern deep learning-based approaches, the increased availability of 3D point clouds [135] , [319] for robots, and a growing number of shape datasets [321] , [322] . Connected with these trends has been significant growth in research focusing on 3D object detection and semantic segmentation along with shape and scene completion, areas highly relevant to dense semantic mapping. 3D ShapeNets [322] and VoxNet [317] pioneered the use of 3D CNNs for object recognition using 3D point clouds. Previously existing "2.5D" approaches based on RGB-D data mainly considered depth as an additional 2D channel to the RGB input and have been extensively leveraged for tasks like object recognition [308] , [321] , [323] [324] [325] [326] , grasp detection [327] , vehicle detection [328] and semantic segmentation [308] , [329] . The use of voxel grids [317] , [322] to learn 3D representations for such tasks is both conceptually different and significantly robust, as discussed in subsequent sections. 3) Dense Volumetric Representation and 3D CNNs: Wu et al. [322] proposed representing a 3D geometric shape on a 3D voxel grid and learnt a joint probabilistic distribution of binary variables on the grid using a Deep Belief Network (DBN) [330] . Distinct from their generative model aimed at shape reconstruction, Maturana and Scherer [317] presented VoxNet: a basic 3D CNN architecture designed for voxel grid representation of point clouds aimed at real-time 3D object detection, with an order of magnitude fewer trainable parameters than 3D ShapeNets [322] . [331] extended U-Net [105] to 3D U-Net to learn from sparsely annotated volumetric data, applied to biomedical data segmentation. Unlike the use of pre-segmented objects in [317] , [322] , [331] , Huang and You [48] extended the use of 3D CNNs using a LeNet [332] based architecture for semantic segmentation of raw point cloud data in outdoor scenes comprising a variety of semantic classes. 3D voxel representations used with 3D CNNs often lead to coarse semantic segmentation [48] , [333] , by assuming all the points within a voxel belong to the same object class. Addressing this problem, [334] proposed SEGCloud as an end-to-end framework, where coarse voxel predictions from a 3D Fully Convolutional CNN are transferred back to the raw 3D points via trilinear interpolation. 
Fine-grained semantics are then obtained from a global consistency constraint enforced by the Fully Connected Conditional Random Field. 4) Scene Completion and Semantic Segmentation: "Semantic Scene Completion" [335] refers to the joint learning task of scene completion and semantic segmentation, leading to enhanced understanding of the environment, particularly relevant to robotic exploration. Song et al. [335] used single-view depth map observations to produce a complete 3D voxel representation with semantic labels. They proposed SSCNet: an end-to-end 3D CNN that uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. Dai et al. [336] presented a 3D CNN for shape completion that encodes the global context using semantic class predictions from a 3D shape classifier. They used a patch-based 3D shape synthesis method to refine the predictions, by imposing 3D geometry from shapes retrieved from a prior database. In follow-up work, Dai et al. [337] designed a 3D CNN to process an incomplete 3D scan and predict a complete 3D model with per-voxel semantic labels. The proposed CNN is a fully-convolutional generative network, with filter kernels that are invariant to the overall scene size, enabling processing of large scenes with varying spatial extent. They also developed a coarse-to-fine inference strategy to produce high resolution output. 5) Sparse Volumetric Representation and 3D CNNs: One of the common challenges in dealing with 3D point cloud data is its non-uniform density, where certain regions of the scene may not have any information at all, or where existing information may not always be semantically informative. A uniform voxel grid therefore may not be the best way to represent the data. In this vein, [338] proposed OctNet, a 3D CNN suited to sparse 3D data, enabling efficient representation of high resolution input. Their method hierarchically partitioned the 3D voxel grid using a set of unbalanced octrees [339] where each leaf node stored a pooled feature representation, thus enabling effective allocation of memory and computation to the relevant dense regions. Sparse 3D convolutions have also been explored to avoid redundant computations [340] [341] [342] . [340] extended sparse 2D CNNs used for hand-writing recognition [343] to sparse 3D CNNs, based on the concept of performing convolutions only on the "active" spatial locations that do not differ from their "ground state". [341] presented Vote3Deep: a feature-centric voting mechanism [344] to perform sparse 3D convolutions by only considering non-zero feature locations. The use of ReLU [345] after a sparse convolutional layer also prevented the dilation of non-empty cells, thus achieving greater sparsity than [340] . [342] proposed Submanifold Sparse Convolutional Networks (SSCN) that fix the spatial locations of active sites to prevent the "submanifold dilation problem" [340] , keeping the sparsity unchanged for subsequent layers of the network. [346] proposed learning high dimensional linear filters for sparse feature spaces using permutohedral lattices, and demonstrated learning the kernel parameters for a general bilateral convolution in semantic segmentation. [347] presented the Field Probing Neural Network (FPNN) that employs field probing filters to efficiently extract features from volumetric distance fields, learning the weights and locations of the probing points that compose the filter. 
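To ground the preceding discussion, the following minimal sketch shows how an unordered point cloud is rasterized into the kind of dense occupancy grid consumed by volumetric 3D CNNs, and how sparse such grids typically are, which is the property the octree- and sparse-convolution-based methods above exploit. The synthetic scene and grid resolution are illustrative assumptions:

```python
# A minimal sketch of point cloud voxelization into a binary occupancy grid,
# plus a measurement of the resulting sparsity.
import numpy as np

def voxelize(points, grid_size=32):
    """Map points into a (grid_size)^3 occupancy grid spanning their bounds."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scale = (grid_size - 1) / np.maximum(maxs - mins, 1e-9)
    idx = np.floor((points - mins) * scale).astype(int)
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

rng = np.random.default_rng(4)
# A thin planar patch plus a small cluster: most of the volume stays empty.
plane = np.column_stack([rng.random(2000), rng.random(2000), np.full(2000, 0.5)])
cluster = 0.1 * rng.normal(size=(500, 3)) + np.array([0.8, 0.2, 0.7])
grid = voxelize(np.vstack([plane, cluster]))
occupied = grid.sum() / grid.size
print(f"occupied voxels: {occupied:.1%}")   # typically only a few percent
```

Dense 3D convolutions spend most of their computation on the empty cells of such grids, which is precisely the waste that sparse and hierarchical representations are designed to avoid.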
In order to obtain a semantically informative map, [314] proposed using semantic segmentation to remove redundancy and noise from the parts of dense 3D point clouds attributed to planar surfaces like ground and walls. They developed a variety of point cloud simplification techniques that employed global and local region statistics to selectively decimate 3D points, while mostly preserving those near intra-class edges and discontinuities. 6) 2D Multi-View Representation of 3D: The spatial sparsity within 3D point clouds can lead to decreased spatial resolution and increased memory consumption when using regular voxel grids [333]. To mitigate these issues, an alternative approach processes multiple 2D views generated from the 3D point cloud [348] and then projects them from 2D back to 3D [47], [49]. [348] presented a Multi-View CNN (MVCNN) to learn shape recognition from 2D renderings of a 3D point cloud, demonstrating that even a single 2D view based prediction could outperform 3D CNNs based on volumetric representations [322]. [47] projected the point cloud onto a set of synthetic 2D images which were fed to a 2D CNN for semantic segmentation, and then re-projected the result to the point cloud. Benefiting from the abundance of 2D labeled data and a multi-stream fusion of color, depth and surface normals, their method achieved superior performance to comparable baselines. [49] proposed to sample multiple 2D image views of a point cloud using random and multi-scale sampling strategies. They employed both RGB and depth-based views for 2D pixel-wise semantic labeling, with labels efficiently back-projected via buffering. [349] explored "tangent convolutions" with an emphasis on leveraging 2D surface information from 3D point clouds, by projecting local surface geometry onto a tangent plane around every point to obtain a set of tangent images. Although methods based on multiple 2D views of 3D data are in general more efficient and scalable than voxel grid-based methods, they may not always be applicable for challenging 3D shape recognition tasks due to the loss of information during projection operations [350]. Alternatives in the form of hybrid representations [16] of 3D point clouds have also been considered. Dai et al. [50] proposed an end-to-end trainable 3D CNN to predict per-voxel semantic labels. Their method combines two streams of features: one extracted from per-voxel max pooling of multiple 2D RGB views and the other from 3D geometry, demonstrating a significant performance improvement attributed to this joint 2D-3D learning. 7) Unstructured Points Based Representations: While volumetric representations present the environment in a structured manner, handling point cloud data as a set of unordered and unstructured points has proven to also be a viable approach. [46] presented PointNet, a novel deep network that processes point sets without voxelization or rendering and learns both global and local features. Demonstrated to be efficient on tasks like object classification, part segmentation, and scene semantic parsing (dense semantic mapping), PointNet operates on an unordered set of points in a permutation-invariant manner, unlike methods designed for ordered 2D image data or a 3D volumetric grid, while also being robust to input perturbation and corruption. [351] proposed PointNet++, a hierarchical NN that applies PointNet recursively on a nested partitioning of the input point set, enabling learning of local features with increasing contextual scales.
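The following is a minimal sketch (our simplified example, not the published PointNet architecture) of the shared per-point MLP followed by a symmetric max-pooling aggregation that gives such networks their permutation invariance:

```python
# A minimal sketch of permutation-invariant point-set encoding: a shared MLP
# applied identically to every point, followed by max pooling over the point
# dimension, so permuting the input points leaves the global feature unchanged.
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.point_mlp = nn.Sequential(          # applied identically to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, pts):                      # pts: (batch, num_points, 3)
        per_point = self.point_mlp(pts)          # (batch, num_points, feat_dim)
        global_feat = per_point.max(dim=1).values   # symmetric over points
        return self.classifier(global_feat)

model = TinyPointEncoder()
cloud = torch.rand(1, 1024, 3)
shuffled = cloud[:, torch.randperm(1024), :]
# The same cloud under a random permutation of its points gives the same output.
print(torch.allclose(model(cloud), model(shuffled), atol=1e-6))   # True
```

The symmetric pooling is the key design choice: any operation that is invariant to the ordering of its inputs (max, sum, mean) would serve the same purpose.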
With the ability to capture local structures induced by the metric space of points, PointNet++ demonstrated improvement on semantic segmentation of single-view 3D point clouds. [352] explored direct semantic segmentation of unordered point clouds by extending PointNet [46], particularly enlarging its receptive field over a large 3D scene to incorporate larger-scale spatial context at both the input and output level. Building on point-level representations, significant research has focused on developing learning algorithms based on characteristics like unordered structure, intra-point interaction and transformation-invariance [15], for example, by using point-wise convolutions [353], [354], Recurrent Neural Networks [355], [356], Graph Neural Networks [357], [358] and Autoencoders [359], [360]. 8) Graph- and Tree-Based Representations at Scene Level: In order to better capture the structure of 3D point clouds without requiring voxel grids, researchers have explored the use of graph- [361], [362] and tree-based [350], [363] networks. [361] proposed an Edge-Conditioned Convolution (ECC) to learn from the local neighborhood of the point cloud represented as a graph, where filter weights were dynamically generated for each specific input sample. [362] designed a 3D graph neural network that builds a k-nearest neighbor graph on top of 3D point clouds, where each node in the graph corresponds to a set of points represented by appearance features extracted by a unary CNN from 2D images. This approach was consequently able to leverage both 2D appearance and 3D geometric relationships, and to capture long-range dependencies within images that are typically difficult to model with traditional techniques. Inspired by superpixels [158], [364] introduced SuperPoint Graphs (SPG) where nodes represent simple shapes, and edges encode contextual relationships between object parts, enabling semantic segmentation of large-scale point clouds. [350] proposed Kd-Net for parsing 3D point clouds, using a kd-tree structure [365] to form the computational graph, and hierarchical representations where the root node is recursively computed from the representations of its children. Addressing the lack of overlapping receptive fields in Kd-Net, [366] presented SO-Net based on the Self-Organizing Map (SOM) [367], where the receptive field overlap is controlled by performing point-to-node k-nearest neighbor (kNN) search on the SOM, enabling a better representation of the underlying spatial distribution of point clouds. The increasing ubiquity of 3D sensing [49], [316], [320], the maturity of visual SLAM [197], [368] and SfM techniques [369]-[371], and the evolution of deep learning have together facilitated extensive research on 3D point clouds. This has led to numerous novel approaches, for example, ShapeNet [322], VoxNet [317], OctNet [338], PointNet [46], tangent convolutions [349], SSCN [342] and Kd-Net [350], which have pushed the boundaries of 3D scene understanding. Although capable of processing information at an environment level, beyond what can be achieved through a single image, these methods do not typically consider incremental scene understanding. In the near future, bridging the advances in 3D point cloud processing with the 3D reconstruction capabilities of modern SLAM techniques will significantly benefit robots in semantic scene understanding.
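As a concrete illustration of the symmetric-function idea underlying point-based networks such as PointNet [46], the sketch below applies a shared per-point transformation followed by channel-wise max pooling; the weights are random placeholders rather than trained parameters, and the example is our own simplification rather than the published architecture.

```python
import numpy as np

def pointnet_style_global_feature(points, w1, w2):
    """Order-invariant global feature for an (N, 3) point set.

    Each point passes through the same small MLP (shared weights w1, w2),
    then channel-wise max pooling aggregates the set; permuting the rows
    of `points` leaves the output unchanged.
    """
    h = np.maximum(points @ w1, 0.0)      # shared layer 1 + ReLU, (N, 64)
    h = np.maximum(h @ w2, 0.0)           # shared layer 2 + ReLU, (N, 1024)
    return h.max(axis=0)                  # symmetric (max) pooling -> (1024,)

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
w1, w2 = rng.normal(size=(3, 64)) * 0.1, rng.normal(size=(64, 1024)) * 0.1

g1 = pointnet_style_global_feature(pts, w1, w2)
g2 = pointnet_style_global_feature(pts[rng.permutation(500)], w1, w2)
assert np.allclose(g1, g2)               # invariant to point ordering
```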
The absence of semantics in much of the pivotal early robotic mapping and navigation research was due in part to the fact that significant progress in mapping and navigation could be made without the robot needing a richer understanding of the world around it. Most of the mapping systems in current commercial deployments like mines and ports have little higher-level understanding of the environment they are operating in. This same story also played out to some extent in robotics research focused on enriching the interactions of robots with the world around them, and with the humans who occupy that world. Early work in areas like manipulation and human-robot interaction necessarily focused more on the technical mechanics of the interaction, not least due to the computational and hardware limitations of the time. It is now becoming increasingly evident that further progress will rely on robots moving beyond a primitive understanding of the world around them. Robots will need to understand the "things" in the environment surrounding them and what the "affordances" of those things are, in order to understand what the robot can achieve with those things. In mixing freely with humans in the environment, robots will also need to understand humans; perhaps not initially at a deep cognitive level, but at least in terms of recognizing what humans are doing (their actions) in order to inform the robot's potential interactions with them. The following sections survey the state of robotics research into semantics for enhancing the capability of robots to understand and interact with the world around them. We first discuss the literature in the context of "perception of interaction", which covers work related to understanding the different actions and activities performed by humans, and the use of hands and arms to interact with objects. Then, we discuss "perception for interaction", covering the use of perception to interact with the objects or humans around a robot, including research related to interaction with objects (affordances, manipulation and grasping), performing higher-level tasks, interacting with humans and other robots and, finally, navigation based on vision and language. A. Perception of Interaction 1) Actions and Activities: When interacting with the world, especially with the humans that occupy it, an understanding of actions and activities is essential. In its most basic form, activity recognition can be treated simply as a supervised machine learning problem: present the machine with a short video clip and classify the entire clip into one (or more) of k different categories learned from a large corpus of labeled examples [372]-[376]. Moving beyond classification, localization of specific activities within a long video sequence involves identifying both the temporal and spatial location of each activity, often in so-called "action tubes" [377]. This approach to activity recognition and detection is typified by the various tasks in the annual ActivityNet Challenge and its associated dataset of human activities [378]. Many of the works in this area consider activity recognition from the perspective of a third-person observer (a notable exception being the EPIC-Kitchens egocentric activity classification dataset [379]) and learn black-box models that segment and label activities. In some cases the temporal aspect of an action is modeled through dense trajectory features [380], which track the movement of keypoints through time.
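The basic supervised formulation described above can be made explicit with a toy sketch: each clip is reduced to a single feature vector and assigned to the closest class prototype. The features, the nearest-centroid classifier and all dimensions below are illustrative assumptions, not a method from the cited works.

```python
import numpy as np

def clip_descriptor(frame_features):
    """Collapse a (T, D) sequence of per-frame features into one clip vector
    by temporal average pooling (ignoring frame order, the simplest baseline)."""
    return frame_features.mean(axis=0)

def train_nearest_centroid(clips, labels, num_classes):
    """Fit one centroid per activity class from labeled training clips."""
    descs = np.stack([clip_descriptor(c) for c in clips])
    return np.stack([descs[labels == k].mean(axis=0) for k in range(num_classes)])

def classify(clip, centroids):
    """Assign the clip to the activity whose centroid is closest."""
    d = clip_descriptor(clip)
    return int(np.argmin(np.linalg.norm(centroids - d, axis=1)))

# toy data: 20 training clips, 30 frames each, D=128 features, k=4 activities
rng = np.random.default_rng(1)
train = [rng.normal(loc=l, size=(30, 128)) for l in range(4) for _ in range(5)]
y = np.repeat(np.arange(4), 5)
centroids = train_nearest_centroid(train, y, num_classes=4)
print(classify(rng.normal(loc=2, size=(30, 128)), centroids))  # likely class 2
```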
A sequence of atomic action units or "Actoms" [381] can also be used to represent a semantically-consistent action sequence, as opposed to learning its discriminative parts. [382] show that a complex activity composed of several short actions is better represented and compared using a hierarchy of motion decomposition. The two-stream approach [372] separates appearance and motion (optical flow). A rank-pooling method developed by Fernando and colleagues [383]-[385] encodes frame order as a function of the frame's appearance. An efficient first-order approximation of rank-pooling, known as Dynamic Images [386], has been applied successfully in many deep learning approaches for activity classification. However, even Carreira and Zisserman [376], who ask the question "Quo Vadis, Action Recognition?" (that is, "where are you going, action recognition?"), do not fully consider the more fine-grained understanding of human actions (e.g. different types of swimming) likely necessary for robotic applications. Robots are likely to require not just classification of the activity being performed, but also localization of both the person performing the activity and the object being interacted with (if any). One of the most important pieces of semantic information for understanding activities is the pose of the human performing the activity and how that pose changes over time. The work of Ramanan and Forsyth [387] is an early example of research proposing to use human pose and motion to classify activities. In their model body parts are tracked over time, with sequences annotated using a hidden Markov model. In more recent work, Wang et al. [388] improve the estimation of human pose by first building dictionaries of spatial and temporal part sets and then using a kernel SVM to classify actions. Luvizon et al. [389] solve human pose estimation and activity recognition jointly and train a single deep learning model to perform both tasks, with pose and appearance features combining to predict actions. In the context of robotic vision, Ramirez-Amaro et al. [31] present a recent survey of the most representative approaches using semantic descriptions for the recognition of human activities, with the intention of subsequent execution by robots. In such a context it is advantageous to characterize human movement through multiple levels of abstraction (in this particular case four), from high-level processes and tasks to low-level activities and primitive actions. Given this hierarchy and a "semantic" representation of a human-demonstrated task, the ultimate goal is for a robot to replicate the task, or assist the human in achieving the intended outcome. Similar to the work of Ramanan and Forsyth [387], Park and Aggarwal [390] propose to recognize human actions and interactions in video by modeling the evolution of human pose over time. Their framework uses a three-level abstraction where the poses of individual body parts (e.g. head and torso) are identified and linked at the lowest level using a Bayesian network; actions of a single person are modeled at the mid-level using a dynamic Bayesian network; and alignment of multiple dynamic Bayesian networks in time allows inference of interactions at the highest level. The result is a representation that can be translated into a meaningful semantic description in terms of subject, verb and object. Reducing the reliance on large quantities of annotated training data, Cheng et al.
[391] developed a zero-shot learning framework that also considers a three-level hierarchy, taking into account low-level features, mid-level semantic attributes and high-level activities mapped through a temporal sequence. The result is a model that can recognize previously unseen human activities. Cheng and his colleagues extended this work by proposing a two-layer active learning algorithm for recognition of unseen activities, using a semantic attributes-based representation. In this case an attribute is a human-readable term describing an inherent characteristic of an activity [392]. 2) Hands and Arms: Within the broader topic of general activity recognition, hands and arms play a key role in activities where humans are involved. [393] proposed forming a semantic representation of human activities by extracting semantic rules based on three hand motions: move, not move, and tool use, and two object properties: ObjectActedOn and ObjectInHand. Their system is adaptable to new activities, which can be learned on demand. [394] extended the work in [393] and explored the use of additional 3D information from virtual reality to understand and execute the demonstrated activities. They integrated their method on an iCub humanoid robot. [395] also developed a method to infer human-coordinated activities (like the use of two hands) that was invariant to different execution styles of the same activity (e.g. left-handed vs right-handed). They used a three-level approach: extract relevant information from observations, infer the observed activity, and trigger the motion primitives to execute the task. [396] explored an unsupervised learning method based on Independent Subspace Analysis (ISA) to extract invariant spatio-temporal features directly from unlabeled video data. A second stage automatically generated semantic rules, enabling high-level reasoning about human activities that resulted in improved performance. [397] proposed semantic representations of human activity, where semantic rules based on relationships between human motions and object properties are first learned for basic and complex activities, and then used to infer the activity. They used the Web Ontology Language (OWL) and KnowRob for incorporating knowledge/ontology. [398] extended the semantic reasoning method from [393] by including gaze data (in addition to the third-person view) to segment and infer human behaviors. They demonstrated the complementary nature of first- and third-person views (using egocentric and external cameras respectively), enabling the system to deal with occlusions. Finally, [399] presented a novel learning-by-demonstration method that enables non-expert operators to program new tasks on industrial robots. The proposed semantic representations were found to be invariant to different demonstration styles of the same activity. [400] presented a new representation for 3D reconstructed trajectories of human interactions, referred to as a "3D Semantic Map": a probability distribution over semantic labels observed from multiple views. They use the semantics of body parts, for example the head, torso, legs and arms, and of other objects in the vicinity, such as a dog or a ball. Spatio-temporal semantic labels are inferred by a graph-cut formulation based on multiple 2D labels. The authors claim "This paper takes the first bold step towards establishing a computational basis for understanding 3D semantics (of human interaction) at fine scale." Chang et al.
[401] , [402] took a complementary approach by finding structural correspondences of objects that have similar topologies or motions. For example, their method is able to match corresponding body parts (including arms and hands) of a humanoid robot with that of a human from a video sequence without any prior knowledge of the body structures. [403] discussed the availability of large amounts of visual data from wearable devices and existing methods used to characterize everyday activities for visual life-logging, for example, semantic annotations of visual concepts and recognition and visualization of activities. Motivations in this area of research include its use for behavioral analysis and in assistive living scenarios. B. Perception for Interaction 1) Object Affordances: The term "affordances" was coined by the psychologist Gibson [404] to explain how inherent values of things in the environment can be perceived and how this information can be linked to the potential actions of an agent. In one of the early applications of the concept in robotics, [405] presents a robot that learns the roll-ability of four different objects by "playing" with them and observing the changes in the environment; however the generalization of this affordance to novel objects and affordances is not studied. In order to formalize the concept of affordances in the context of autonomous robotics, [406] reviews the usage of affordances in other fields, and proposes a framework based on relation instances of the form (effect, (entity, behavior)) which are acquired through the interaction of the agent with its environment. [407] implements this framework for a mobile robot, enabling goal-oriented navigation by learning the affordances of its environment through the effects of primitive behaviours such as "traverse" or "approach". Performing complex manipulation actions requires the encoding of temporal dependencies in addition to an understanding of affordances. Closely related to the formalization of affordances above, the Object-Action Complexes (OACs) concept [408] attaches the performed actions to the objects as attributes and provides grounded abstractions for sensory-motor processes. By defining affordances as state transition functions, OACs allow for forward prediction and planning. Several researchers have studied deriving such attributes from visual information [409] [410] [411] [412] . The Semantic Event Chain (SEC) framework developed by [409] analyzes the sequence of changes of the spatial relations between the objects that are being manipulated by a human or a robot. This approach generates a transition matrix which encodes topological variations in a sequence graph of image segments during the manipulation event, allowing transfer of manipulation actions from human demonstration to a robotic arm. [410] designed an architecture for generic definition of a robot's human-like manipulation actions. Manipulations are defined using 3 levels: a top level, which abstractly defines objects and their relations and actions; a mid level, which defines chaining of action primitives via SEC; and a bottom level that defines sensory data collection and communication with the robot control system. Using these representations, a number of approaches have been presented for autonomously learning affordances in the context of manipulation. 
[411] proposes online incremental learning of an archetypal Semantic Event Chain model for each manipulation action by observation, without requiring any prior knowledge about actions or objects. Similarly, [412] leverages information from the cloud, learning manipulation action plans from cooking videos fetched from the web. Lower-level perception modules, grasping type recognition and object recognition are combined with a probabilistic manipulation action model at the higher level. [413] studies pre-grasp sliding actions to facilitate grasping, for instance sliding a book to the table edge so that it can be grasped. In this work, objects are divided into categories, such as box or cylinder, based on sub-symbolic object features including shape and weight. A sliding action is simulated before actual execution, conditioned on the object class. Afrob [414] provides a database for affordances, targeted at domestic robot scenarios, consisting of 200 object categories and structural, material and grasp affordances. The following subsections discuss more concrete applications of affordances and semantics to robotic tasks, ranging from low-level concepts, e.g. item "graspability", to high-level concepts such as understanding language and social behaviour. 2) Grasping and Manipulation: Robotic grasping is concerned with the problem of planning stable grasps for an object using a robotic gripper. Early approaches to this problem were largely analytical, focusing on providing a theoretical framework based on mathematical and physical models of contact points between the manipulator and object, used to quantify the stability and robustness of grasps [415], [416]. However, the reliance of such methods on precise physical and geometrical models makes them brittle in the presence of the noise and uncertainty found in robotic systems. As such, many modern approaches to robotic grasping are data-driven, using generalizable models and experience-based approaches to improve calculation times and robustness to errors, and to enable generalisation to novel objects [417]. A common approach in data-driven grasp synthesis is to use semantic information to transfer grasping knowledge to "known" or "familiar" objects. A number of works rely on precise object recognition and pose estimation to apply pre-computed grasps from a known object model to an instance of that model in reality [418]-[424]. More general methods work with familiar objects, i.e., those which are semantically or geometrically similar to objects previously seen, and are able to generalise grasps across instances of an object's class (for example mugs) [425], [426] or object parts [427], [428] (for example handles [429]). Song et al. [430] present a hybrid grasp detection pipeline for familiar objects that fuses local grasp estimates based on depth data with global information from a continuous category-level pose estimation process, which is able to overcome the limitations of both approaches individually. Rather than rely on object models, Manuelli et al. [431] define grasping and manipulation actions by learning semantic 3D keypoints from visual inputs that generalise across novel instances of an object class, making them more robust to intra-category shape variation.
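For the model-based grasp transfer described above (pre-computed grasps on known object models, e.g. [418]-[424]), the core geometric operation is simply composing the estimated object pose with a grasp pose stored in the object frame. The sketch below uses placeholder 4x4 homogeneous transforms and deliberately omits grasp selection, collision checking and motion planning.

```python
import numpy as np

def apply_stored_grasp(T_world_object, T_object_grasp):
    """Map a grasp pose stored in the object's model frame into the world frame,
    given an estimated 4x4 object pose."""
    return T_world_object @ T_object_grasp

# estimated object pose: rotated 90 degrees about z and translated
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
T_world_object = np.array([[c, -s, 0, 0.4],
                           [s,  c, 0, 0.1],
                           [0,  0, 1, 0.8],
                           [0,  0, 0, 1.0]])
# pre-computed grasp: 5 cm above the object origin, aligned with the object frame
T_object_grasp = np.eye(4)
T_object_grasp[2, 3] = 0.05

T_world_grasp = apply_stored_grasp(T_world_object, T_object_grasp)
print(T_world_grasp[:3, 3])   # grasp position in world coordinates
```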
More recently, data-driven approaches backed by deep learning have proven very effective at predicting robotic grasps on arbitrary objects from visual information [432]-[436], seemingly removing the immediate need for semantic information in the grasping pipeline, at least in terms of grasping alone. However, in many robotics applications, the ability to grasp and transport arbitrary objects in a stable manner is not enough in itself, but rather a prerequisite for higher-level manipulation tasks where the choice of grasp is dependent on the goal [437]. The representation of semantics in robotic grasping may take many different forms and is largely informed by the task to be completed. In the simplest case, the semantics of a manipulation task may be defined at an object recognition level, where a grasp detection algorithm is combined with an object detection or semantic segmentation system (Section II). This is a common choice in pick-and-place style tasks, such as warehouse fulfilment, where the requirement for semantics is limited to object identification. A common successful approach in the Amazon Robotics Challenge was to use an object-agnostic grasp detection algorithm in parallel with a standalone semantic segmentation system [438]-[440]. Rather than use two separate streams, other approaches use joint representations for object-specific grasp detection. Guo et al. [441] train a CNN to recognise the most exposed object in a cluttered scene while simultaneously regressing the best grasp pose for the detected object. This joint model outperformed a system comprising two models trained individually on object and grasp detection. [442] improve on this idea by using a two-stream attention-based network that learns object detection, classification, and grasp planning end-to-end. The two streams, ventral and dorsal, perform object recognition and geometric-relationship interpretation respectively, and outperform a single-stream model on the same task. To grasp an object, humans rely not only on spatial information, but also on knowledge about the object, such as its expected mass and rigidity, and on the constraints of the task to be performed [443]; for example, grasping a knife by the handle correctly to perform a cut, or not blocking the opening of a mug in order to pour water. In the same way, to perform higher-level tasks robots must also understand the possible functions ("affordances") of objects and how they relate to the constraints of the tasks to be performed. To this end, a number of research works focus on encoding semantic constraints within a robotic grasping pipeline. Affordance prediction can also be performed as a standalone task, as a precursor to a separate grasp planning pipeline, relying largely on visual cues. Myers et al. [444] present an approach inferring the affordances of objects based on geometric information and visual cues. By combining hand-coded features with traditional machine learning methods they were able to predict object part affordances such as cut, contain, pound and grasp in RGB-D images of common kitchen and workshop tools. Nguyen et al. [445] extended this approach by using a CNN to predict pixel-wise affordances, removing the need for hand-designed features. Kokic et al. [446] use a CNN to detect affordances on arbitrary 3D objects based on geometric rules, for example, that thin, flat sections of objects afford (provide) support.
The affordance detection is combined with a separate object classification and orientation stream to plan task-specific grasps. Song et al. [447], [448] present a probabilistic framework for encoding the relationships between features for task-specific grasping using Bayesian networks. Their model jointly encodes features of the target object, such as category and shape, task constraints and grasp parameters, hence modelling the dependencies between relevant features. However, the approach also relies on hand-crafted features for representing objects and task constraints, making it difficult to scale to additional tasks and objects. Hjelm et al. [449] use metric learning to automatically learn the relevant visual features for a given task from human demonstration. Along the same lines, Dang and Allen [425] present "semantic affordance maps" that link local object features (a depth image), along with tactile and kinematic data, to semantic task constraints for an object. Given an object of a known class and a task, the optimal grasp approach vector is computed from the semantic affordance map, followed by a grasp planning step along the computed approach vector. The above methods all employ some level of category-specific encoding, meaning that they will not generalise to new object classes. In contrast, Ardón et al. [450] proposed employing semantic information from the surrounding environment of a target object to improve the inference of affordances. They used a weighted graph-based knowledge base to model attributes based on shape, texture and environment, such as locations or scenarios in which the object is likely to appear. Their method does not require a priori knowledge of the shape or the grasping points. Detry et al. [451] present another approach to grasping previously-unseen objects in a way that is compatible with a given task. They use CNN-based semantics to classify image regions that are compatible and incompatible with a task, in combination with a geometric grasp model. For training on tasks like "pouring", task constraints like "grasp away from opening" are used to define the suitable and unsuitable labels for vertices of object mesh models. Task success is assessed both "semantically", based on whether the grasp is compatible with the task, and "mechanically", by having the robot attempt to lift the object off the table. Combining semantic information with robotic grasping is likely a crucial step towards equipping robots with the ability to autonomously carry out tasks in unstructured and dynamic environments. While the above approaches all achieve impressive results, the largest outstanding questions still revolve around accurately and efficiently transferring semantic knowledge to new objects and tasks. The following section looks at approaches that go beyond grasping and make use of semantic information for higher-level task completion. 3) Higher-level Goals and Decision Making: Semantic maps and information can be exploited to facilitate higher-level goal definitions and planning, for humanoids [287], [292] and domestic robots [292], [452]-[454]. These approaches often use RGB-D data to extract semantic information. For humanoids, [287] used RGB-D data to build representations of the environment from geometric primitives like planes, cylinders and spheres. This information is used to extract the affordances of objects, such as their "pushability" and "liftability", as well as full-body capabilities such as support, lean, grasp and hold.
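A minimal rule-based sketch of the primitive-to-affordance mapping described for [287] is given below; the thresholds, field names and labels are illustrative assumptions rather than values from that work.

```python
def primitive_affordances(primitive):
    """Map a fitted geometric primitive to candidate affordance labels.

    `primitive` is a dict such as
      {"type": "plane", "normal_z": 0.98, "area": 0.6, "height": 0.75}
    or
      {"type": "cylinder", "radius": 0.03, "length": 0.15}.
    All thresholds below are illustrative only.
    """
    labels = []
    if primitive["type"] == "plane":
        horizontal = abs(primitive["normal_z"]) > 0.9
        if horizontal and primitive["area"] > 0.3:
            labels.append("support")          # large horizontal surface
        if horizontal and 0.6 < primitive["height"] < 1.1:
            labels.append("lean")             # table-height surface to lean on
    elif primitive["type"] == "cylinder":
        if primitive["radius"] < 0.05 and primitive["length"] < 0.3:
            labels.append("grasp")            # handle-sized cylinder
        else:
            labels.append("push")             # larger cylindrical body
    return labels

print(primitive_affordances({"type": "plane", "normal_z": 0.99,
                             "area": 0.5, "height": 0.74}))   # ['support', 'lean']
```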
[290] extends the work in [287] with semantic object information, such as chair or window, and uses spatio-temporal fusion of multiple geometric primitives (using a spatial scene graph representation) to identify higher-level semantic structures in the scene. For object manipulation, [292] parses a given goal scene as an axiomatic scene graph composed of object poses and inter-object relations. A novel scene perception method is proposed to infer the initial and goal states of the world. [452] utilizes various segmentation methods for generating initial hypotheses for furniture drawers and doors, which are later validated with physical interaction. Given a task, this approach is able to answer queries such as "Which horizontal surfaces are likely to be counters?" by processing knowledge built during mapping. [453] presents an interactive labeling system, where human users can manually label semantic features, such as walls and tables, and static objects, such as door signs. These labels can then be used to define navigation goals such as "go to the kitchen", or for specialized behaviour such as passing through doors. [454] presents a goal-inferring approach that detects deviations from operational norms using semantic knowledge. For instance, if the robot "knows" that perishable items must be kept in a refrigerator, and observes a bottle of milk on a table, the robot will generate the goal to bring that bottle to a refrigerator. Robotic tool use for manipulation brings the topics of object affordances, grasping and task-level decision making together. [455] studies tool affordances through self-exploration, based on their geometry (how the tool is grasped). In particular, they study the dragging action, parameterized by the angle of the drag, with the effect measured as the object displacement. In more recent work, Fang et al. [456] address task-oriented grasping from the point of view of tool usage, where a robot must grasp an object in a way which allows it to complete a task. For example, the best way to grasp a pair of scissors differs depending on the task: transporting the scissors may require a different, safer grasp than using them to cut a piece of paper. Their approach uses self-supervised learning in simulation to jointly learn grasping and manipulation policies that are then transferred to novel tools in the real world. 4) Semantics for Human-Robot and Robot-Robot Interaction: Semantics play a crucial role in human-robot interaction, in particular for collaborative task execution that goes beyond the human-command-robot-execute model. One example is inverse semantics, introduced by Tellex et al. [457], which enables a robot to generate natural language help requests (for an assembly task) by emulating the ability of humans to interpret such requests. Inverse semantics involves using a Generalized Grounding Graph (G3) framework [458] for the task, together with models based on both the environment and the listener. Gong and Zhang [459] proposed temporal spatial inverse semantics to enable a robot to communicate with humans by extending natural language sentence structures to refer to previous states of the environment, for example, "Please pick up the cup beside the oven that was on the dining table". They used the G3 framework for mapping between natural language and groundings in an environment, and evaluated their method through randomly generated scenarios in simulation.
Semantics have also been used to program a robot to perform a task using natural language sentences. Pomarlan et al. [460] developed a system that converts a linguistic semantic specification (like an instruction to a robot: "set a table (re-arranging four cups)") into an executable robot program, using interpretation rules. The resulting robot program or plan can be run in simulation and enables further inference about the plan. Savage et al. [461] developed a semantic reasoning module for a service robot that interprets natural language by extracting semantic role structures from the input sentence and matching them with known interpretation patterns. Another avenue for semantics in human-robot interaction is interactive robot learning from demonstration. Niekum et al. [462] proposed an approach that uses demonstrations to learn different behaviors, discovering semantically grounded primitives and incrementally building a finite-state representation of a task. Recently, semantics powered by deep learning have been used to enable a robot to understand salient events in a human-provided demonstration [463], or to perform unsupervised learning of the robot's user type from joint-action demonstrations in order to predict human actions and execute anticipatory actions [464]. Mapping with a human in the loop is another example where semantics and semantic mapping find application. Hemachandra et al. [465] presented a semantic mapping algorithm, Semantic Graph (a factor graph approach), that fuses information from natural language descriptions with low-level metric and appearance data, forming a hybrid map (metric, topological, and semantic). [466] proposed a mapless navigation system based on semantic priors, using reinforcement learning on joint embeddings created from image features, word embeddings, and a spatial and knowledge graph (using a graph convolutional network) to predict actions. In the context of autonomous navigation and motion planning, one of the desired capabilities of a robot is to actively understand its environment and act accordingly. Beyond the use of active SLAM techniques [467]-[469] to accurately explore a partially-known environment, a robot also needs to adhere to human-like "social behavior" for an enhanced human-robot interaction experience. Semantic reasoning potentially enables robot social awareness when planning motion around humans [470], [471], allowing socially competent navigation in shared public places. This is typically achieved through crowd flow modeling in different social scenarios and contexts [472]. Robot interaction can occur not only with humans but also with other robots. Schulz et al. [473] developed a system that enables a pair of robots to interact by autonomously developing a sophisticated language to negotiate spatial tasks. This is achieved through language games that enable the robots to develop a shared lexicon to refer to places, distances, and directions based on their cognitive maps, and was demonstrated on real robots. Another example of robot-robot interaction is the computational model of perspective-taking introduced by Fischer and Demiris [474]. There, one robot assumes the viewpoint of another robot to understand what that other robot can perceive from that point of view. Such a perspective-taking ability has previously been shown to be advantageous in human-robot interactions [475], [476].
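The perspective-taking idea of [474] can be illustrated with a small geometric check: given an estimate of the other robot's pose and a nominal sensing model, test whether a world point observed by one robot would also be visible to the other. The conical field-of-view model, poses and thresholds below are placeholder assumptions rather than the published model.

```python
import numpy as np

def visible_to_other_robot(p_world, T_world_other, fov_deg=90.0, max_range=5.0):
    """Return True if the 3D point p_world falls inside a simple conical
    field-of-view model of the other robot's forward-facing camera."""
    T_other_world = np.linalg.inv(T_world_other)
    p = T_other_world @ np.append(p_world, 1.0)   # point in the other robot's frame
    x = p[0]                                      # assume the camera looks along +x
    dist = np.linalg.norm(p[:3])
    if dist > max_range or x <= 0:
        return False
    angle = np.degrees(np.arccos(x / dist))       # angle off the optical axis
    return angle <= fov_deg / 2.0

# other robot at (2, 0, 0), facing along the world x-axis
T_world_other = np.eye(4)
T_world_other[0, 3] = 2.0
print(visible_to_other_robot(np.array([4.0, 0.5, 0.0]), T_world_other))  # True
print(visible_to_other_robot(np.array([0.0, 0.0, 0.0]), T_world_other))  # False (behind)
```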
5) Vision-and-Language Navigation: Vision-and-language navigation (VLN) is a task in which an agent attempts to navigate to a goal location in a 3D simulator when given detailed natural language instructions. Anderson et al. presented the first VLN task and dataset in [477]. Numerous methods have been proposed to address the VLN problem. Most of them employ a CNN-LSTM architecture with an attention mechanism to first encode instructions and then decode the embedding into a series of actions. Alongside proposing the VLN task, Anderson et al. [477] developed the teacher-forcing and student-forcing training strategies. The former equips the network with a basic ability to navigate using the ground truth action at each step, while the latter mimics test scenarios, where the agent may predict wrong actions, by sampling actions instead of using ground truth actions. The trend of deep learning based methods benefiting from more training data also holds in vision-and-language navigation. To generate more training data, Fried et al. [478] developed a speaker model to synthesize new instructions for randomly selected robot trajectories. Other methods of increasing the available training data have also been proposed: Tan et al. [479] augment the training data by removing objects from the original training data to generate new unseen environments. To further enhance generalization, they also train the model using both imitation learning and reinforcement learning, so as to take advantage of both off-policy and on-policy optimization. Predicting future actions is a key component of many approaches, but not all chunks of an instruction are useful for predicting the next action. Ma et al. [480] use a progress monitor to locate the completed sub-instructions and to focus on those which have the most utility in predicting the next action. Another way to improve navigation success is to equip the agent with backtracking. In [481], Ma et al. propose treating the navigation process as a graph search problem and predict whether to move forward or roll back to a previous viewpoint. In [482], each visited viewpoint is treated as a goal candidate, and the final decision is the viewpoint with the highest local and global matching score. Most recently, Qi et al. [483] presented a dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images. Given an instruction, success requires navigating through a previously-unseen environment to identify an object. This represents a practical challenge, but one that closely reflects one of the core visual problems in robotics. Semantic representations extracted through static or dynamic scene understanding processes can be leveraged by robots to improve their performance at tasks, whether a basic task such as localization or a higher-level function like predicting the intentions of other drivers on the road. Beyond the foundational semantic understanding research, task- or application-specific semantic representations are often required to achieve specific goals. In this section, we discuss various ways in which researchers extract or employ semantic representations for localization and visual place recognition, for dealing with challenging environmental conditions, and for the additional considerations required to enable semantics in a robotics context. A major sub-component of the navigation problem is localization: assuming a map exists, how can a robot calculate its location within that map?
This process occurs in a number of different ways and has differing, overlapping terminology. Localization often refers to calculating the robot's location in some form of metric map space, while place recognition often refers to recognizing a particular sensory snapshot or encoding of a distinct location, without necessarily any use or calculation of metric pose information. In a SLAM system performing mapping, the act of recognizing a familiar location is called "closing the loop" or "loop closure". Depending on the sensory modality used, the terminology is often prefixed with the appropriate term, for example, visual place recognition. In this section we survey the increasing use of semantics for localization and place recognition systems, with a focus on vision-based techniques given their natural affinity for semantics-based approaches. Visual Place Recognition (VPR) is integral to vision-based mobile robot localization. Given a reference map of the environment comprising unique "places", the task is to recognize the currently observed place (query) and decide whether or not the robot has seen this place before. VPR is a widely researched topic in robotics, as reviewed in a recent survey [231]. Distinct from place classification or categorization, which produce place-level semantic labels representing the general appearance of the scene, here we discuss place recognition, which produces a localization cue by identifying a specific place. In particular, we review VPR methods that use semantic information at the scene, object, patch, edge or pixel level for effective localization. 1) Object-level Semantics for VPR: Semantic information in the form of object detection is often used for place recognition and localization, as well as in full SLAM systems (see Subsection III-C). [484] developed a hierarchical graphical model that enabled simultaneous object and place recognition using bidirectional interaction between objects and places. Similarly, [485] presented a hierarchical random field model based on SIFT [72] and GIST [486] using relative pose context among point features, objects and places. [487] explored the use of a semantically-labeled prior map of landmarks based on object recognition [488] for localizing a robot. They developed a sensor model to encode semantic observations with a unified treatment of missed detections, false alarms and data association, and demonstrated it functioning on simulated and real-world indoor and outdoor data. [489] proposed "semantic signatures" as image descriptors comprising the type and angle sequences of objects visible from a given spatial location. They used trees, street lights and bus stops as objects for this task. With further advances in the field of object recognition, methods like Faster-RCNN [80] have been adopted for visual localization. Recently, [490] developed a hierarchical localization pipeline based on a semantic database of objects detected using [80] and SURF [84] descriptors. In particular, coarse localization estimates based on object matching were refined using keypoint-level SURF matching to obtain the final result. [491] presented a novel place representation based on a graph where nodes were formed by holistic image descriptors and semantic landmarks derived from [80]; the edges in the graph indicated whether or not the nodes represented the same landmark/place; finally, the learnt graphical embeddings were used for visual localization.
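As a toy illustration of object-level semantics for place recognition (not the exact formulation of [489] or [490]), a place can be summarised by a normalised histogram over detected object classes and matched against a reference map by descriptor similarity; the class list and detections below are hypothetical.

```python
import numpy as np

CLASSES = ["tree", "street_light", "bus_stop", "traffic_sign", "bin"]

def object_histogram(detections):
    """Summarise a place by counts of detected object classes.
    `detections` is a list of class-name strings from any object detector."""
    h = np.zeros(len(CLASSES))
    for name in detections:
        if name in CLASSES:
            h[CLASSES.index(name)] += 1
    return h / max(h.sum(), 1.0)            # normalise to a distribution

def match_place(query, reference_map):
    """Return the index of the reference place with the most similar histogram."""
    q = object_histogram(query)
    sims = []
    for ref in reference_map:
        r = object_histogram(ref)
        sims.append(q @ r / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-9))
    return int(np.argmax(sims))

reference_map = [["tree", "tree", "bin"], ["bus_stop", "street_light"],
                 ["traffic_sign", "tree", "street_light"]]
print(match_place(["street_light", "bus_stop"], reference_map))   # -> 1
```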
Another source of semantic information in urban environments is Geographical Information System (GIS) databases, which provide semantic and pose information for various objects like traffic signs, traffic signals, street lights, bus stops and fire hydrants. [492] extracted and combined GIS data with object semantics to improve both object detection (bounding box) and geospatial localization. They tackled the inherent noise within GIS information by formulating a higher-order graph matching problem, solved robustly using RANSAC. Coarse localization is enabled by searching over a dense grid of locations for high similarity between GIS objects and objects detected within the query image. Similar to [492], [493] explored the use of Deformable Part Models (DPM) [488] along with GIS information for localization, but based on "cross-view" matching, where a query image from street level is matched against an aerial top-view GIS semantic map. They proposed a Semantic Segment Layout (SSL) descriptor to validate the spatial layout of semantic segments detected in a query image against the GIS map. In order to localize precisely, certain semantic categories in the scene can be more useful than others depending on the underlying application scenario. [494] presented a lane markings-based localization system, where a prior map is derived from aerial images and map matching is performed using ICP. Aimed at assisted and autonomous driving, [192] explored the use of highly accurate maps comprising visible lane markings and curbs for precise and robust localization. They employed a Kalman Filter [495] for map matching using stereo cameras and an IMU to achieve centimeter-level precision. [496] proposed using segmented building facades to estimate camera pose, given a 2.5D map of the environment comprising building outlines and their heights. [497] developed an approach for geo-locating a novel view of a scene and determining camera orientation using a 2D map of buildings along with a sparse set of geo-tagged reference views. They detected and identified building facades, along with their geometric layout, in order to localize a query image. 2) Beyond Object-Level Semantics: Semantic segmentation of visual information can be leveraged not only at the object or pixel level, but at the edge level as well. [498] and [499] segmented the sky in infrared images, using it as a unique signature for localization. The ultraviolet spectrum has also been explored for the task of sky segmentation based navigation [500] and tilt-invariant place recognition [501]. [502] explored an automatic skyline segmentation approach for an upward-facing omnidirectional color camera. These omni-skylines were observed to be unique to a specific location in a city, particularly in GPS-challenged urban canyons. [503] developed a skyline extraction and representation method for localizing in mountainous terrain. [504] proposed a "sky blackening" method to remove the sky region from images in order to improve place recognition performance across day and night cycles. Beyond sky-based representations, variations in weather conditions can also be used to localize a static camera, by analyzing the temporal variations in the appearance of the viewed scene [505]. Recently, [506] presented an image descriptor based on the VLAD aggregation [507] of semantic edge features [508] defined between distinct semantic categories like building-sky and road-sidewalk.
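For reference, the VLAD aggregation [507] used by [506] can be written compactly: each local descriptor is assigned to its nearest visual word and the residuals are accumulated per word before normalisation. The sketch below is a generic VLAD implementation over placeholder descriptors, not the specific semantic edge features of [506].

```python
import numpy as np

def vlad(descriptors, centers):
    """Aggregate local descriptors (N, D) against K visual words (K, D) into a
    single vector of length K*D: sum of residuals per nearest word, followed by
    power- and L2-normalisation."""
    K, D = centers.shape
    # hard-assign each descriptor to its nearest cluster center
    assign = np.argmin(((descriptors[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)    # residual accumulation
    v = v.flatten()
    v = np.sign(v) * np.sqrt(np.abs(v))                  # power normalisation
    return v / (np.linalg.norm(v) + 1e-12)               # global L2 normalisation

rng = np.random.default_rng(2)
local_desc = rng.normal(size=(200, 64))                  # e.g. local edge features
words = rng.normal(size=(16, 64))                        # visual vocabulary (k-means)
image_descriptor = vlad(local_desc, words)               # shape (1024,)
```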
[509] developed an image descriptor based on histograms of semantically-labeled "superpixels" [510], [511] within an image grid, and demonstrated its utility for geo-localization, semantic concept discovery, and road intersection recognition. 3) Life-long Operation of VPR Systems: Appearance variability of places can occur at vastly different time scales. At one extreme, time of day, weather fluctuations and even momentary lighting changes contribute towards intra-day appearance changes. Longer-term factors such as seasons, vegetation growth, and climate change also lead to appearance variability. Human activities are a significant additional contributor to appearance variation, including construction work, general wear and tear, and the updating of signage/billboards/façades (including increased digital signage), as well as abrupt changes to traffic flow. Compared to natural causes of appearance variability, the changes caused by human activities can be far less "cyclical" or predictable. The fact that the appearance of places can change indefinitely and unpredictably imposes a requirement that localization and VPR systems remain accurate and robust under those continuous changes, throughout the lifetime of the system. Compared to the majority of research in VPR, which has so far considered static datasets (though with significant appearance variations within the datasets), lifelong VPR is an under-investigated area. To handle continuous appearance changes, some approaches attempt to extract the "semantic essence" of a place that is independent of appearance, and to transfer the appearance of the place seen in a particular condition to unseen conditions [512]-[514]. To date, such methods have primarily been demonstrated on natural variations such as time of day and seasons. Another paradigm continuously accumulates data to refine the system [515]-[517]. The practicality of such approaches depends on the ability to continuously collect data from the target environment at high frequency (e.g. weekly), which can be achieved using deliberate recording schemes (e.g. mapping vehicles [518], taxi fleets [519], [520]) or opportunistic and crowd-sourced schemes such as user-uploaded videos [521] or webcams [522]. The fundamental challenge arises from the perpetually growing database, which demands a VPR algorithm that is scalable; the computational effort for inference and continuous refinement of the VPR system must grow slowly or not at all with the database size. B. Semantics to Deal with Challenging Environmental Conditions 1) Addressing Extreme Variations in Appearance and Viewpoint: Semantic cues have been demonstrated to be of high utility for place recognition under challenging scenarios where appearance- and geometry-only features fall short, for example, dealing with extreme appearance variations across repeated traverses of the environment [523]. These non-uniform appearance variations are often caused by weather, time of day and seasonal cycles. [524] presented an appearance-invariant localization system based on a 3D model of semantically-labeled points and curves, which were projected onto a single query image to minimize error for pose estimation. [525] proposed a temporal place segmentation approach based on semantic place categories [58] to improve VPR [526] under significant appearance variations occurring both within and across traverses.
[527] developed a method to learn salient image descriptions based on regions of the image which are geometrically stable over large time gaps (across seasons and weather changes). They combined this representation with an off-the-shelf holistic representation to obtain a robust descriptor. Cross-view matching, typically referring to the problem of matching images across highly-varied viewpoints like aerial top-view versus ground front-view [200], [493], [528], is a challenging task due to extreme variations in visual appearance and geometry between the matching pair, often requiring additional information to enable effective localization. [528] explored a cross-view geo-localization approach that used SVMs to learn relationships among image triplets: a ground-level image, its corresponding aerial image and a land cover attribute map. More recently, [200] presented a localization system for cross-view matching where aerial images were captured using UAVs. They developed a semantic blobs-based hybrid representation derived from pixel-wise semantic segmentation, and combined topological, metric, and semantic information for wide-baseline image matching, i.e., forward vs reverse and aerial vs ground. In order to deal with simultaneous extreme variations in scene appearance (day-night) and camera viewpoint (forward-reverse), [201] proposed the Local Semantic Tensor (LoST), a global image descriptor based on pixel-wise semantic labels and spatial aggregation of local descriptors derived from a CNN. They extracted keypoints from maximally-activated locations within the CNN feature maps, and performed semantic keypoint filtering and weighted local descriptor matching to re-rank the place matches for high-precision place recognition. [529] extended this approach with a concatenated descriptor that encoded semantics both explicitly and implicitly, in the form of selective semantic feature aggregation and a deep-learnt VLAD descriptor [530], respectively. 2) Linking Place Categorization with Place Recognition: Bridging the gap between semantic place categorization and the place recognition problem, [531] proposed a 3-layer perception framework using a topological directed graph as a map of the environment. Each node represented a set of semantic regions, and each edge represented a set of rotation-recognition regions. This structural representation enabled the recovery of the optimal path for indoor semantic navigation based on the number of semantic regions encountered. [532] developed a framework for semantic localization that used 3D global descriptors based on a Bag of Visual Words (BoVW) approach [161] to train a classifier for categorizing rooms labeled as unique places. 3) Semantics within High-Level Representations: In this subsection, we discuss the use of image representations that capture shapes and patterns at a higher abstraction level, which may not be strictly semantically meaningful to humans. To some extent, this can be attributed to feature extraction from higher-order layers of CNNs and methods like Edge Boxes [533] that provide a "semantically-aware" representation of whole images or image regions. [534] explored the use of Edge Boxes to detect and represent "ConvNet Landmarks" for visual place recognition. [535] represented the environment using a co-visibility graph of semantic image patches [533], where an edge was established only if patches were observed in the same image, unlike the common notion of edges as spatial relationships.
[536] presented Semantically-Aware Attentive Neural Embeddings (SAANE), which fuse appearance features with higher-order layers of a CNN to learn embeddings for matching places under a wide variety of environmental conditions. [537] demonstrated that the higher-order layers of CNNs encode semantic information about a place, which can be utilized for partitioning the descriptor search space. [538] showed that the fully-connected layers of a place categorization CNN, being semantics-aware, exhibit viewpoint-invariance and hence enable place matching from opposing viewpoints, even with the additional challenge of significant appearance variations. The use of additional modalities like pixel-wise depth for visual place recognition has also been explored [539]-[545], where meaningful representations in the form of surfaces [539], planes [546], and lines [540], [545] have been shown to improve VPR performance. [547] proposed Fine-Grained Segmentation Networks (FGSN), based on k-means clustering of CNN embeddings for self-supervised pixel labeling and 2D-2D point correspondence. They showed that fine-grained clustering, even though non-semantic, is more suitable for visual localization than methods based on a limited number of semantic classes, for example, the 19 classes in Cityscapes [112] and the 66 in the Vistas dataset [548]. However, it was also observed that pre-training on semantic segmentation networks improved performance significantly, suggesting that semantics would still play a key, albeit different, role. Methods developed primarily in other research fields like computer vision often do not readily translate to robotic systems [549]. Many critical issues for successful robotic deployment are not addressed, or not prioritized, in the research fields from which robotics draws much of its inspiration. Robotic deployment has a range of additional constraints and opportunities compared to much dataset-based research, including the availability of multiple sensing modalities, limited computational resources, a focus on real-time and online deployment, and problems such as obscuration, clutter and uncertainty that are not adequately encountered in dataset-based research. Hence, further innovation is typically required to bridge the gap between a laboratory or dataset-demonstrated system and a deployable solution for robots "in the wild". In this subsection, we cover some of the relevant research bridging this divide that has not already been covered in previous sections. 1) Efficiency: To classify 3D objects in real time, Maturana and Scherer [317] developed a simplified version of a 3D CNN [550] for real-time object recognition. [551] proposed RT3D: a real-time 3D vehicle detection method based on pre-ROI-pooling convolutions [552] that accelerate the classification stage, thus completing detection in real time with detection accuracy comparable to the state-of-the-art. More recently, [132] demonstrated real-time panoptic segmentation using a single-shot network based on a parameter-free mask construction operation that reuses dense object predictions via a global self-attention mechanism. 2) Noise, Occlusions and Clutter: In order to deal with noise, occlusion and clutter in robotic sensing and point-based representations, [553] designed 4D CNNs for object classification that process robust point-pair based shape descriptors [554]-[558] represented as 4D histograms.
Occlusions also pose a challenge for object-based representations of the environment and are an active area of research in the context of object-based SLAM [559]-[564]. Moreover, occluded objects affect the evaluation of object detection systems [565] and require novel error measures to deal with partially-visible objects [283]. To deal with background clutter, Wang et al. [566] proposed first removing the background, posing it as a binary classification task, before segmenting the point cloud data into semantic classes of interest for autonomous driving like cars, pedestrians and cyclists. 3) Cost: As a cost-effective alternative to a LiDAR-based 3D point cloud, Pseudo-LiDAR [567] has been proposed to represent pixel depth estimated using stereo cameras as a 3D point cloud, leading to a significant performance improvement for 3D object detection. In subsequent work, the authors presented Pseudo-LiDAR++ [568] to improve depth estimation of faraway objects while also incorporating a depth signal from cheaper LiDAR sensors that typically have sparse 3D coverage. [569] extended the concept of Pseudo-LiDAR to monocular systems, while also addressing the numerical and computational bottleneck of the dense Pseudo-LiDAR point cloud. Recently, Qian et al. [570] developed an end-to-end Pseudo-LiDAR framework based on differentiable Change of Representation (CoR) modules to further improve detection accuracy. 4) Uncertainty Estimation: Uncertainty estimation is essential for semantic understanding and decision making during mapping and interaction. Current approaches for uncertainty estimation include approximations of Bayesian deep learning [571] such as dropout sampling [572], deep ensemble methods [573] and the recently proposed Stochastic Weight Averaging-Gaussian (SWAG) [574]. For object detection, Monte Carlo dropout sampling was used in [575] to measure label uncertainty. Other works proposed estimating the uncertainty in detecting the location of the object in the image, introducing the probabilistic bounding box [576]. In [577], an alternative representation of probabilistic bounding boxes was introduced through the spatial distribution of a generative bounding box model. In [578], probabilistic bounding boxes were generated by replacing the standard non-maximum suppression (NMS) step in the object detector with a Bayesian inference step, allowing the detector to retain all predicted information for both the bounding box and the category of a detected object instance. Dropout sampling for semantic segmentation can be slow for robotics applications, prompting [579] to propose a Region-based Temporal Aggregation (RTA) method which leverages temporal information using a sequence of frames. Other works expressed uncertainty as a measure of quality of the predictions, such as predicting the quality of the IoU for semantic segmentation [580] or the per-frame mAP (mean average precision) in the case of object detection [581]. Recently, [582] showed that unseen object categories can be mislabeled by state-of-the-art semantic segmentation networks with high certainty. To mitigate this, the authors presented the Fishyscapes dataset for pixel-wise uncertainty estimation in autonomous driving, enabling the detection of such anomalous objects. 5) Multi-modal and Non-Vision Based Approaches: Given that robots typically carry a suite of sensors, effective integration of information from multiple modalities is also an active area of research.
Towards this goal of multi-modal integration, [583] developed a multi-modal semantic space labeling system leveraging information from camera, laser scanner, and wheel odometry sensors. This work was further extended in [584] to incorporate high-level information as "properties" of a place, defined in terms of shape, size, appearance, and the (binary) presence of a doorway. While using multiple modalities, the authors of [585] concluded that although combining features from depth and color drastically reduces uncertainty, depth information contributes more to performance due to its inherent illumination invariance. For synchronized modalities, [586] proposed a Local N-ary Patterns (LNP) descriptor to describe relationships among neighboring pixels of reflectance and depth images, as an extension of the Local Binary Pattern descriptor [587]. As a general framework for semantic place categorization using any sensor modality, [249] demonstrated the superiority of temporal inference based on Dynamic Bayesian Mixture Models (DBMMs) over a commonly used SVM approach. With the success of deep learning-based methods, which are amenable to early, intermediate and late fusion techniques, the use of multiple modalities has become much more accessible [588], including in semantics-based approaches, and the complementary properties of the modalities can be harnessed via multi-modal fusion. Recently, Feng et al. [589] reviewed state-of-the-art deep learning-based approaches for object detection and semantic segmentation that employ multi-modal fusion, particularly exploring answers to the "what, when and how to fuse" questions. Chen et al. [590] proposed a 3D object proposal method using stereo cameras, posing it as an energy minimization problem exploiting object size priors and depth-informed features like height above the ground plane, free space and point cloud density. In subsequent work [591], the authors presented the MV3D (Multi-View 3D) object detection network, which combines RGB images with a LiDAR's front and bird's-eye views (BEV) to enable effective multi-modal fusion based on interactions between different layers of the network. A similar multi-modal fusion is proposed in [592], dubbed AVOD (Aggregate View Object Detection), with a novel use of high-resolution feature maps, 1 × 1 convolutions and a look-up table for 3D anchor projections, achieving real-time, memory-efficient and high detection performance. Vora et al. [593] presented PointPainting as a way to "decorate" the LiDAR point cloud with the semantic segmentation output of color images for improved 3D object detection, addressing the limitations of previous fusion concepts that led to "feature blurring" [594], [595] and limited maximum recall [596]-[598]. A subset of research in place categorization semantically categorizes places using sensors other than those used for visual perception. With the increasing focus on the Internet of Things and the ubiquity of edge-based sensors, upcoming solutions for semantic place categorization might be able to leverage these additional sources of information, along with existing approaches that effectively use multi-stream data. [599] proposed place classification of geographical locations based on individual demographics, timing of visits, and nearby businesses, using government diary data. The demographic and temporal features of individual visits included age, gender, arrival and departure details, and season. The place categories included home, work, restaurant, library, place of worship and recreation.
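To make this setup concrete, the following is a hedged sketch of a visit-feature-based place classifier; the feature encoding, toy data and choice of a decision tree are invented for illustration and are not taken from [599]:

```python
from sklearn.tree import DecisionTreeClassifier

# Each visit is encoded as [age, gender, arrival_hour, duration_hours, season],
# with categorical values mapped to integers; all values here are made up.
X = [
    [34, 0, 8, 9, 2],   # long weekday-morning stay
    [34, 0, 19, 2, 2],  # short evening visit
    [61, 1, 10, 1, 3],  # brief mid-morning visit
    [28, 1, 22, 8, 0],  # overnight stay
]
y = ["work", "restaurant", "library", "home"]  # semantic place categories

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Predict the category of an unseen visit (~8 hours starting at 9am).
print(clf.predict([[45, 1, 9, 8, 1]]))  # predicted place category for this visit
```

The value of such systems comes less from the classifier itself than from the semantic vocabulary of places it produces, which downstream mapping and interaction modules can share.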
[600] extended the work of [599] by also considering additional cues like sequential information in the form of the periodicity of individual visits, cross labels extracted from multiple visits to the same place, travel distance, and place co-occurrences. [601] presented a semantic place classification system using the GPS trajectories of users. Their method is based on visit-level features including day of week, time of day, duration, and response rate, which were then used to infer place-level features. [602] developed a semantic place labeling system based on multiple sensor sources including Bluetooth, smartphone, GPS, motion activity, WLAN and time, where place categories included home, work, friends and family, nightlife and education. Depending on the application type, the use of semantics can vary significantly. For example, [603] proposed a particle filter-based semantic state estimation system for semantic mapping and navigation of agricultural robots. Here, the "semantic states" (classes) were defined according to the topology of crop rows, considering the "side", "start", "end", and "gap" as the semantic elements of the agricultural field. Semantics are clearly a promising avenue for enhancing the capabilities and utility of robotic systems. At this point in time, however, the uptake of systems that make significant use of semantics varies considerably across different application domains, from autonomous vehicles to service robots to augmented reality applications. There is also a range of technology advances that will likely facilitate further progress in semantics-based robotics, from advances in compute capability (with an associated reduction in cost, bulk and power consumption, all relevant benefits where robotic systems are concerned) to online, "in the cloud" computational and data resources. In this section we cover these practical considerations, discussing current and future application scenarios for semantics in robotics and the technology advances that will support them into the future. A number of robotic and autonomous platforms employing some degree of semantics are in use today, with a much larger number at the research or proof-of-concept stage. Robotic and autonomous platforms span a wide range of application scenarios including domestic services: house cleaning (iRobot Roomba [604]), lawn mowing (Robomow RS [605]), pool maintenance (Zodiac VX65 iQ [606]); scientific exploration: deep-sea underwater exploration (AUV Sentry [607], [608]), planetary rovers (Curiosity [609], [610]); hospitals: intelligent transportation (Flexbed [611]), socializing with patients [612], surgical procedures (Da Vinci [613], [614]), and providing care [615], [616]; social interaction (Nao [617], Pepper [618], Vector [619]); telepresence (Beam [620]); hospitality [621]; shopping centers and retail [622]-[624]; office environments [625]; logistics and fulfillment [626]; last mile delivery (Starship [627]); and many others [628]. In the following sections, we discuss the major application areas from the perspective of different autonomous platform types (UAVs, Service Robots, Static Platforms) as well as particular application domains (Autonomous Driving, Augmented Reality and Civilian Applications).
1) Drones and Unmanned Aerial Vehicles (UAVs): Drones and UAVs represent one of the most resource-constrained robotic platforms, where balancing factors like flight time, payload capacity, cost, power, real-time operation and sensor choice requires careful consideration. As the application and operating scenarios can vary significantly, but are generally more constrained compared to, say, autonomous vehicles, research on the use of visual semantics on drones is still relatively preliminary. Numerous benchmark datasets have focused on UGVs, leading to significant advancements in research enabling semantic scene understanding [112], [519], [629]. However, the same is not true for UAVs. [630] recently presented a UAV object detection and tracking benchmark that particularly highlights the contrast between the challenges of vision research for UAVs versus UGVs. The primary differences arise from variations in object density per frame, the relative sizes of observed objects, viewpoint, motion and real-time requirements. To further research efforts in this domain, [631] released the VisDrone2018 benchmark challenge for object detection and tracking, with 2.5 million annotated object instances across 180K video frames. [632] presented UAVid, an urban street scene segmentation dataset particularly aimed at high-resolution slanted views from low-altitude flights, in contrast to the more common top-view-based datasets [633]-[636]. In a similar vein, [637] released AeroScapes, an aerial-view dataset of 3269 images with dense semantic annotation, captured using a fleet of drones. The scarcity of labeled data has also motivated researchers to explore alternative ways to utilize existing data and approaches for semantic scene understanding. The authors of [637] used an ensemble approach to semantic segmentation of aerial drone imagery, based on knowledge transfer via progressive fine-tuning through different source domains. Similarly, [638] explored the use of GANs [639] for unsupervised domain adaptation to improve cross-domain semantic segmentation in aerial imagery. With the availability of better benchmark datasets and tools, there is the potential for advances in semantic scene understanding for UAV applications mirroring those made for road-based vehicles. Currently, UAVs are used in a diverse range of application scenarios that differ in both operating environment and end task. [640] demonstrated a Search And Rescue (SAR) system to detect and track objects on ocean surfaces, where accessibility by other means is typically impractical. Similarly, natural environments that are prone to disasters like volcanic eruptions, forest fires or landslides can be better accessed by UAVs, which can aid in detecting and assessing such situations [641]. Nature conservation and wildlife management are also key areas where UAVs can offer a significant practical advantage over other platforms. [642] presented a nature conservation drone that detects, tracks and counts various animals in the wild. Extending this idea to the use of multiple devices, [643] developed a software platform that enables the use of different multirotors for cattle detection and management, mitigating the need to attach GPS devices to animals. [644] developed the Systematic POacher deTector (SPOT), based on drones, to spot poachers at night. Beyond natural environments, UAVs have also been employed in the construction industry for site management and control [645].
Another distinct application area for UAVs is last mile delivery, which is already well on its way from research to commercial deployment [646]. As both the end task and operating environment vary widely, the use of sensors beyond RGB cameras is common, for example thermal cameras [640], [644] and LiDAR [647]. Autonomous UAV navigation will also benefit from advances in semantic understanding of the world. Driven in significant part by advances in deep learning, autonomous navigation, especially end-to-end navigation, is a topic of active exploration for drones and UAVs. [648] proposed Collision Avoidance via Deep Reinforcement Learning (CAD²RL), where training is performed using only simulated 3D CAD models, demonstrating generalization to real indoor flights. [649] developed DroNet, which takes a monocular image as input and outputs a steering angle along with a collision probability. Distinct from [648] and other methods that use synthetic data for training, [649] explored the use of ground vehicle driving data from city streets and demonstrated generalization to variations in viewpoint (from high-altitude flights) and type of environment (parking lots and corridors). This study in particular shows that there are underlying meaningful semantic concepts that are not specific to a robotic platform or operational domain. Besides using data from simulation or a different application domain for learning, a less common learning paradigm has been explored in [650], based on real crash flights. The authors sampled naive trajectories and crashed into random objects (11,500 times) to form a crash dataset, which was then used to learn a UAV navigation policy, demonstrating a drone's ability to navigate in an extremely cluttered and dynamic environment. Apart from completely end-to-end pipelines, modular approaches have also been explored where perception and control are handled separately [651]-[653]. Similarly, [654] demonstrated autonomous navigation for a racing drone via sim-to-real transfer of a CNN trained to predict goal direction using data generated with domain randomization. More recent work in the field of drone navigation includes learning to fly from a single demonstration [655], deploying navigation capability on mW-scale nano-UAVs [656], and motion planning based on an opponent's reactions [657]. Autonomous UAV navigation requires a high-level understanding of the environment, which can either be learnt via end-to-end training techniques or via modular methods that help to semantically reason about the surroundings. Towards the goal of semantic navigation, [658] employed the Observe-Orient-Decide-Act (OODA) loop decision-making theory [659], [660], incorporating semantic information from the environment into the decision-making process of a UAV to dynamically adjust its trajectory. A semantic reasoning capability can potentially enhance end-task performance in a practical application by adding situational awareness. In this vein, [661] proposed improving object detection and tracking by integrating semantics based on ontological statements into their tracking method, for improved awareness of drones in critical situations [662]. With a similar goal, [663] leveraged the relationships between objects and defined an ontology describing the objects surrounding a UAV to better detect a threat situation; a simplified flavour of this kind of relational reasoning is sketched below.
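The object classes, relations and threat rule in the following sketch are invented for illustration and do not come from [663]; the sketch only conveys the general pattern of checking semantic relations between detected entities:

```python
from math import dist

# Hypothetical detections around the UAV: (class label, 2D ground position in metres).
detections = [
    ("person", (2.0, 1.0)),
    ("vehicle", (2.5, 1.5)),
    ("building", (20.0, 5.0)),
]

# Tiny hand-crafted "ontology": which classes count as agents, and which
# class pairs, when spatially "near", are considered threat-relevant.
AGENTS = {"person", "vehicle"}
THREAT_RELATIONS = {("person", "vehicle"): 3.0}  # max distance (m) for "near"

def detect_threats(dets):
    """Flag pairs of detected agents whose classes and spatial relation
    match a threat-relevant rule in the toy ontology."""
    threats = []
    for i, a in enumerate(dets):
        for b in dets[i + 1:]:
            key = tuple(sorted((a[0], b[0])))
            if a[0] in AGENTS and b[0] in AGENTS and key in THREAT_RELATIONS:
                if dist(a[1], b[1]) <= THREAT_RELATIONS[key]:
                    threats.append((a[0], b[0]))
    return threats

print(detect_threats(detections))  # [('person', 'vehicle')]
```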
Further advances in navigation and scene understanding for UAVs, particularly those based on semantic reasoning, will play a critical role in improving UAV autonomy and broadening the range of applications where deploying drones is feasible. 2) Service Robots: Much robotics research has been driven by a motivation to develop robots that can provide a range of services to humans. Within each of the application areas, there are several aspects of service robotics that need to be taken into account. Recent review and survey articles have discussed the uses, scope, opportunities, challenges, limitations and future of service robots in various contexts: human-centric approaches [664], navigating alongside humans [665], cultural influences [665], needs in healthcare [666], acceptance in a shopping mall [667], consumer responses to hotel service robots [668], requirements of the elderly with cognitive impairment [669], uncertainty in natural language instructions [670], roles in the hospitality industry [671], mechanical design [672], hospitals [673], social acceptance in different occupational fields [674], welfare services [675], restaurant services [676], service perception and responsibility attribution [677], and planning and reasoning in general-purpose service robots [678]. These studies highlight aspects of service robotics which are not necessarily technical but are highly relevant to effective deployment. A commercially viable service robot typically requires a hardware design that improves its utility across a range of application scenarios. For example, the modular design of Care-o-bot [679] enables application-specific modifications like replacing one of the arms with a serving tray, or using just its mobile platform for serving, beyond its prescribed use cases in grocery stores, museums or as a butler. Similarly, PR2 [680] is capable of performing household chores like cleaning a table and folding a towel, as well as fetching objects. Both of these robots are equipped with a sensor suite that enables them to perform autonomous navigation, grasping and manipulation. These competencies are an active area of research, which drives continuous improvement in the robotic solutions available. The underlying software platform for PR2 and many other robots is supported by the open-source Robot Operating System (ROS) [681], making research studies easier and more systematic [682], [683]. Beyond ROS, specialized software development kits are typically available to control robots or access their sensor data, as is the case with Spot [684], a new commercially available service robot that moves by trotting in both indoor and outdoor environments, and is capable of climbing stairs, opening doors and fetching a drink. Apart from the hardware designs and underlying operating kernels, in order to complete a task and provide a service, a robot requires advanced perception and control abilities. This is typically achieved by using multiple sensors and performing tasks in a modular fashion. [685] proposed a multi-sensor-based human detection and tracking system and demonstrated its portability across different robotic platforms using Pioneer and Scitos robots. [625] presented a tea-serving application using a P3DX robot in an office environment, where overall environment awareness was achieved by developing individual competencies in isolation, such as line following, obstacle avoidance, empty tray detection, approaching-hand detection and person following.
Recently, aiming at precision agriculture, [686] presented an intelligent pesticide spraying robot that semantically segments the fruit trees that require spraying. Perception, control and interaction abilities can be further improved using semantic reasoning. [257] demonstrated smart wheelchair navigation using semantic maps, emphasizing enhanced safety, comfort, and obedience due to the use of semantics. [462] presented a semantically-grounded learning-from-demonstration approach for a furniture assembly task using a PR2. [461] developed a semantic-reasoning-based method for the interpretation of voice commands for improved interaction between a service robot and humans. [394], [397], [399] demonstrated the use of semantic representations for human/robot activity understanding using humanoid and industrial robots. More recently, [687] proposed Semantic Linking Maps (SLiM), which exploit common spatial relations between objects to actively search for an object based on probabilistic semantic reasoning. Many service robots that work for or alongside humans require the ability to extract, represent and share semantic knowledge with their users or co-workers in order to perform these services intelligently and sometimes jointly. Semantic knowledge representation in the form of semantic maps enriches the metric and/or topological spatial representation that these robots traditionally carry. Similarly, semantically reasoning about the service task will help close the gap between the way humans and robots understand their environment, which in turn facilitates more natural robot-user communication. 3) Static Platforms: Beyond mobile robotic platforms, a number of practical applications also involve static platforms equipped with perception sensors. We briefly cover these platforms here because it is likely that in the future many robotic systems will operate in a "shared perception" environment where their onboard sensing is extended by static sensing systems. Many of the capabilities initially developed in a surveillance context, especially sophisticated scene understanding driven by semantics, are also likely to be relevant when ported to robotic platforms. Visual surveillance is one such example. Video surveillance via Closed-Circuit TeleVision (CCTV) has long been used around the world, with millions of cameras installed. Recent surveys have highlighted the diverse use cases of visual surveillance and its research advances [688]-[695]. The application areas for visual surveillance are vast and pose unique challenges depending on the task at hand, many of which are relevant to robotics. [690] highlighted the use of visual surveillance in natural environments, for example, studying the social behavior of insects in a group [696], analyzing climate impact based on interactions between organisms [697], behavioral adaptability under tough environmental conditions [698], and inspiring robots to mimic the locomotion of certain animal species [699]. The authors also discussed the methods and challenges of background subtraction, often the first step in the process, particularly when tracking a moving object under challenging conditions like those posed by a marine environment. [700] presented the applications and challenges of visual surveillance of human activities.
The authors covered a wide range of uncontrolled environment scenarios: traffic flow monitoring [701], detecting traffic incidents [702] and anomalies [692], vehicle parking management [703], safety operations at public places like train stations [704] and airports [705], maritime administration [706] and retail store monitoring [707]. As with natural environments, monitoring human activities in man-made environments poses similar challenges, including background subtraction when tracking a moving object. Furthermore, as human activities can be unpredictable and false negatives (failing to detect an incident) carry more serious implications, visual surveillance in this context demands particularly robust, high-performance techniques. Irrespective of the operating environment, the key enablers of a visual surveillance system are the fundamental abilities to holistically and continuously understand the visual observations. These basic abilities are major areas of research and directly impact the performance of the surveillance system. Research includes gait recognition [691], [708], person re-identification [709], [710], face recognition [711]-[713], single-object tracking [714], multi-object tracking [715]-[717] especially in dense crowded scenes [718], human activity recognition [695], tracking across non-overlapping cameras [705], object detection especially in the context of abandonment and removal [719], and incorporating contextual knowledge [707]. Many of these capabilities are also critical to robotic platforms, especially as they operate in human-rich environments. The fundamental abilities required for visual surveillance, and the challenges associated with them, together dictate the need for better approaches based on semantic reasoning. Current solutions tend to address challenges that are often specific to the operating environment and the type of object being tracked. A more generalizable solution would require semantic scene understanding, inferred either from the observed scene or through the use of knowledge databases. With semantic reasoning, the fundamental task of detecting and tracking an entity of interest can likely be performed in a more robust and generalized manner. 4) Autonomous Vehicles: Industry has been largely responsible for increased research activity in the autonomous vehicle space, as evidenced by an increasing number of surveys and reviews dedicated to autonomous driving. Autonomous driving is a great test case for semantics, because it has rapidly become clear that classical robotics techniques like SLAM are insufficient by themselves for enabling autonomous driving, and that a richer understanding of the environment is almost certainly required. Recently, [589] reviewed multi-modal object detection and semantic segmentation for autonomous driving, describing various methods, datasets and on-board sensor suites while highlighting associated challenges. [18] surveyed 3D object detection methods for autonomous driving applications. [720] presented a survey of deep learning techniques for autonomous driving. [721] reviewed the available datasets for self-driving cars. [722] reviewed problems, datasets and state-of-the-art computer vision methods for autonomous vehicles, and [723] reviewed driver-assistance systems.
The tremendous push for autonomous cars from different companies has led to the availability of enormous labeled datasets captured from fleets of vehicles equipped with many sensors [519], [724], [725]. Both industrial and academic researchers have invested significant effort in understanding and solving the challenges of developing high levels of driving autonomy. Scene understanding based on benchmark datasets has its limitations, as it does not account for a number of scenarios which an autonomous robot might encounter in the real world. For example, detecting a "stop sign" may not be trivial due to occlusions, weather and lighting conditions, infrastructural variations, special sub-categories and temporary roadblocks where humans carry the stop signs [726], [727]. One practical solution to such object recognition problems is active learning, where approximate detectors are used to automatically sample more data for low-confidence predictions, especially under a map-vision disagreement, in order to train better classifiers or handle special cases, as adopted by Tesla, Inc. [726]; a simplified version of this loop is sketched below. The Tesla talk [726] also highlights how general evaluation methods are supplemented by curated unit tests for measuring the scene understanding capability of an autonomous car before deploying an updated solution. Furthermore, semantic scene understanding for autonomous driving is at times limited by differences in road infrastructure and traffic rules across different geographies, for example, left- versus right-lane driving and differences in the positioning of traffic lights near intersections. Therefore, learning general trends via deep learning often needs to be complemented by incorporating prior knowledge in the algorithm implementation and sensor specification [728]. Given the range of challenges in enabling autonomous driving, some focus has shifted to task-specific and highly engineered solutions. Such approaches include multi-modal object detection (Uber) [594] and semantic segmentation (Bosch) [589], Pseudo-LiDAR [567] based 3D object detection (Huawei) [569], multi-view 3D object detection (Baidu) [591], sequential fusion (nuTonomy) [593], monocular 3D object detection [729], and learning near-accident driving policies (Toyota) [730]. In particular, application-specific biases or assumptions have proven to be of benefit. [729] proposed 3D object detection from a single monocular image in the context of autonomous driving, leveraging various priors suited to the application, for example, the size, shape and location of objects, ground plane context and semantic class selection. [731] presented SegVoxelNet, which utilizes semantic and depth context for 3D object detection using LiDAR point clouds to better identify ambiguous vehicles on the road. Researchers are making good progress towards solving the more specific and narrow problems within the autonomous driving field [551], [566], [590], [592], [729], [732]. Autonomous driving and driving-assistance systems need to solve a variety of fundamental and high-level tasks. In order to make sense of raw sensor data, continuous effort is being made to improve semantic scene understanding using various techniques including multi-task fusion [733], multi-sensor fusion [594], [733], panoptic segmentation [128], [130] and open-set instance segmentation [141].
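As a concrete illustration of the active-learning pattern mentioned above, the following hedged sketch mines low-confidence detections and map-vision disagreements for later labeling; the detector interface, map interface and threshold are hypothetical placeholders rather than any particular company's pipeline:

```python
CONF_THRESHOLD = 0.6

def mine_hard_examples(frames, poses, detector, hd_map):
    """Collect frames worth labeling: low-confidence detections, or a
    disagreement between the map and what the detector actually sees."""
    to_label = []
    for frame, pose in zip(frames, poses):
        detections = detector(frame)                    # [(label, confidence), ...]
        labels = {lbl for lbl, _ in detections}
        low_conf = any(conf < CONF_THRESHOLD for _, conf in detections)
        # Map says a stop sign should be visible here, but none was detected.
        disagreement = ("stop_sign" in hd_map.expected_signs(pose)
                        and "stop_sign" not in labels)
        if low_conf or disagreement:
            to_label.append(frame)
    return to_label

# Tiny mock objects so the sketch runs end-to-end.
class MockMap:
    def expected_signs(self, pose):
        return {"stop_sign"} if pose == "intersection" else set()

mock_detector = lambda frame: [("vehicle", 0.9)]  # never reports a stop sign
frames, poses = ["frame_0", "frame_1"], ["highway", "intersection"]
print(mine_hard_examples(frames, poses, mock_detector, MockMap()))  # ['frame_1']
```

The mined frames would then be sent for human annotation, added to the training set and used to retrain the detector, closing the loop on the rare or ambiguous cases that benchmark datasets under-represent.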
With enhanced semantic reasoning, higher-level tasks for autonomous driving can be better addressed, for example, motion planning [734], [735], 3D object tracking [736], multi-object tracking [718], pedestrian behaviour prediction [194], long-range human trajectory prediction [737], turn signal-based driver intention prediction [195], predicting a pedestrian's intention to cross [738] and intention prediction in the form of trajectory regression and high-level actions [739]. Furthermore, for real-time operation, efficient methods are being developed for semantic segmentation [740], [741], 3D object detection [742], [743], online multi-sensor calibration [744], semantic stereo matching [144] and object tracking [743]. As current systems rely on high-definition maps for localization and for detecting road infrastructure and fixed objects, compressing such maps for efficient operation is also being explored [745], [746]. Semantic mapping, particularly suited to autonomous driving on roads, is being explored at deeper levels involving lane detection in complex scenarios [747], road centerline detection [748], [749], lane boundary estimation [750], complex lane topology estimation [196], drivable road boundary extraction [751], crosswalk drawing [752], and aerial view-based road network estimation [753]. Recent research has also demonstrated that using semantic maps can improve performance at tasks like object detection [754] and localization [755]. Autonomous driving is a challenging task, and one that is likely to rely heavily on vehicles having a deep and accurate understanding of how the world around them functions. As research in this field continues, it will be interesting to see how far progress can be made with a road-driving-specific focus versus a broader understanding of the world, and how many of the insights gained are then relevant to other robotic domains like service robotics and drones. Will semantic advances for autonomous vehicles provide the final piece of the puzzle for safely and widely deployable, highly autonomous on-road vehicles? And will these advances be of high relevance to other robotic domains, or will there be domain-specific aspects of how semantics are learnt and used? 5) Civilian applications: A number of vision- and robotics-based solutions have been developed with primary applications in civilian society. These include visual surveillance, robotic search and rescue operations, social interactions, and disaster management. Beyond static surveillance, discussed in Subsection VI-A.3, visual surveillance based on moving cameras is also becoming more common [756]. This not only increases the scope of surveillance applications but is also now more feasible due to the availability of better hardware and software technologies. Examples include the body-worn cameras of police or defence personnel [757] and crowd monitoring using drones or ground robots. Their advanced sensing capability and maneuverability, along with low cost and size, make UAVs a great choice for a standalone surveillance system [758], although their effect on privacy and civil liberties remains an active topic of discussion [759]. Visual surveillance using UAVs has been explored in various contexts including change detection [760], scene classification for disaster detection [641], construction site management [645], and border patrol [761]. Similarly, ground robots (UGVs) have also been used for indoor [762] and outdoor surveillance [763].
[756] highlighted some of the key challenges of visual surveillance based on moving cameras, including dealing with abrupt motion variations, occlusions, power and compute time constraints, and increased background complexity. Apart from surveillance, another important class of civilian applications is disaster and crisis management. [764] reviewed the use of robots in this context. A recent example is the last-mile delivery operations run by Starship Technologies [627] during the COVID-19 pandemic. Similarly, airborne deliveries by Wing [646] and other such platforms can help in providing critical supplies during floods, landslides or bushfires, where ground accessibility is poor or non-existent. For operation in disaster situations, remote teleoperation is still a primary control mechanism, so mitigating its corresponding challenges is also important. In this vein, [765] explored the use of arbitrary viewpoints to mitigate blind spots for improved teleoperation of a mobile robot. Beyond vision-based sensing, [766] used ARMatron, a wearable glove, as a gesture recognition and control device to better communicate with a remote robot. Furthermore, disaster management scenarios also require better communication protocols for robots to avoid system failure; for example, a robot-drone-fog device team may communicate by creating an ad-hoc network in a search and rescue mission [767]. Finally, sensor fusion is also a critical requirement for many such tasks, as demonstrated by [768] for robot-assisted victim search. The end goals of most robotic applications require both detection and action processes to work in tandem. Robots need to have robust means of understanding their environment, and that understanding has to be consistently verified as the robot acts. As also discussed previously, situational awareness can be attained through semantic reasoning [662], [663]. Furthermore, semantic understanding of the robot's surroundings also enhances the possibilities for interacting with those surroundings, including with humans. 6) Augmented Reality: Over the past decade, Augmented Reality (AR) applications have become a relatively common phenomenon. Augmented reality is highly relevant to robotics and is covered here because many of the capabilities, such as localization, are shared across both areas, and both rely on an enhanced understanding of the environment that moves beyond simple geometry. [769] recently reviewed AR in depth, covering its various components ranging from tracking and display technologies to design, interaction, and evaluation methods, while also highlighting future research directions. Other recent surveys and reviews include a summary of major milestones in mobile AR since 1968 [770]; opportunities and challenges facing AR [771], [772]; AR trends in education [773] and browsing [774]; AR in the medical area, including developments and challenges [775], radiotherapy [776], image-guided surgery [777], neurosurgery [778], endoscopic surgery [779], laparoscopic surgery [780], and health education [781], [782]; and general trends in AR research spanning from 2008 to 2018 [783], following a previous decade-spanning survey [784] and other older surveys [785], [786]. As noted in [769], early attempts at developing an AR prototype date back to 1968 [787], and the current widespread use of AR can be witnessed in games, education and the marketing industry.
While some of the initial work in the field of AR was based primarily on sensors like gyroscopes, GPS, and magnetic compasses [788]-[794], vision-based localization has gradually become one of the key areas of research focus, in order to achieve high accuracy and a more immersive experience [795]-[800]. With visual SLAM and high-level scene understanding strongly tied to AR, the latter has long been part of the research motivation for the vision and robotics research communities. Real-time dense visual SLAM methods, for example KinectFusion [216], [801], paved the way for accurate surface-based reconstruction, enabling more immersive AR experiences in real time. Doing away with non-parametric representations [216] by instead using semantics in the form of known 3D object models, SLAM++ [198] demonstrated a context-aware AR application using an object-based SLAM pipeline, where virtual characters performed path finding, obstacle avoidance, and sitting on real-world chairs. In subsequent work [802], the authors developed a real-time dense planar SLAM system and demonstrated its use in augmenting planar surfaces, for example, replacing carpets and styles on the floor and overlaying a Facebook "wall" web page on a real wall. Further to that work, [803] proposed ElasticFusion, a surfel-based, map-centric approach to real-time dense visual SLAM and light source detection, leading to more realistic augmented reality rendering. The application areas of AR are vast, for example, in education: teaching electromagnetism [804]; in e-commerce: Webcam Social Shopper [805], a virtual dressing room software package, and IKEA's virtual placement AR system [806]; in retail stores: Magic Mirror [807] by Charlotte Tilbury; in games: Pokémon GO's popular Pokémon interactions [808]; in navigation: Live View by Google Maps [809]; and in medicine: VeinViewer, a subsurface structure visualizer [810], TRAVEE, an augmented feedback system for neuromotor rehabilitation [811], PalpSim, a needle insertion training system [812], training for planning tumour resection [813] and 3D intraoperative imaging and instrument navigation [814]. Image/3D registration is one of the critical components for enabling an accurate AR application. In medical procedures, this task is difficult due to a limited field of view, rapid motion, organ deformation, lack of sufficient texture, occlusion, clutter, and the inability to add artificial markers. Visual SLAM techniques, which are primarily aimed at autonomous navigation for mobile robots, typically make assumptions that limit their transferability to other domains, leaving significant research to be done for medical applications [815]-[817]. In an attempt to further advance this technology, a number of software platforms like ARToolKit [818], [819], Vuforia [820], ARCore [821] and EasyAR [822] have been developed that help in rapidly enabling mobile AR applications. Head-mounted display devices, for example optical see-through devices like Google Glass [823] and HoloLens [824], are relatively mature AR devices in terms of a combined hardware and software solution. Such technology typically demands robust, accurate and real-time semantic scene understanding. As with autonomous vehicles, much progress has been made by making environment-specific scope assumptions.
Current capabilities in areas like scene understanding are in part due to advances that have occurred in parallel with research developments, for example, better hardware that enabled the use of GPUs, and the availability of large-scale datasets. Similarly, further progress in semantic mapping and representations will rely on some of the upcoming enablers and enhancers that will facilitate the development of systems capable of high-level semantic reasoning. These enablers include better computational resources and architectures at both the hardware and software levels; effective and more common usage of knowledge databases and the ability to better leverage the diverse range of existing sensor data; advances in the field of online, networked and cloud robotics and the Internet of Things (IoT); and novel mechanisms for enabling human-robot interaction involving uncommon sensing and engagement capabilities. These technologies are active areas of research, and in the following sections we highlight how they have been, or could potentially be, integrated into robotics to enable semantic reasoning, ultimately leading to more scalable, sophisticated and robust robotic operation. 1) Computational Resources and Architectures: The growing computational needs of deep learning-based techniques, including those developed for semantic scene understanding, have led to the rapid development of improved hardware and computational resources. Notably, GPUs continue to advance, with recent hardware changes being increasingly focused on facilitating deep learning and related techniques, rather than just catering to the traditional consumer gaming, media production and modelling users. Key performance properties relevant to deep learning include memory bandwidth, total memory capacity, clock frequency, and architectural considerations that improve typical operations like inference. All of these properties continue to advance rapidly from year to year. [8] highlighted the gap between the performance requirements of embodied devices and their practical constraints, indicating the need for co-design of algorithms, processors and sensors. The high compute requirements of visual SLAM and 3D reconstruction algorithms often make extensive evaluation difficult, especially for typical university research labs, so evaluation is often limited to qualitative visualizations and accuracy estimation on a few datasets. This trend hinders the discovery of algorithm-level design choices that could lead to better trade-offs between accuracy, compute time, power consumption and output quality. Hence, better design-space exploration strategies continue to be explored that involve both hardware and software, particularly taking into account the latter's low-level design [825]-[829]. When optimizing for computational and power performance, data communication overheads (sensor-to-processor and processor-to-memory) play a major role by adding latency. [830] introduced a vision chip based on a processor-per-pixel arrangement [831] for low-level image processing tasks that are inherently pixel-parallel in nature. Such vision chips ensure that data is processed adjacent to the sensor, thus reducing data transfer costs. These vision chips belong to the class of Focal-Plane Sensor-Processor (FPSP) chips [832]. For example, the SCAMP-5 vision chip [833] operates on image-wide register arrays, enabling transfer of a full image from sensor to processor array in one clock cycle (100 ns) and delivering 655 GOPS at 1.2 W power consumption [834].
Recently, [835] demonstrated the use of this chip with a novel energy-efficient CNN, resulting in significant gains in compute time (85%) and energy (84%) efficiency, while achieving high accuracy (90%) at a very high frame rate (3000 fps) for a handwritten digit recognition task. Besides FPSP chips, neuroscience-driven vision chips [836], [837] and Field Programmable Gate Arrays (FPGAs) are increasingly popular as a means to accelerate deep learning-based applications, as has been reviewed recently [838]-[843]. Neuromorphic vision chips have been discussed in detail in [844] and more recently in [845], where the latter discussed both frame-driven and event-driven vision chips. [846] comprehensively reviewed neuromorphic computers, devices and model architectures that at some level model neuroscience to solve challenging machine learning problems. Neuromorphic computing provides opportunities to implement neural network-based applications, particularly those based on Spiking Neural Networks (SNNs), directly in hardware, thus offering significant advantages in terms of power, compute time and storage footprint. This hardware is typically implemented as an FPGA [847] or an Application-Specific Integrated Circuit (ASIC) [848]. Robotic applications, especially those requiring autonomous navigation competencies, have been explored with the use of spiking neural networks [849]-[852]. Furthermore, existing CNNs can be converted into SNNs and mapped to spike-based hardware architectures while retaining their original accuracy levels, as demonstrated on the object recognition task in [853]-[855]. This opens up opportunities to better leverage advances in these hardware platforms, while also motivating researchers to bridge the gap between the state of the art in learning using ANNs and SNNs. 2) Knowledge Repositories and Databases: The availability of knowledge repositories, datasets and benchmarks has always proven to be an accelerator of research in both the computer vision and robotics communities. While knowledge databases help in providing contextual information and designing semantically-informed systems, sensor data in the form of images, videos, depth and inertial measurements helps in developing data-driven, learning-based systems. With deep learning such an integral component of most semantics-related research, it has become even more important to obtain accurate and large-scale labeled data to enable deep, supervised learning, for example, in the case of object detection and semantic segmentation. The recent focus has also been on creating equivalent benchmark datasets for 3D semantic scene understanding. In Table I, we list a sampling of datasets that are targeted at enabling semantic understanding by solving diverse tasks, for example, object detection in 2D [548], [856] and 3D [519], [857], semantic segmentation in 2D [112] and 3D [135], [858], semantic place categorization [59], robotic manipulation [411], road semantics [749], multi-spectral semantics [139], UAV-view semantics [637], 3D object shapes [322], multi-object tracking [716], pedestrian intention prediction [738], pedestrian locomotion forecasting [859], collision-free space detection (drivable vs non-drivable) [860], Near-InfraRed segmentation [861], and semantic mapping [862].
This list of datasets is by no means comprehensive, and more details can be found in recent dataset review articles [22], [589], [721], [863] and online [864], although no single source provides a complete list, emphasizing the disconnect between the use of semantics for different tasks and applications. Table I mainly highlights the diversity in semantic understanding tasks, which are typically related to each other but often evaluated in isolation. It suggests that any future general semantic capability for robots might also require a more unified approach to the development and use of these datasets [749].

TABLE I: A sampling of datasets targeted at semantic understanding tasks. Each entry lists the dataset (year) and reference, the type of annotation provided, and the scale and content of the data.
- [318]: Pixel-wise labels along with a 3D mesh; 1 million outdoor 3D points forming several Haussmannian-style facades, with 8 semantic classes.
- MANIAC (2015) [411]: Semantic annotations of manipulation activities; 8 unique activities, manipulating 30 objects.
- SUN RGB-D (2015) [856]: Polygons and bounding boxes with object orientations (RGB-D) and scene category; 10,000 indoor RGB-D images with 146,617 2D polygons and 58,657 3D bounding boxes.
- SUNCG (2016) [335]: Volumetric and object-level labels for synthetic indoor scenes; 45,000 scenes with realistic room and furniture layouts.
- SceneNN (2016) [857]: Scene meshes labeled per-vertex and per-pixel, with object poses and bounding boxes; 100 indoor scenes.
- Cityscapes (2016) [112]: Pixel- and instance-wise labels; 5,000 fine and 20,000 coarse annotations for 30 semantic classes from outdoor scenes across 50 cities.
- ScanNet (2017) [865]: 3D camera poses and densely labeled, object instance-level surface reconstructions (RGB-D); 2.5M views in 1513 indoor scene sequences.
- SEMANTIC3D.NET (2017) [866]: Labeled 3D point clouds; 4 billion manually labeled points and 8 semantic classes from urban outdoor scenes.
- Freiburg Forest (2017) [139]: Multispectral and multimodal pixel-level labels.
- Approximately 70,000 images and 695 million points from large-scale indoor spaces (6,000 m²).
- Places (2017) [59]: Place-level labels; 10 million images with 400+ semantic place categories.
- Robot@Home (2017) [862]: Room-, object- and point-level labels of 3D indoor reconstructions (RGB-D); 36 rooms (8 categories) and 1900 objects (57 categories) observed in 69,000 images captured from traverses of a mobile robot at home.
- PrOD (2020) [576]: Pixel- and object-level labels and bounding boxes for synthetic images; indoor image sequences captured from multiple robots in a domestic environment.
- A2D2 (2020) [519]: Pixel- and object-level labels and 3D bounding boxes (RGB-D); 41,280 frames with 38 semantic classes captured from multiple outdoor traverses.

Semantics-based approaches typically rely on pre-learnt material from carefully curated datasets. As the range, size and variety of these datasets improve, so likely will the outcomes. An alternative means of incorporating prior information is to make use of a pre-defined ontology and common knowledge with the help of knowledge graphs [868], [869] and ontology languages [870]-[872]. In the context of robotics, several researchers have explored different ways of encoding the relevant knowledge and proposed corresponding ontologies.
As highlighted in [873], these ontologies have been applied in different ways: path planning and navigation [874]-[876], describing an environment [877], task analysis for autonomous vehicles [878], task-oriented conceptualization [879], as meta-knowledge for learning from the heuristics of a system [880], policies to govern behavior [881], describing robots and their capabilities [882], control architecture concepts [883], and characterizing sub-domains within robotics [884]. Furthermore, beyond ways of representing knowledge [885]-[887], frameworks for its management [888] and processing [889] have also been explored. Recently, [890] reviewed ontology-based knowledge representations for robotic path planning. The use of knowledge representation and ontologies has been explored in semantic mapping for robots. [891] proposed an "anchoring" process to link the spatial and semantic information hierarchies in a semantic map representation. For representing knowledge, the authors used the NeoClassic system [892] to encode the semantic (conceptual) information hierarchy. [893] presented a hierarchical map representation where the highest level of abstraction forms the conceptual layer, representing knowledge in the form of abstract concepts (a TBox with terminological knowledge) and instances of such concepts (an ABox with assertional knowledge). Using an OWL-DL ontology of an indoor office environment, the conceptual map is defined using taxonomies of room types and the objects found therein, in the form of is-a and has-a relations. Beyond considering just the objects in the space, [312] implemented a constraint network solver using Prolog to encode the environment and, in particular, the relationships between different planes (walls, floor, ceiling) as being parallel or orthogonal. Using the knowledge processing system KnowRob [889], [894] presented KnowRob-Map, a semantic mapping system that links objects and spatial information with two sources of knowledge: encyclopedic knowledge, inspired by OpenCyc [895], comprising object classes and their inter-relation hierarchies; and common-sense knowledge, based on Open Mind Indoor Common Sense (OMICS) [896], comprising action-related knowledge about everyday objects. Riazuelo et al. [238] demonstrated the benefits of combining SLAM and knowledge-based reasoning through RoboEarth [897], a semantic mapping system. In the proposed system, the RoboEarth knowledge base stores "action recipes" (hardware- and environment-agnostic abstract descriptions of a task), sets of object models, and robot models in the Semantic Robot Description Language (SRDL) [898]. For a requested action recipe (e.g. exploration), an execution plan tailored to the robot is generated on the cloud based on the robot's capabilities. For plan execution, the robot downloads a set of object models expected in its environment for building a semantic map, which can be uploaded to RoboEarth for future use by other robots. Using the proposed system, the authors demonstrated two tasks: semantic mapping of a new environment, and novel object search based on semantic reasoning. Addressing the gap between the use of a pre-defined knowledge database and the ability to gain new knowledge, [899] presented the concept of a "multiversal" semantic map based on probabilistic symbol grounding. This approach enables reasoning through multiple ontological interpretations of a robot's workspace.
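A minimal sketch of the flavour of reasoning such conceptual layers support, here inferring a room category from detected object classes using hand-written is-a and has-a relations; the tiny taxonomy below is invented for illustration and is far simpler than the OWL-DL ontologies used in [893]:

```python
# Toy "TBox": an is-a taxonomy of object classes, plus has-a relations stating
# which object classes are typically found in which room categories.
IS_A = {"mug": "kitchenware", "kettle": "kitchenware",
        "monitor": "office_equipment", "keyboard": "office_equipment"}
HAS_A = {"kitchen": {"kitchenware"}, "office": {"office_equipment"}}

def infer_room_category(detected_objects):
    """Score each room category by how many detected objects belong
    (via is-a) to an object class that the room typically has."""
    scores = {room: 0 for room in HAS_A}
    for obj in detected_objects:
        parent = IS_A.get(obj)
        for room, expected in HAS_A.items():
            if parent in expected:
                scores[room] += 1
    return max(scores, key=scores.get), scores

# Toy "ABox": object instances observed by the robot in the current room.
category, scores = infer_room_category(["mug", "kettle", "monitor"])
print(category, scores)  # kitchen {'kitchen': 2, 'office': 1}
```

Full-scale systems replace these hand-written dictionaries with description-logic reasoners and probabilistic grounding, but the underlying pattern of linking observed instances to conceptual knowledge is the same.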
[900] used additional linguistic cues for text-based image retrieval, showcasing the potential of knowledge databases to improve retrieval accuracy. Beyond entity-level knowledge graphs, [901] proposed HealthAidKB, a knowledge base with 71,000 task frames structured hierarchically and categorically, mainly focused on procedural knowledge for queries that require a step-by-step solution to the problem at hand. Using Probabilistic Action Templates (PATs) [902], [903], the authors of [904] developed a probabilistic effect prediction method based on semantic knowledge and physical simulation to predict a robot's action success. [238], [900] and a body of similar work [905]-[907] attempt to achieve generalization and solve the open-set problem that persists especially in visual perception systems. Using prior knowledge and learning from various new knowledge sources is likely to be key to further advances in the semantic reasoning capability of robots. 3) Online and the Cloud: Cloud robotics [908], or online [909] and networked [910], [911] robotics, refers mainly to the use of the internet or local communication networks to share resources and computation between multiple robots or applications. It is also closely related to the concept of the Internet of Things (IoT) [912]-[914]. Goldberg and Kehoe [915] reviewed five different ways in which cloud robotics and automation can be put to use: shared access to annotated sensor data, on-demand massively-parallel computational support, sharing of trial outcomes for collective learning, provision of open-source and open-access code, data and hardware designs, and on-demand human guidance for task support, all areas of relevance to semantics. Furthermore, [916] presented the concept of Robotics and Automation as a Service (RAaaS), which combines the ideas of Infrastructure as a Service (IaaS), Software as a Service (SaaS) and Platform as a Service (PaaS). Wan et al. [917] reviewed the development of cloud robotics and its potential value in different applications: SLAM, navigation and grasping. [918] designed a cloud robotics architecture comprising two sub-systems: a communication framework (machine-to-machine and machine-to-cloud) and an elastic computing architecture. The latter is defined using three different models, peer-based, proxy-based and clone-based, whose suitability to a particular robotic application depends on key characteristics like robustness of network connections, interoperability and flexibility for mobility within the network. The authors also highlight the key challenges for such a cloud system in terms of computation, communication, optimization and security. Under the constraints of limited communication, [919] presented a multi-robot SLAM system based on Rao-Blackwellized Particle Filters [920] that required only a small amount of data to be exchanged between robots. A number of recent articles have reviewed current trends in cloud robotics [921]-[927]. Saha et al. [922] highlighted the increasing range of applications for cloud robotics in various areas: manufacturing, social robotics, agriculture, medicine and disaster management. Visual SLAM approaches can benefit significantly from an online and cloud-based system. The Parallel Tracking and Mapping (PTAM) framework [928] developed by Klein and Murray has been one of the landmark works in the modern history of visual SLAM.
It demonstrated the advantages of alternation and parallel computation (of tracking and mapping), leading to a robust real-time system with accuracy comparable to offline reconstruction. A number of current state-of-the-art visual SLAM and 3D reconstruction systems use similar principles [209], [210], [216], [929]. However, the scale at which such a framework can be used is directly related to the compute resources and storage available on the embodied device, as well as the possibility of robot collaboration. Hence, online and cloud-based alternatives have been explored in the literature, aimed at greatly expanding the scope of using robots for various tasks. In [930], the authors developed a Cloud framework for Cooperative Tracking and Mapping (C²TAM) as an extension of PTAM [928]. In this framework, place recognition and non-linear map optimization are performed on the cloud, while tracking and re-localization are handled by the robot client. A cloud-based framework not only provides computational benefits but also the opportunity to share work tasks and knowledge, for example, in multi-robot exploration and mapping [931]-[934]. Some recent advances in the context of cloud computing and its use in robotics include reinforcement learning-based resource allocation [935], real-time object tracking over the internet using UAVs [936], [937], FastSLAM 2.0 [938] as a cloud service [939] and cloud-based real-time multi-robot SLAM [931]. One of the critical enablers of cloud robotics is network offloading, where a number of challenges remain to be solved [925]. In the context of robotic tasks based on semantic scene understanding, researchers have explored a number of ways to use a cloud-based paradigm. [940] demonstrated the use of semantic databases and sub-databases on CloudStack [941], comprising text-based sub-database selection (tableware versus electric appliances) and vision-based representations of various objects. A semantic map of the environment is then constructed using "belonging-annotation" positional relationships between objects. In an industrial automation application, [942] proposed a decentralized robot-cloud communication architecture for autonomous transportation within a factory, particularly highlighting the benefits of their approach under communication and hardware failure conditions. More recently, [943] presented an analytical optimization of the trade-off between communication and computational delay in a network of homogeneous sensors. The knowledge repositories and relevant datasets discussed previously also lend themselves to the use of online cloud services to improve knowledge sharing and enable learning from other, similar resources. The use of cloud services also means that much of the heavy lifting can be done on the cloud, including incremental learning [944], [945], large-scale retrieval [946], [947], intensive data association [948], [949] and non-convex optimization [950], [951]. Furthermore, cloud infrastructure brings robots within the IoT and enables effective collaborative task solving that involves multiple robots meaningfully interacting with each other and with other machines or humans. This approach may eventually lead to solving problems of literally city-scale complexity, like building and administering a smart city [952], [953]. 4) Human-robot interaction: One of the key aims of equipping robots with rich semantic representations of the world is to enhance and naturalize human-robot interaction.
4) Human-robot interaction: One of the key aims of equipping robots with rich semantic representations of the world is to make human-robot interaction more natural and effective. The literature points towards the use of semantics to build expressive robots [954] that can use effective modalities to coordinate their verbal and nonverbal interaction with humans. This includes human-like motion of robotic arms [955], semantic reasoning-based natural language understanding for service robots [461], human-robot dancing [956], [957], the expression of emotions (both facial and full-body [958]-[960]), and the ability to build and navigate human-centric semantic spatial representations [893], [961]. More recently, [962] proposed object-oriented semantic graphs, based on a graph convolutional network, to generate natural questions from an observed scene. Linked with inverse semantics [457], such systems enable a robot to ask the right questions in order to communicate better, ask for help and solve the task accordingly.

Robotic perception, world modelling and decision making have evolved beyond their early, limited focus on geometry and appearance. As we have discussed, modern approaches increasingly incorporate semantic information, which enables a higher-level and richer understanding of the world. In return, a variety of new robotic applications have already emerged, with more on the horizon. Still, many exciting directions for research remain.

One open question concerns the explicit versus implicit representation of semantics: should we as researchers and algorithm designers enforce semantic information being explicitly represented, or should we let an algorithm implicitly learn task-relevant semantic concepts? Do we understand which semantic concepts are relevant for a particular robotic task, and is there a direct mapping between robot-relevant and human-relevant semantic concepts? In an age where explainability and the "trustworthiness" of autonomous systems are becoming increasingly important, how could we understand and interpret robot-learnt semantic concepts that bear no direct correspondence to any semantic concepts we are familiar with as humans? While it would be "neat" if the concepts were the same across robots and humans, optimal robot performance may involve semantic learning and representations that are not directly interpretable by humans, and reconciling the two is likely to remain an ongoing focus of research.

At the moment, most semantic representations assume a flat structure. However, many semantic concepts can be naturally organised into a hierarchy (or even a more general, graph-like structure). This is perhaps most obvious for object class labels (object → indoor object → furniture → chair), but hierarchies may also be useful representations for affordances, room or place categories and other semantic domains. This hierarchical structure also extends spatially, where semantic concepts might be expressed at the scale of object parts, objects, functional ensembles of objects, rooms, buildings and city blocks. Hierarchical semantic knowledge is unlikely to be a "clean" representation; the distinction between levels can be quite fuzzy. Take, for example, a barista's coffee machine, which is both a single, complex object and an ensemble of objects. Understanding these different aspects of a semantic hierarchy can be important for robotic applications, especially when faced with imperfect perception or incomplete knowledge; a minimal sketch of such a label hierarchy follows.
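As an illustration of the kind of hierarchical label structure discussed above, the following sketch uses an invented, simplified taxonomy (rather than any particular ontology such as WordNet) to show how a robot might test whether a detected label satisfies a query posed at a coarser semantic level:

```python
# A toy semantic hierarchy: each label maps to its parent (hypernym).
# The taxonomy here is invented for illustration; a real system might derive
# it from an ontology such as WordNet or a domain-specific knowledge base.
PARENT = {
    "chair": "furniture",
    "sofa": "furniture",
    "coffee_machine": "appliance",
    "furniture": "indoor_object",
    "appliance": "indoor_object",
    "indoor_object": "object",
}

def ancestors(label: str):
    """Yield the label and all of its ancestors up to the root."""
    while label is not None:
        yield label
        label = PARENT.get(label)

def satisfies(detected: str, query: str) -> bool:
    """True if a detection labelled `detected` counts as an instance of `query`,
    e.g. a detected 'chair' satisfies a request to find 'furniture'."""
    return query in ancestors(detected)

print(satisfies("chair", "furniture"))        # True
print(satisfies("chair", "appliance"))        # False
print(satisfies("coffee_machine", "object"))  # True
```

Querying at a coarser level in this way is one simple way a robot can act sensibly under imperfect perception: if the detector is only confident that something is "furniture", downstream reasoning can still proceed at that level of the hierarchy.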
This discussion soon leads to more principled questions about the very nature of semantic concepts and how they can be represented so as to be accessible to robotic algorithms. Unless one follows the promise of pure, all-encompassing end-to-end learning, prior knowledge about semantic concepts needs to be modeled and represented in some form. This has already been investigated extensively in classical AI research on knowledge representation and reasoning. An interesting middle ground is to incorporate hand-crafted semantics as priors into learning-based systems that can then expand, continue to learn, or even re-learn semantics in a task-informed way. Given the wide range of potential robotic applications, with varying levels of complexity and operating requirements (especially around safety and reliability), a spectrum of approaches may be needed, informed by context: from pure end-to-end learning, to hybrid methods, to entirely hand-crafted systems.

As noted in the survey's coverage of datasets relating to semantics research, there can be significant disconnects between different research subfields that touch on semantics. To achieve sufficient progress in application-focused areas like autonomous vehicles, much of the work on semantics has focused on domain-specific implementations, yielding higher performance but at the possible cost of generality across robotics. It will be interesting to see to what extent the insights gained from targeted semantics research in an area like autonomous driving benefit the robotics field more broadly, just as it will be interesting to see whether sufficient progress can be made by providing robots (or, in this case, autonomous vehicles) with only a targeted, constrained understanding of the world around them.

Like many other topics, progress in semantics-related research suffers to some extent from a disconnect across disciplinary boundaries, especially between robotics and computer vision. A related issue is the dominance of dataset-based evaluation, especially for tasks like semantic segmentation: papers at the leading venues, predominantly computer vision conferences like CVPR, are dominated by research that achieves new levels of performance on benchmark datasets. Although there is a promising trend towards evaluation in high-fidelity simulation environments, performance on datasets and in simulation is only one step towards safe and reliable deployment on robotic platforms, as application areas such as autonomous on-road vehicles have revealed. Bridging the divide between performance on datasets and performance on robots in closed-loop scenarios is likely to remain a major challenge for the foreseeable future, but it also presents a unique opportunity for what we might coin "active" semantics, where semantic learning and understanding are enhanced by active control of a robotic platform and its sensing modalities.

This survey has summarized the state of play in semantics research: the fundamentals, and the increasing integration of semantics into systems addressing key robotic capabilities such as mapping and interaction with the world. While much progress has been made in the quest to imbue robots with a richer and more nuanced understanding of the world around them, much remains to be done. Future research will benefit from developments in the technologies and datasets underpinning semantics-based research, as well as from new conceptual approaches, including those discussed here. The use of semantics in robotics will also continue to be informed by humans.
Human communication makes rich use of semantic concepts of various kinds; we formulate tasks, give instructions and feedback, and communicate expectations based on the meaning of objects, affordances, or the wider spatial and temporal context. Incorporating semantics into robotics, especially by bridging classical approaches with modern, learning-based approaches could have great impact on future robotic applications, especially those where robots work closely with, for or around humans. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age Simultaneous localization and mapping: part i Simultaneous localization and mapping (slam): Part ii Probabilistic robotics Simultaneous localization and mapping A review of recent developments in simultaneous localization and mapping Semantic mapping for mobile robotics tasks: A survey FutureMapping: The Computational Structure of Spatial AI Systems Futuremapping 2: Gaussian belief propagation for spatial ai A survey of image semantics-based visual simultaneous localization and mapping: Application-oriented solutions to autonomous navigation of mobile robots Deep learning in neural networks: An overview Deep learning for generic object detection: A survey A review of algorithms for filtering the 3d point cloud A comprehensive review of 3d point cloud descriptors Deep learning on point clouds and its application: A survey Deep learning for 3d point clouds: A survey Deep learning advances in computer vision with 3d data: A survey A survey on 3d object detection methods for autonomous driving applications Recent advances in 3d object detection in the era of deep neural networks: A survey A review of point clouds segmentation and classification algorithms A review of deep learning-based semantic segmentation for point cloud Survey on semantic segmentation using deep learning techniques A review of point cloud semantic segmentation Deep learning for semantic segmentation of 3d point cloud A survey on deep learning techniques for image and video semantic segmentation A review of semantic segmentation using deep neural networks A survey on deep learning-based fine-grained object classification and semantic segmentation Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges A survey of semantic segmentation Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation A survey on semantic-based methods for the understanding of human movements Extracting semantic information from visual data: A survey A survey of knowledge representation in service robotics Towards a comprehensive survey of the semantic gap in visual image retrieval A survey of content-based image retrieval with high-level semantics Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval Reducing semantic gap in video retrieval with fusion: A survey Ontology based semantic search: an introduction and a survey of current approaches A review on automatic image annotation techniques Ontology alignment: bridging the semantic gap Mind the gap: another look at the problem of the semantic gap in image retrieval Review and research on" semantic gap" problem in the content based image retrieval Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review Affordances in robotic tasks-a survey Robots that use language PointNet: Deep learning on point sets for 3D classification 
and segmentation Deep projective 3d semantic segmentation Point cloud labeling using 3d convolutional neural network SnapNet: 3D point cloud semantic labeling with 2D deep segmentation networks 3DMV: Joint 3D-multi-view prediction for 3D semantic scene segmentation Towards total scene understanding: Classification, annotation and segmentation in an automatic framework Object Bank: A high-level image representation for scene classification & semantic feature sparsification Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation Image classification with the fisher vector: Theory and practice Image Classification Model Using Visual Bag of Semantic Words Imagenet: A large-scale hierarchical image database Imagenet classification with deep convolutional neural networks Learning Deep Features for Scene Recognition using Places Database Places: A 10 Million Image Database for Scene Recognition Object detectors emerge in deep scene cnns Learning Deep Features for Discriminative Localization A deep multi-modal fusion approach for semantic place prediction in social media Aggregating rich deep semantic features for fine-grained place classification Modality and component aware feature fusion for rgb-d scene classification Cross modal distillation for supervision transfer Translate-to-recognize networks for rgb-d scene recognition Learning Visual Semantic Relationships for Efficient Visual Retrieval Sketch-based image retrieval with deep visual semantic descriptor Latent Semantic Minimal Hashing for Image Retrieval Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval Saliency-based multi-feature modeling for semantic image retrieval Distinctive image features from scale-invariant keypoints Multiresolution gray-scale and rotation invariant texture classification with local binary patterns Global contrast based salient region detection Semantic Image Retrieval via Active Grounding of Visual Situations Hierarchy-based image embeddings for semantic image retrieval WordNet: An electronic lexical database Sfnet: Learning object-aware semantic correspondence Visual semantic reasoning for image-text matching Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization Bi-level semantic representation analysis for multimedia event detection Object recognition from local scale-invariant features Surf: Speeded up robust features Proceedings of the IEEE International Conference on Computer Vision Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition A fast, modular scene understanding system using context-aware object detection Complexer-yolo: Real-time 3d object detection and tracking on semantic point clouds Semantics-aware visual object tracking Single-Shot Object Detection with Enriched Semantics Visual-Semantic Graph Reasoning for Pedestrian Attribute Recognition Autolabeling 3d objects with differentiable rendering of sdf shape priors Colour image segmentation: a state-of-the-art survey Watersheds in digital spaces: an efficient algorithm based on immersion simulations Mean shift: A robust approach toward feature space analysis Inducing semantic segmentation from an example Semantic image segmentation and object labeling Pylon model for semantic segmentation Class segmentation and object localization with superpixel neighborhoods Textonboost: Joint appearance, shape and 
context modeling for multi-class object recognition and segmentation Fast approximate energy minimization via graph cuts Fully Convolutional Networks for Semantic Segmentation Multi-scale context aggregation by dilated convolutions SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation U-net: Convolutional networks for biomedical image segmentation Enet: A deep neural network architecture for real-time semantic segmentation DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs Gated-scnn: Gated shape cnns for semantic segmentation Semantic Foggy Scene Understanding with Synthetic Data Semantic segmentation with unsupervised domain adaptation under varying weather conditions for autonomous vehicles Feature Space Optimization for Semantic Video Segmentation The Cityscapes Dataset for Semantic Urban Scene Understanding Temporal information integration for video semantic segmentation STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling Gradient and log-based active learning for semantic segmentation of crop and weed for agricultural robots A benchmark dataset and evaluation methodology for video object segmentation One-shot video object segmentation Video salient object detection via fully convolutional networks Saliency-aware video object segmentation Learning video object segmentation from static images Learning features by watching objects move A probabilistic framework for real-time 3D segmentation using spatial, temporal, and semantic cues Mask R-CNN Deep watershed transform for instance segmentation Segmentation-aware convolutional networks using local attention masks Associative embedding: End-to-end learning for joint detection and grouping Mask scoring r-cnn Panoptic segmentation Panoptic feature pyramid networks Upsnet: A unified panoptic segmentation network Fusing predictions for end-to-end panoptic segmentation Real-time panoptic segmentation from dense detections Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation Axial-deeplab: Stand-alone axial-attention for panoptic segmentation Indoor Segmentation and Support Inference from RGBD Images Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation Multi-modal auto-encoders as joint estimators for robotics scene understanding Multimodal deep learning Deep multispectral semantic scene understanding of forested environments using multimodal fusion Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach Identifying unknown instances for autonomous driving Exploiting semantic information and deep matching for optical flow SegStereo: Exploiting Semantic Information for Disparity Estimation Real-time semantic stereo matching Semantic Stereo Matching with Pyramid Cost Volumes Semantically-guided representation learning for self-supervised monocular depth Fast scene understanding for autonomous driving UberNet: Training a universal convolutional neural network for Low-, Mid-, and high-level vision using diverse datasets and limited memory Real-time joint semantic segmentation and depth estimation using asymmetric annotations SemanticFusion: Dense 3D semantic mapping with convolutional neural networks Predicting Polarization beyond Semantics for Wearable Robotics The stixel world -A compact medium level representation of the 3d-world Semantic Stixels: Depth is not enough Content-based image 
retrieval at the end of the early years A user interface for emergent semantics in image databases Steerable filters for early vision, image analysis, and wavelet decomposition Supervised learning of semantic classes for image annotation and retrieval Slic superpixels compared to state-of-the-art superpixel methods Local grayvalue invariants for image retrieval Indexing via color histograms Video google: A text retrieval approach to object matching in videos A comparison of loop closing techniques in monocular slam Holistic scene understanding for 3D object detection with RGBD cameras Image retrieval using scene graphs Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval Scene graph generation by iterative message passing Scene Graph Generation from Objects, Phrases and Region Captions Neural Motifs: Scene Graph Parsing with Global Context Mapping images to scene graphs with permutation-invariant structured prediction Image Generation from Scene Graphs Specifying object attributes and relations in interactive scene generation Scene graph prediction with limited labels Scene graph generation with external knowledge and image reconstruction Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly Attribute-based classification for zero-shot visual object categorization Zero-shot learning with semantic output codes Zero-shot learning via semantic similarity embedding Zero-Shot Visual Recognition Using Semantics-Preserving Adversarial Embedding Networks Generalized zero-shot learning via synthesized examples Multi-modal cycle-consistent generalized zero-shot learning Adaptive confidence smoothing for generalized zero-shot learning Generalised zero-shot learning with domain classification in a joint semantic and visual space Zero-shot learning via category-specific visual-semantic mapping and label refinement Leveraging seen and unseen semantic relationships for generative zero-shot learning The cognitive map in humans: spatial navigation and beyond A robot that walks; emergent behaviors from a carefully evolved network Vehicles: Experiments in synthetic psychology Learning to coordinate behaviors Efficient large-scale 3d mobile mapping and surface reconstruction of an underground mine Visual localization within lidar maps for automated urban driving Navigation and mapping in large unstructured environments LaneLoc: Lane marking based localization using highly accurate maps What localizes beneath: A metric multisensor localization and mapping system for autonomous underground mining vehicles Discrete residual flow for probabilistic pedestrian behavior prediction Deepsignals: Predicting intent of drivers through visual signals Dagmapper: Learning to map by discovering lane topology Kimera: an open-source library for real-time metric-semantic localization and mapping SLAM++: Simultaneous localisation and mapping at the level of objects Meaningful maps with object-oriented semantic mapping X-View: Graph-Based Semantic Multiview Localization LoST? Appearance-Invariant Place Recognition for Opposite Viewpoints using Visual Semantics Large-scale semantic mapping and reasoning with heterogeneous modalities 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans Real-time simultaneous localisation and mapping with a single camera Monoslam: Real-time single camera slam Parallel tracking and mapping on a camera phone A constant-time efficient stereo slam system Visual SLAM: why filter?" 
Image and Vision Computing Orb-slam: a versatile and accurate monocular slam system Dtam: Dense tracking and mapping in real-time Semi-dense visual odometry for a monocular camera SVO: fast semi-direct monocular visual odometry Direct sparse odometry Structured-light 3d surface imaging: a tutorial RGB-D mapping: Using depth cameras for dense 3d modeling of indoor environments Kinectfusion: real-time dense surface mapping and tracking An evaluation of the RGB-D SLAM system Very high frame rate volumetric integration of depth images on mobile devices Real-time large-scale dense 3d reconstruction with loop closure Visual map making for a mobile robot Topological mapping for mobile robots using a combination of sonar and vision sensing An integrated navigation and motion control system for autonomous multisensory mobile robots Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties Experiments in autonomous navigation Topological simultaneous localization and mapping (slam): toward exact localization without explicit localization Visual odometry and map correlation Slam-loop closing with visually salient features Vision-based global localization and mapping for mobile robots Fab-map: Probabilistic localization and mapping in the space of appearance Vision-based topological mapping and localization methods: A survey Visual Place Recognition: A Survey A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations Learning metric-topological maps for indoor mobile robot navigation A global topological map formed by local metric maps An atlas framework for scalable mapping Hybrid simultaneous localization and map building: a natural integration of topological and metric Closing loops without places RoboEarth Semantic Mapping: A Cloud Enabled Knowledge-Based Approach Scene flow propagation for semantic mapping and object discovery in dynamic street scenes A hierarchical framework for collaborative probabilistic semantic mapping Semantic place classification of indoor environments with mobile robots using boosting Exploiting structural properties of buildings towards general semantic mapping systems A desicion-theoretic generalization of on-line learning and an application to boosting Speeding-up multi-robot exploration by considering semantic place information Learning semantic place labels from occupancy grids using CNNs Place categorization and semantic mapping on a mobile robot Understand scene categories by objects: A semantic regularized scene classifier using Convolutional Neural Networks Dynamic Bayesian network for semantic place classification in mobile robotics Applying probabilistic Mixture Models to semantic place classification in mobile robotics Learning Deep NBNN Representations for Robust Place Categorization Robust Place Categorization with Deep Domain Generalization From pixels to buildings: End-to-end probabilistic deep networks for large-scale semantic mapping PLISS: Labeling places using online changepoint detection Visual place categorization in maps Histogram of Oriented Uniform Patterns for robust place recognition and categorization Model learning and real-time tracking using multi-resolution surfel maps 3D semantic map-based shared control for smart wheelchair Building semantic object maps from sparse and noisy 3D data Bayesian space conceptualization and place classification for semantic maps in mobile robotics Dense reconstruction using 3d object shape priors When 2.5D is not enough: 
Simultaneous reconstruction, segmentation and recognition on dense SLAM Visual-inertial-semantic scene representation for 3D object detection You only look once: Unified, real-time object detection A conditional random field model for place and object classification Geometrically consistent plane extraction for dense indoor 3D maps segmentation Fusion++: Volumetric object-level SLAM Toward open set recognition Open set face recognition using transduction Evaluation methods in face recognition Probability models for open set recognition Towards open world recognition Towards open set deep networks Incremental Object Database: Building 3D Models from Multiple Partial Observations Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery Real-time and scalable incremental segmentation on dense SLAM Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction Efficient object-oriented semantic mapping with object detector Real-time progressive 3D semantic segmentation for indoor scenes Co-fusion: Real-time segmentation, tracking and fusion of multiple objects MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects A unified framework for piecewise semantic reconstruction in dynamic scenes via exploiting superpixel relations Consistent Cuboid Detection for Semantic Mapping QuadricSLAM: Dual quadrics from object detections as landmarks in objectoriented SLAM Structure aware slam using quadrics and planes Real-time monocular object-model aware sparse slam Novelty detection and 3D shape retrieval using superquadrics and multiscale sampling for autonomous mobile robots Validation of whole-body loco-manipulation affordances for pushability and liftability A scene graph based shared 3D world model for robotic applications Categorizing object-action relations from semantic scene graphs Graph-based visual semantic perception for humanoid robots 3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes PHIGS: A Standard, Dynamic, Interactive Graphics Interface Towards a Domain Specific Language for a Scene Graph based Robotic World Model COSMO: Contextualized scene modeling with Boltzmann Machines Spatio-temporal graph for video captioning with knowledge distillation Growing semantically meaningful models for visual slam A dynamic programming approach to reconstructing building interiors Manhattan scene understanding using monocular, stereo, and 3d features Joint semantic segmentation and 3D reconstruction from monocular video Dense 3D semantic mapping of indoor scenes from RGB-D images Dense real-time mapping of object-class semantics from RGB-D video Semi-dense 3d semantic mapping from monocular slam ElasticFusion: Dense SLAM without a pose graph DA-RNN: Semantic mapping with Data Associated Recurrent Neural Networks Semantic 3D occupancy mapping through efficient high order CRFs FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture Learning rich features from rgb-d images for object detection and segmentation Multi-view deep learning for consistent semantic mapping with RGB-D cameras Lsd-slam: Large-scale direct monocular slam Dense monocular reconstruction using surface normals Towards semantic maps for mobile robots Using context to create semantic 3D models of indoor environments Dense 3d visual mapping via semantic simplification Microsoft kinect sensor and its effect Fast 
semantic segmentation of 3D point clouds using a dense CRF with learned parameters Voxnet: A 3d convolutional neural network for real-time object recognition Learning where to classify in multi-view semantic segmentation 3D semantic parsing of large-scale indoor spaces Capturing and aligning multiple 3-dimensional scenes Unsupervised feature learning for classification of outdoor 3d scans 3d shapenets: A deep representation for volumetric shapes Convolutional-recursive deep learning for 3d object classification An occlusion-aware feature for range images Sliding shapes for 3d object detection in rgb-d images 3d object recognition using convolutional neural networks with transfer learning between input channels Deep learning for detecting robotic grasps Vehicle detection from 3d lidar using fully convolutional network Fast semantic segmentation of rgb-d scenes with gpu-accelerated deep neural networks A fast learning algorithm for deep belief nets 3d u-net: learning dense volumetric segmentation from sparse annotation Gradient-based learning applied to document recognition Volumetric and multi-view cnns for object classification on 3d data Segcloud: Semantic segmentation of 3d point clouds Semantic scene completion from a single depth image Shape completion using 3D-encoder-predictor CNNs and shape synthesis ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans OctNet: Learning Deep 3D Representations at High Resolutions Deep Learning for 3D Data Shape Classification Geometric modeling using octree encoding Sparse 3d convolutional neural networks Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks 3d semantic segmentation with submanifold sparse convolutional networks Spatially-sparse convolutional neural networks Voting for voting in online point cloud object detection Deep sparse rectifier neural networks Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks Fpnn: Field probing neural networks for 3d data Multi-view convolutional neural networks for 3d shape recognition Tangent convolutions for dense prediction in 3d Escape from cells: Deep kd-networks for the recognition of 3d point cloud models PointNet++: Deep hierarchical feature learning on point sets in a metric space Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds Pointwise convolutional neural networks Pointconv: Deep convolutional networks on 3d point clouds 3d recurrent neural networks with context fusion for point cloud semantic segmentation Recurrent slice networks for 3d segmentation of point clouds Rgcnn: Regularized graph cnn for point cloud segmentation Dynamic graph cnn for learning on point clouds Foldingnet: Point cloud auto-encoder via deep grid deformation Adversarial autoencoders for compact representations of 3d point clouds Dynamic edge-conditioned filters in convolutional neural networks on graphs 3D Graph Neural Networks for RGBD Semantic Segmentation 3dcontextnet: Kd tree guided hierarchical learning of point clouds using local and global contextual cues Large-scale point cloud semantic segmentation with superpoint graphs Multidimensional binary search trees used for associative searching So-net: Self-organizing network for point cloud analysis The self-organizing map Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras Structure-from-motion revisited Deep non-rigid structure from motion Sfm-net: Learning of structure and motion from video 
Two-stream convolutional networks for action recognition in videos Learning realistic human actions from movies Action MACH: a spatio-temporal maximum average correlation height filter for action recognition UCF101: A dataset of 101 human actions classes from videos in the wild Quo vadis, action recognition? a new model and the kinetics dataset Finding action tubes ActivityNet: A large-scale video benchmark for human activity understanding Scaling egocentric vision: The EPIC-Kitchens dataset Action recognition by dense trajectories Actom sequence models for efficient action detection Activity representation with motion hierarchies Rank pooling for action recognition Learning end-to-end video classification with rank-pooling Discriminative hierarchical rank pooling for activity recognition Dynamic image networks for action recognition Automatic annotation of everyday movements An approach to pose-based action recognition 2d/3d pose estimation and action recognition using multitask deep learning Semantic-level understanding of human actions and interactions using event hierarchy Towards zero-shot learning for human activity recognition using semantic attribute sequence model NuActiv: Recognizing unseen new activities using semantic attributebased learning Automatic segmentation and recognition of human activities from observation based on semantic reasoning Bootstrapping humanoid robot skills by extracting semantic representations of human-like activities from virtual reality Robust semantic representations for inferring human co-manipulation activities even with different demonstration styles Enhancing human action recognition through spatio-temporal feature learning and semantic rules Transferring skills to humanoid robots by extracting semantic representations from observations of human activities Added value of gaze-exploiting semantic representation to allow robots inferring human behaviors A Semantic-Based Method for Teaching Industrial Robots New Tasks 3D Semantic Trajectory Reconstruction from 3D Pixel Continuum Kinematic Structure Correspondences via Hypergraph Matching Learning Kinematic Structure Correspondences Using Multi-Order Similarities Computer Vision for Lifelogging: Characterizing Everyday Activities Based on Visual Semantics The ecological approach to visual perception: classic edition Learning about objects through action-initial steps towards artificial cognition To afford or not to afford: A new formalization of affordances toward affordance-based robot control From primitive behaviors to goal-directed behavior using affordances Object-action complexes: Grounded abstractions of sensory-motor processes Learning the semantics of object-action relations by observation Toward a library of manipulation actions based on semantic object-action relations Model-free incremental learning of the semantics of manipulation actions Robot learning manipulation action plans by" watching" unconstrained videos from the world wide web Templates for pre-grasp sliding interactions Afrob: The affordance network ontology for robots Grasp quality measures: review and performance Robotic grasping and contact: A review Data-driven grasp synthesis-a survey Learning grasp affordance densities Refining grasp affordance models by experience Ew jr., a. 
rodriguez, and jx xiao, multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge Integrated grasp planning and visual object localization for a humanoid robot with five-fingered hands Deep object pose estimation for semantic robotic grasping of household objects Rigid 3d geometry matching for grasping of known objects in cluttered scenes The moped framework: Object recognition and pose estimation for manipulation Semantic grasping: Planning robotic grasps functionally suitable for an object manipulation task Transferring functional grasps through contact warping and local replanning Generalizing grasps across partly similar objects Learning a dictionary of prototypical grasp-predicting parts from grasping experience Localizing handle-like grasp affordances in 3d point clouds Learning to detect visual grasp affordance kPAM: Keypoint affordances for category-level robotic manipulation Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach Learning robust, real-time, reactive robotic grasping Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics Learning ambidextrous robot grasping policies Grasp Pose Detection in Point Clouds Trends and challenges in robot manipulation Cartman: The low-cost cartesian manipulator that won the amazon robotics challenge Robotic pick-andplace of novel objects in clutter with multi-affordance grasping and cross-domain image matching Fast object learning and dual-arm coordination for cluttered stowing, picking, and packing Object discovery and grasp detection with a shared convolutional neural network End-to-end learning of semantic grasping Cortical dynamics of sensorimotor integration during grasp planning Affordance detection of tool parts from geometric features Detecting object affordances with convolutional neural networks Affordance detection for task-specific grasping using deep learning Multivariate discretization for bayesian network structure learning in robot grasping Task-based robot grasp planning using probabilistic inference Learning human priors for task-constrained grasping Towards robust grasps: Using the environment semantics for robotic object affordances Task-oriented grasping with semantic and geometric scene understanding Autonomous semantic mapping for robots performing everyday manipulation tasks in kitchen environments Context-aware robot navigation using interactively built semantic maps Inferring robot goals from violations of semantic knowledge What can i do with this tool? 
self-supervised learning of tool affordances from their 3-d geometry Learning task-oriented grasping for tool manipulation from simulated self-supervision Asking for Help Using Inverse Semantics Understanding natural language commands for robotic navigation and mobile manipulation Temporal Spatial Inverse Semantics for Robots Communicating with Humans Robot program construction via grounded natural language semantics & simulation robotics track Semantic reasoning in service robots using expert systems Incremental semantically grounded learning from demonstration Unsupervised perceptual rewards for imitation learning Efficient model learning from joint-action demonstrations for human-robot collaborative tasks Learning spatial-semantic representations from natural language descriptions and scene classifications Visual semantic navigation using scene priors Exploration with active loop-closing for fastslam An application of kullback-leibler divergence to active slam and exploration with particle filters Active slam and exploration with particle filters using kullback-leibler divergence Artificial cognition for social human-robot interaction: An implementation A human aware mobile robot motion planner Simulation-based behavior planning to prevent congestion of pedestrians around a robot Lingodroids: Studies in spatial cognition and language Computational modelling of embodied visual perspective-taking Markerless Perspective Taking for Humanoid Robots in Unconstrained Environments Using perspective taking to learn from ambiguous demonstrations Vision-andlanguage navigation: Interpreting visually-grounded navigation instructions in real environments Speaker-follower models for vision-and-language navigation Learning to navigate unseen environments: Back translation with environmental dropout Self-monitoring navigation agent via auxiliary progress estimation The regretful agent: Heuristic-aided navigation through progress estimation Tactical rewind: Self-correction via backtracking in vision-and-language navigation Reverie: Remote embodied visual referring expression in real indoor environments Simultaneous place and object recognition using collaborative context information Simultaneous place and object recognition with mobile robot using pose encoded contextual information Building the gist of a scene: The role of global image features in recognition Localization from semantic observations via the matrix permanent Object detection with discriminatively trained part-based models Semantic signatures for urban visual localization A Coarse to Fine Indoor Visual Localization Method Using Environmental Semantic Information Learning of Holism-Landmark graph embedding for place recognition in Long-Term autonomy GIS-assisted object detection and geospatial localization Semantic cross-view matching Visual map matching and localization using a global feature map A new approach to linear filtering and prediction problems Learning to align semantic segmentation and 2.5 d maps for geolocalization Semantic Image Based Geolocation Given a Map Development of positioning technique using omni-directional ir camera and aerial survey data Dynamic programming and skyline extraction in catadioptric infrared images Sky segmentation with ultraviolet images can be used for navigation Skyline-based localisation for aggressively manoeuvring robots using uv sensors and spherical harmonics Skyline2gps: Localization in urban canyons using omni-skylines Image based geo-localization in the alps Routed roads: Probabilistic 
vision-based place recognition for changing conditions, split streets and varied viewpoints Geolocating static cameras VLASE: Vehicle Localization by Aggregating Semantic Edges Aggregating local image descriptors into compact codes Casenet: Deep category-aware semantic edge detection Semantically Guided Geo-location and Modeling in Urban Environments Learning a classification model for segmentation Decomposing a scene into geometric and semantically consistent regions Addressing challenging place recognition tasks using generative adversarial networks Adversarial training for adverse conditions: Robust metric localisation using appearance transfer Night-to-day image translation for retrieval-based localization Experience-based navigation for long-term localisation Scalable place recognition under appearance change for autonomous driving The gist of maps-summarizing experience for lifelong localization Google street view: Capturing the world at street level A2d2: Audi autonomous driving dataset Lyft level 5 av dataset 2019 Mapillary street-level sequences: A dataset for lifelong place recognition The global network of outdoor webcams: properties and applications Robust visual place recognition under simultaneous variations in viewpoint and appearance Long-Term 3D Localization and Pose from Semantic Labellings Improving condition-and environment-invariant place recognition with semantic place categorization Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights Semantics-aware visual localization under challenging perceptual conditions Cross-view image geolocalization Semantic-geometric visual place recognition: a new perspective for reconciling opposing views Netvlad: Cnn architecture for weakly supervised place recognition Visual Semantic Navigation Based on Deep Learning for Indoor Mobile Robots Semantic localization in the PCL library Edge boxes: Locating object proposals from edges Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free A robust semi-semantic approach for visual localization in urban environment Semantically-aware attentive neural embeddings for imagebased visual localization On the performance of convnet features for place recognition Don't look back: Robustifying place categorization for viewpoint-and condition-invariant place recognition Distinctive 3d surface entropy features for place recognition Place recognition based on matching of planar surfaces and line segments Point cloud descriptors for place recognition using sparse visual information Look no deeper: Recognizing places from opposing viewpoints under varying scene appearance using single-view depth estimation Real-time wide-baseline place recognition using depth completion Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation Lcd-line clustering and description for place recognition 24/7 place recognition by view synthesis Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization The mapillary vistas dataset for semantic understanding of street scenes The limits and potentials of deep learning for robotics 3d convolutional neural networks for landing zone detection from lidar Rt3d: Real-time 3-d vehicle detection in lidar point cloud for autonomous driving R-fcn: Object detection via region-based fully convolutional networks Noise-resistant deep learning for object classification in threedimensional point clouds using a point pair 
descriptor Surflet-pair-relation histograms: a statistical 3d-shape representation for rapid classification Fast point feature histograms (fpfh) for 3d registration Model globally, match locally: Efficient and robust 3d object recognition Ensemble of shape functions for 3d object classification Point pair features based object detection and pose estimation revisited Robust monocular slam in dynamic environments Slam with objects using a nonparametric pose graph Perspective-2-ellipsoid: Bridging the gap between object detections and 6-dof camera pose Towards self-supervised semantic representation with a viewpoint-dependent observation model Robust object-based slam for high-speed autonomous navigation Rgb-d object slam using quadrics for indoor environments Diagnosing error in object detectors What could move? finding cars, pedestrians and bicyclists in 3d laser data Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving Refinedmpl: Refined monocular pseudolidar for 3d object detection in autonomous driving End-to-end pseudo-lidar for image-based 3d object detection A practical bayesian framework for backpropagation networks Dropout as a bayesian approximation: Representing model uncertainty in deep learning Simple and scalable predictive uncertainty estimation using deep ensembles A simple baseline for bayesian uncertainty in deep learning Dropout sampling for robust object detection in open-set conditions Probabilistic object detection: Definition and evaluation Inferring spatial uncertainty in object detection Bayesod: A bayesian approach for uncertainty estimation in deep object detectors Efficient uncertainty estimation for semantic segmentation in videos Uncertainty measures and prediction quality rating for the semantic segmentation of nested multi resolution street scene images Performance monitoring of object detection during deployment The Fishyscapes benchmark: Measuring blind spots in semantic segmentation Multi-modal semantic place classification Hierarchical Multi-Modal Place Categorization Categorization of indoor places using the Kinect sensor Local N-ary Patterns: A local multi-modal descriptor for place categorization Gray scale and rotation invariant texture classification with local binary patterns Deep multimodal learning: A survey on recent advances and trends Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges 3d object proposals for accurate object class detection Multi-view 3d object detection network for autonomous driving Joint 3d proposal generation and object detection from view aggregation Pointpainting: Sequential fusion for 3d object detection Deep continuous fusion for multi-sensor 3d object detection Fusing bird's eye view lidar point cloud and front view camera image for 3d object detection Frustum pointnets for 3d object detection from rgb-d data Ipod: Intensive point-based object detector for point cloud Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal Placer: Semantic place labels from diary data Placer++: Semantic place labels beyond the visit The discovery of personally semantic places based on trajectory data mining Sensor fusion for semantic place labeling Semantic place classification and mapping for autonomous agricultural robots Roomba i series. iRobot Robomow Friendly House Auv sentry. 
Woods Hole Oceanographic Institution The design and 200 day per year operation of the autonomous underwater vehicle sentry Mars curiosity rover. Nasa Science Mars Exploration Program Mars Rover Curiosity: An Inside Account from Curiosity's Chief Engineer An intelligent robotic hospital bed for safe transportation of critical neurosurgery patients along crowded hospital corridors Social robots for older adults: Framework of activities for aging in place with robots Three-dimensional imaging improves surgical performance for both novice and experienced operators using the da vinci robot system The when, where, and how: An adaptive robotic info-terminal for care home residents Sam, an assistive robotic device dedicated to helping persons with quadriplegia: Usability study Suitable Technologies, Inc Long-term assessment of a service robot in a hotel environment Product counting using images with application to robot-based retail stock assessment A communication robot in a shopping mall Toomas: interactive shopping guide robots in everyday use-final implementation and experiences from long-term field trials A tea-serving robot for office environment Robotics in ecommerce logistics Starship Technologies Robots -your guide to the world of robots Are we ready for autonomous driving? the kitti vision benchmark suite The unmanned aerial vehicle benchmark: Object detection and tracking Vision meets drones: A challenge Uavid: A semantic segmentation dataset for uav imagery Results of the isprs benchmark on urban object detection and 3d building reconstruction Processing of extremely high-resolution lidar and rgb data: outcome of the 2015 ieee grss data fusion contest-part a: 2-d contest Hyperspectral and lidar data fusion: Outcome of the 2013 grss data fusion contest Deepglobe 2018: A challenge to parse the earth through satellite images Ensemble knowledge transfer for semantic segmentation Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images Generative adversarial nets Automatic detection, classification and tracking of objects in the ocean surface from uavs using a thermal camera Bi-heterogeneous convolutional neural network for uav-based dynamic scene classification Nature conservation drones for automatic localization and counting of animals Detection of cattle using drones and convolutional neural networks Spot poachers in action: Augmenting conservation drones with automatic detection in near real time Construction site management and control technology based on uav visual surveillance Wing Aviation LLC New opportunities for forest remote sensing through ultra-high-density drone lidar Cad2rl: Real single-image flight without a single real image Dronet: Learning to fly by driving Learning to fly by crashing Learning modular neural network policies for multi-task and multi-robot transfer Policy transfer via modularity and reward guiding Driving policy transfer via modularity and abstraction Deep drone racing: From simulation to reality with domain randomization Beauty and the beast: Optimal methods meet learning for drone racing A 64-mw dnn-based visual navigation engine for autonomous nano-drones A real-time game theoretic planner for autonomous two-player drone racing Towards simulating semantic onboard uav navigation A discourse on winning and losing [briefing slides A discourse on winning and losing Towards semantic context-aware drones for aerial scenes understanding Semantically enhanced uavs to increase the aerial scene 