title: Sustainable Verbal and Non-verbal Human-Robot Interaction Through Cloud Services
authors: Grassi, Lucrezia; Recchiuto, Carmine Tommaso; Sgorbissa, Antonio
date: 2022-03-04 (arXiv:2203.02606v1 [cs.RO])

This article presents the design and the implementation of CAIR: a cloud system for knowledge-based autonomous interaction devised for Social Robots and other conversational agents. The system is particularly convenient for low-cost robots and devices. Developers are provided with a sustainable solution to manage verbal and non-verbal interaction through a network connection, with about 3,000 topics of conversation ready for "chit-chatting" and a library of pre-cooked plans that only needs to be grounded into the robot's physical capabilities. The system is structured as a set of REST API endpoints so that it can be easily expanded by adding new APIs to improve the capabilities of the clients connected to the cloud. Another key feature of the system is that it has been designed to make the development of its clients straightforward: in this way, multiple devices can be easily endowed with the capability of autonomously interacting with the user, understanding when to perform specific actions, and exploiting all the information provided by cloud services. The article outlines and discusses the results of the experiments performed to assess the system's performance in terms of response time, paving the way for its use both for research and market solutions. Links to repositories with clients for ROS and popular robots such as Pepper and NAO are given.

In recent years, it has become more and more common to exploit cloud technologies to improve the efficiency of smart systems and devices. In the robotics field, this practice is defined as cloud robotics, i.e., the use of remote computing resources to enable greater memory, computational power, collective learning, and inter-connectivity for robotics applications [1]. Cloud-based solutions prove particularly appealing considering the plethora of robots hitting the market daily. Most of these robots are low-cost, with limited capabilities in terms of sensors, actuating devices, and onboard computing power. Up to a decade ago, it was unlikely that a family had in their house a device smarter than a cleaning, mopping, or lawn-mowing robot. However, following the success of home assistants such as Google Home or Alexa, market reports estimate that we can expect to be soon "invaded" by low-cost next-generation assistants and table-robots for health care, companionship, entertainment, and education (for instance: https://www.marketsandmarkets.com/Market-Reports/educational-robot-market-28174634.html and https://www.marketsandmarkets.com/Market-Reports/service-robotics-market-681.html). This process started a few years ago in Asia, with robots such as RoBoHoN, Kabo-chan, SOTA, UNIBO, and Dacky (Figure 1), followed by western countries with robots such as Amazon Astro, Jibo, Pillo, and QTrobot (Figure 2; https://luxai.com/robot-for-teaching-children-with-autism-at-home). The process is expected to be further boosted by the Covid-19 pandemic [2]. Each of the aforementioned robots has its own abilities of perception and action, and the ability to interpret specific user requests through either a dedicated smartphone application (e.g., iRobot Home App) or voice commands (e.g., "Hey Google...", "Alexa..."). The outsourcing of some robots' capabilities to third-party cloud-based solutions allows the devices to perform tasks that may be too heavy (and thus expensive) for local processing. For this reason, it can be a low-cost option to improve the robots' overall performance. Examples of activities that may be convenient to outsource are: (i) Dialogue: the ability to autonomously converse with the users, covering numerous topics while always taking into account users' preferences; (ii) Activity planning: the generation of complex plans "grounded" on the robot's capabilities; (iii) Collective learning: the acquisition and sharing of the results with other devices; (iv) Perception: the processing of data acquired by the robot's sensors to perform tasks such as face, speech, and speaker recognition.

Based on these premises, the main contribution of this work is a cloud system for autonomous interaction between humans and artificial agents: CAIR (Cloud-based Autonomous Interaction with Robots). CAIR has been developed with the aim of creating a cloud system providing the aforementioned abilities (i) and (ii), allowing social robots and other conversational agents to perform a sustainable knowledge-grounded autonomous interaction. Please notice that we use the term "sustainable" to refer to the fact that our solution is apt to deal with the unprecedented growth of the social robotics market while providing acceptable performance and cost. CAIR is structured so that it can be easily expanded by adding new services to improve the capabilities of the clients connected to the cloud. Moreover, the system manages multiple interactions at the same time (i.e., one with each client/robot), and recognizes both the intention of the users to make the agent perform specific actions and the intention to talk about specific topics. The system can be easily used by most devices with Internet connectivity, able to acquire an input through a keyboard or a microphone, and provide an output through a screen or a speaker (e.g., robots, computers, smartphones, smartwatches, etc.). Despite being customisable, please notice that the system is ready to be used: no additional work other than connecting to the cloud is required by social robot developers to manage the conversation or the plans. The system comes with about 3,000 conversation topics ready for "chit-chatting", and a library of pre-cooked plans to be grounded into the robot's physical capabilities, which will vary from robot to robot. To this end, the developed services are based on the use of REST APIs. REST stands for "Representational State Transfer", and it defines a set of rules to be followed when creating the API [3]. In the last decades, the AI and robotics communities have started developing web services following these rules to exploit the numerous advantages that they provide, such as scalability, flexibility, portability, and independence. Finally, it is worth anticipating that the ability of the system to understand the user's intention to make the device perform an action (Section III-C) and to converse naturally with the user through mixed-initiative dialogues (Section III-D) is based on a framework for cultural knowledge representation that relies on an OWL 2 Ontology [4].
Being paralleled by a framework for probabilistic reasoning, such an Ontology is designed to take into account the possible cultural differences between different users in a non-stereotyped way, both concerning conversation and action [5, 6, 7]. A mechanism to expand the cultural knowledge base at run-time with new concepts and conversation topics raised by the users (possibly self-identifying with different cultures) was introduced in [8].

The rest of the article is organized as follows. Section II presents an overview of previous works related to cloud robotics in different areas. Section III describes the architecture of the system, focusing on the design and implementation of the server, and provides a description of the main operations that a client should perform (two examples of already implemented clients are given). Section IV describes the experiments carried out to evaluate the performance of the system in terms of response speed. Section V presents the results and discusses the outcome of the experiments. Finally, Section VII draws the conclusions.

Cloud robotics is a field of robotics that exploits cloud technologies such as cloud storage and cloud computing. Cloud storage allows a client robot, computer, tablet, or smartphone to send and retrieve files online to and from a remote data server. Cloud computing is the delivery of computing resources over the Internet. Such computational resources may include storage, processing power, databases, networking, analytics, artificial intelligence, and software applications. The use of a cloud for other autonomous systems may provide several benefits such as increased computational power, access to large storage space, easier collective robot learning thanks to the sharing of data and control policies, and access to parallel grid computing on demand for statistical analysis, motion planning, and knowledge crowdsourcing [1, 9, 10]. For instance, [11] analyzes a cognitive industrial entity called context-aware cloud robotics (CACR) for advanced material handling. In this work, a CACR case study is shown to highlight its energy-efficient and cost-saving material handling capabilities. The decision-making mechanism for energy-efficient and cost-saving material handling is managed by a cloud scheduler that searches for the platforms with the smallest number of workpieces. A command is sent to the nearest robots to divert them to the required platform and fulfil the material handling request. In this way, the mechanism uses the least amount of robotic resources to meet the requirements of all the manipulation platforms. Another industrial example of cloud robotics is cloud-connected vehicles such as Google's self-driving cars [12]. Autonomous cars use the network to access Google's database of maps, satellite, and environment models (such as Street View) and combine those data with streaming data from GPS, cameras, and 3D sensors. This allows the cars to monitor their position to within centimetres, and to compare it with past and current traffic patterns to avoid collisions. Each car can gather information about road signs, pavement markings, lane closures, and traffic conditions, and send it to the cloud. Such data are then processed and used to improve the performance and safety of all cars.
To explore the benefits of cloud-based technologies for sensing, [13] presented a cloud-based collaborative visual Simultaneous Localization and Mapping (SLAM) system consisting of low-cost robots and remote Amazon servers, which allows real-time map estimation through parallel computing and implements a highly sophisticated system using low-cost components, thus lowering the technological threshold for further research. Concerning complex tasks that a single robot is not able to perform, cloud robotics is widely accepted as a promising approach to efficient robot cooperation [14] . This work discusses the potential benefits and critical challenges of robotic cooperation in cloud robotics when dealing with search and rescue in disaster management. The main benefits are the ability to mutually cooperate within the disaster site and the possibility of distributing the algorithms on the remote cloud. Among the critical challenges, there are quality of service issues, communication issues, safety issues, heterogeneity, and security issues. Other examples of cloud-based architectures in the industrial and service robotics domains are discussed in [15, 16, 17] . Despite the numerous benefits, [1] points out several key issues and challenges in cloud robotics such as communication issues due to network problems, latency issues, interoperability and portability due to the variety of infrastructures, platforms and APIs offered by cloud providers, and privacy and security of personal data uploaded on the cloud. In the Social Robotics domain, the outsourcing of services to third-party services in the cloud mostly concerns the aspects already mentioned in Section I: (i) dialogue [18, 19] , (ii) activity planning [20, 21, 22, 23] , (iii) collective learning [24, 25, 26, 27] , and (iv) perception [28, 29, 30, 31] . Social robots and conversational agents need to communicate in a way that feels natural to humans to effectively bond with them and provide an engaging interaction. Concerning dialogue (i), there are several cloud platforms providing tools to build, test, and deploy a conversational agent across multiple devices. Among such platforms we may consider IBM Watson Assistant 6 , Azure Bot Service 7 , and Amazon Lex 8 . However, as explained in [18] , despite the advantages provided by such cognitive services, building a full conversational agent that meets the requirements for social interaction is still challenging from a technical perspective. This is due to issues in the coordination of the cognitive services to build the agent interface, issues related to the integration of the agent with external services, and issues linked to extensibility, scalability, maintenance, and resource costs to run the agent. As regards the interaction capabilities of such conversational agents, they are not able to properly manage the dialogue, they are not easy to integrate with the actions that a robot can perform, they are not "grounded" with the data acquired by their sensors, and they do not have a system for knowledge representation [32] . To solve some of the technical issues related to the development of conversational agents, serverless computing [33] has recently emerged as an alternative way of creating backend applications. The cloud service provider of serverless applications automatically provisions, scales, and manages the infrastructure required to run the code, enabling developers to build applications faster. 
Traditional cloud computing provides users with computing resources whether they use them or not, while serverless computing allows users to dynamically pull only the resources they need. Major cloud vendors such as Amazon, Google, Microsoft, and IBM have created serverless versions of their frameworks. Serverless life-cycle costs are typically lower than costs for dedicated infrastructure, as vendors do not charge for idle time. The creation of serverless conversational agents that interact with a set of various commodity services publicly available on the Internet, such as a weather service, is discussed in [18]. The work also presents a prototype implementation of a chatbot that uses IBM Watson Developer Cloud services as AI building blocks and the Apache OpenWhisk 9 serverless computing service. The conversational agents built using the services provided by these cloud platforms are meant to be used in Q&A scenarios such as providing customer support, booking tickets, and ordering food. An interesting exception is SPeCECA [19], a smart pervasive chatbot for emergency case assistance based on cloud computing, which assists victims or incident witnesses to help avoid deterioration of the subject's condition, maintaining their physical integrity until help arrives. This type of conversational agent can reply to specific requests of the user, depending on the purpose for which it has been developed, but it does not engage in a complex, goal-oriented, mixed-initiative dialogue. Moreover, as already mentioned, such agents are not able to ground the conversation to data acquired by their sensors or to interpret the user's intention to make the agent perform specific actions.

Recently, the OpenAI company released an API 10 that provides access to GPT-3 11. Unlike the aforementioned conversational agents, GPT-3 is a generative model based on human-human data that uses deep learning to produce human-like utterances fitting the context, allowing an "open" conversation instead of working exclusively in Q&A scenarios. However, all generative models have the major drawback that they can learn undesirable features leading to toxic or biased language [34]. This drawback makes them unsuitable in sensitive contexts such as the development of socially assistive robots taking care of older people, or in any context where cultural sensitivity may be required.

In addition to being able to hold entertaining conversations with the users, social agents should also be able to model and reason about more complex tasks or activities to be carried out in cooperation with other agents. Activity planning (ii) has been extensively analyzed in the literature when dealing with a single agent embedded with the required onboard capabilities. Recently, cloud technologies have been used to improve planning capabilities: the computing resources of the cloud can be used to perform more complex computations and gather information that can be useful to all connected agents. A cloud-based system architecture for robotic path planning is presented in [20]. The cloud server contains a path plan database, which stores the solution paths that can be shared among robots. The system also provides on-demand path planning software in the cloud, which computes the optimal paths for robots to reach the goal positions. The authors have experimentally verified the feasibility and effectiveness of solving the shortest path problem via parallel processing in the cloud.
An example of task planning in a dynamic global environment framework is presented in [21] . The work proposes a cloud-based framework for real-time autonomous robot navigation with 3D visual semantic SLAM that exploits on-demand databases to store environment information. Another planning example to manage a fleet of autonomous mobile robots (AMR) using Rapyuta Cloud Robotics Platform is provided in [22] , whereas [23] discusses a multi-robot system based on cloud technologies, designed to execute tasks in a complex and crowded environment. The RoboEarth cloud engine [35] includes a database to store information that can be used in several different scenarios, including action recipes and skills, speeding up robot learning, and adaptation in complex tasks. In all the examples above, the cloud approach shifts the computation load from the agents to the cloud and provides powerful processing capabilities to the multi-robot system. Collective learning (iii) refers to the sharing, storing, and accumulation of information over time, a capability that allows social agents to work together efficiently. Examples of information that agents can post for collective learning are control policies, sensor information of physical traits of an environment, trajectories, tracking data, and updated localization data. To enable robots to perform human-level tasks flexibly in varying conditions, [24] argues that we need a mechanism that allows them to exchange knowledge. One approach to achieve this is to equip a cloud application with a range of encyclopedic knowledge under the form of an Ontology, and execution logs of different robots performing the same tasks in different environments. In a similar spirit, [25] and [26] describe a collective learning environment for ubiquitous robots. Sensors embedded in these robots can provide vast amounts of information that can be beneficial in further processing and collective learning. The RoboEarth three-layered architecture emphasizes the concept that each robot should allow other robots to learn through its knowledge and vice-versa. Thanks to a portfolio of web services, RoboEarth allows access to a database storing information that can be reused in several different scenarios including images, point clouds, models, maps, and object locations. The database also contains the semantic information that is associated with each element through an Ontology. Among applications, [27] investigates and assesses how a cloud robotic system can improve the provisioning of assistive services for the promotion of active and healthy ageing. In this scenario, the presence of a cloud robotic service is fundamental to design agents that can simultaneously monitor more elderly people at the same time, regardless of the time and the location of the seniors, through sensors located in the environment. The agents performed machine-to-machine (M2M) and machine-to-cloud (M2C) communications [36] to exchange data between them and the cloud. Perception (iv) assumes significant importance for humanrobot interaction. It is reasonable to identify four main classes of signals captured by a social robot: visual-based, audiobased, tactile-based, and range sensors-based. Robots collect such data through cameras, microphones, tactile sensors, and proximity sensors such as laser range finders, ultrasounds, infrared, or even RGB-D cameras. 
Semantic understanding includes processing and merging of sensor data for tasks such as speech-to-text translation, sound localization, natural language understanding, activity, gesture, posture, and emotion recognition, object localization and recognition, and many others. According to this rationale, [28] and [29] discuss the design and implementation of face recognition applications exploiting the benefits of cloud computing. The popularity of handy smart devices allows healthcare providers to monitor patients' health without visiting them. As proof, the work described in [30] proposes a cloud-assisted speech and face recognition framework for elderly health monitoring, where handheld devices or video cameras collect speech, along with face images, and deliver them to the cloud server for possible analysis and classification. A cloud-based framework for speech enabling healthcare is proposed in [31] . A person seeking some medical assistance can send their request by speech commands to a cloud server where such requests are managed and processed. Any doctor with proper authentication can receive the request and assist the person. Despite the examples provided, market attempts to exploit cloud computing with social agents are still limited and they are mainly confined to services offering dialogue capabilities. This section describes the architecture of the CAIR system, starting from a general overview, then detailing the implementation of the server and explaining the basic operations that a client should perform. The system is based on a client-server architecture to simplify the access to its services and the addition of new functionalities. Figure 3 depicts the architecture of the system. The server is composed of three web services developed in Python: (1) the Hub (Section III-B) that manages the incoming requests from the client, (2) the Plan Manager (Section III-C) that recognizes the intention of the user to make the agent execute a task, and (3) the Dialogue Manager (Section III-D) that manages the dialogue and recognizes the user's intention of talking about a specific topic. To provide appropriate answers and plans, the server exploits an Ontology [4] implemented in OWL 2 [37] containing all the topics, sentences, and plans used during the interaction with the user. The Flask-RESTful 12 framework is used to develop the web services. The client can perform requests to the server using REST APIs. As already mentioned in Section I, REST is a set of rules that should be followed when creating the API. One of these rules states that the client should be able to get a piece of data (called a resource) when linked to a specific URI. The operation performed by the user when accessing the resource is called request, while the data sent back to the user is called a response. Any web service that obeys the REST constraints is informally described as RESTful. Due to the separation between client and server, this protocol allows developers to use different syntax on different platforms. As shown in Figure 3 , the client simply has to acquire the user sentence, send it to the Hub service along with the client state, parse the response, store the updated client state, execute the received plan, and/or reply with the dialogue sentence returned by the Hub. How these operations are performed depends on the capabilities of the client. 
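As an illustration of this request/response cycle, the following minimal Python sketch shows what a text-only client could look like. It is only a sketch under stated assumptions: the endpoint URL, the JSON field names, and the bootstrap request that returns the initial state are placeholders invented for the example, not the actual CAIR API.

```python
import requests

CAIR_HUB_URL = "https://example.org/cair/hub"   # placeholder: the real endpoint is not given here

def cair_request(user_sentence, client_state):
    """Send the user sentence and the current client state to the Hub and
    return its reply (dialogue sentence, optional plan, updated state)."""
    payload = {"client_sentence": user_sentence, "client_state": client_state}  # field names are assumptions
    response = requests.post(CAIR_HUB_URL, json=payload, timeout=5)
    response.raise_for_status()
    return response.json()

# Minimal text-only client loop (keyboard in, screen out).
state = cair_request("", {})["client_state"]      # assumed bootstrap call returning the initial state
while True:
    reply = cair_request(input("You: "), state)
    state = reply["client_state"]                  # store the updated state locally for the next request
    if reply.get("plan_sentence"):
        print("Robot:", reply["plan_sentence"])    # sentence introducing the plan, if an intent was matched
    # a robot would ground reply.get("plan", []) into its own sensorimotor capabilities here
    print("Robot:", reply["dialogue_sentence"])
```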
The acquisition of the sentence can be performed either through a keyboard or a microphone; the plan will be performed only if the device running the client has the appropriate physical capabilities, otherwise it will be ignored. Eventually, the dialogue sentence will be displayed on a screen or communicated to the user through a speaker (assuming that the device is also equipped with a voice synthesizer). Please recall the role of the client state, which contains all the updated information about the client and is sent back and forth between the client and the server, to avoid storing any sensitive information on the server.

12 https://flask-restful.readthedocs.io/en/latest/

The Hub service is the facade of the CAIR system, and it is designed to receive all the requests from the clients. The diagram in Figure 4 shows the sequence of the interactions between the client and the services composing the CAIR system. The client sentence will be processed in a pipeline by the Plan Manager 13 (which aims to recognize the user's intents) and the Dialogue Manager (which aims to move the conversation forward). The Plan Manager will return a plan (and an appropriate introductory sentence to the plan) if and only if an intent is recognized, whereas the Dialogue Manager will always produce a sentence that the agent may say to move the conversation forward. More specifically, the first operation that the Hub performs is a request to the Plan Manager service to check if the user sentence matches any of a pre-defined set of intents associated with specific plans to be executed. These plans could either represent an action to be executed by the client or affect the knowledge base and/or the flow of the dialogue (in this case, referred to as a KBplan). If so, the service will return the KBplan or the plan, along with an introductory plan sentence to be used by the client before performing the first action in the plan (the difference between plan and KBplan is explained in more detail in Section III-C). The KBplan and the plan sentence will be collected together with the client sentence and the client state to perform a request to the Dialogue Manager service. The second operation that the Hub performs is a request to the Dialogue Manager, which will reply with a sentence to move the conversation forward. Figure 4 shows that, in this process, the Dialogue Manager may query the Plan Manager if the user agrees to perform an activity proposed by the system, which may require a corresponding plan. The Hub collects all the information provided both by the Plan Manager and the Dialogue Manager. Such a set of data is finally returned as a response to the client, which will implement it through its sensorimotor and verbal capabilities.

The Plan Manager service receives as input the user sentence. Its purpose is to find a match between such a sentence and one of the trigger sentences associated with a specific intent. An intent is defined by (i) a set of trigger sentences, (ii) one or more plan-specific sentences (if any), (iii) a KBplan (if any), and (iv) a plan (if any). Figure 5 shows some examples of intents that can be recognized by the system. In the current implementation, sentence matching is performed based on pattern-based syntax matching and allows the extraction of parameters from the matched sentence that can be used to dynamically compose the plan sentences, the plan, and the KBplan.
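For illustration only, the snippet below reconstructs two intents in the spirit of the examples in Figure 5 (an appreciation intent and a music-playing intent), together with a pattern-based matcher that extracts the parameter. The trigger patterns and the action encoding are assumptions, since the paper only names the fields that compose an intent.

```python
import re

# Illustrative intents; fields follow the description in the text, values are assumed.
INTENTS = [
    {
        "triggers": [r"i (?:love|really like) (?P<parameter>.+)"],
        "plan_sentence": "Good to know that you like $parameter!",
        "kbplan": [{"action": "setlikeliness", "topic": "$parameter", "value": "high"},
                   {"action": "jump", "topic": "$parameter", "startsentence": "p"}],
        "plan": [],
    },
    {
        "triggers": [r"play the song (?P<parameter>.+)"],
        "plan_sentence": "Sure, I will play $parameter.",
        "kbplan": [],
        "plan": [{"action": "playsong", "title": "$parameter"}],
    },
]

def match_intent(sentence):
    """Return the plan sentence, KBplan and plan of the first matching intent,
    with the extracted parameter substituted in; None if no intent matches."""
    for intent in INTENTS:
        for pattern in intent["triggers"]:
            match = re.match(pattern, sentence.lower().strip())
            if match:
                param = match.group("parameter")
                def fill(value):
                    return value.replace("$parameter", param) if isinstance(value, str) else value
                return {
                    "plan_sentence": fill(intent["plan_sentence"]),
                    "kbplan": [{k: fill(v) for k, v in a.items()} for a in intent["kbplan"]],
                    "plan": [{k: fill(v) for k, v in a.items()} for a in intent["plan"]],
                }
    return None

print(match_intent("I love music"))
print(match_intent("Play the song Hey Brother"))
```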
More complex models merging syntax matching and data-driven example-based approaches [38] can be easily integrated without any impact on the general structure. A KBplan, where KB stands for Knowledge Base, is a sequence of actions meant to affect the knowledge base and/or the flow of the dialogue. For instance, if the user says "I love music", this sentence will match the trigger sentences of the Appreciation Intent ( Figure 5 ) meant to recognize the user's appreciation for something and extract the loved thing as a parameter. The KBplan of this intent is composed of two actions: the first one is meant to modify the probability that the user wants to talk about the extracted parameter (what we will refer to as "likeliness" in the following Section), while the second one brings the information that the system should jump to that conversation topic (if present in the Ontology). A plan is a sequence of actions that should be executed on the client (given that it has appropriate capabilities to execute it) as it does not affect the knowledge base or the flow of the dialogue. For instance, if the user says "Play the song Hey Brother", this sentence will match one of the trigger sentences of the Music Intent ( Figure 5 ) that recognizes the user's intention to listen to some music. The plan of this intent is composed of a single action carrying the information that the client should play the song having the title contained in the parameter field (see Section III-E). Figure 3 shows that, if the Plan Manager service finds a match with an intent, a response containing the KBplan, the plan sentence, and the plan is returned to the Hub service. Finally, notice that triggered intents might be interpreted as goals to be achieved by planning a sequence of actions in run-time, instead of having pre-cooked sequences of actions. Thanks to the versatility of RESTful APIs, programmers may easily add a planner based on the most popular formalisms such as PDDL [39] , Hierarchical Task Networks [40, 41] , Answer Set Programming [42, 43] , Partially Observable Markov Decision Process [44, 45] without altering the general structure. In this way, a sequence of actions might be produced considering constraints related to the actual capabilities of the physical device that needs to implement them, including those related to interaction with humans [46, 47] . The Dialogue Manager service is in charge of managing the dialogue. Following previous work [6, 5, 48] , the system has been designed with the capability of conversing with the users taking into account their cultural background, by relying on a framework for cultural knowledge representation that relies on the Ontology implemented in OWL 2. According to the Description Logics formalism, concepts (i.e., conversation topics that the system is capable of talking about) and their mutual relations are stored in the terminological box (TBox) of the Ontology. Instead, instances of concepts and their associated data (e.g., chunks of sentences automatically composed to enable the system to talk about the corresponding topics) are stored in the assertional box (ABox). To deal with representations of the world that may vary across different cultures [49] , the Ontology is organized into three layers (as explained more in detail in [5, 6] ). The TBox ( Figure 6 ) encodes concepts at a generic, culture-agnostic level and includes concepts that are typical of all cultures considered, whichever the cultural identity of the user is, to avoid stereotypes. 
Consider beverages: the system may initially guess the user's preferred beverages depending on the country they live in. Nevertheless, it should be open to considering choices that may be less likely for a given culture, as the user explicitly declares their attitude towards them. For instance, the system may initially infer that an English person may be more interested to talk about Tea rather than Coffee, and the opposite may be initially inferred for an Italian user. However, during the conversation, initial assumptions may be revised. This mechanism leads to a fully personalized representation of the user's attitude towards all concepts in the TBox to be used for conversation. To implement this mechanism, the Culture-Specific ABox layer (CS-ABox in Figure 6 ) contains instances of concepts encoding culturally appropriate chunks of sentences to be automatically composed (Data Property hasSentence) and the probability that the user would have a positive attitude towards that concept, given that he/she belongs to that cultural group. This idea has already been introduced in the previous section, where we used the term "likeliness" referring to the probability of having a positive attitude towards a concept in the Ontology (Data Property hasLikeliness). Eventually, the Person-Specific ABox (PS-ABox in Figure 6 ) comprises instances encoding the actual user's attitude towards a concept updated during the interaction (the system may discover that Mrs. Dorothy Smith is more familiar with having tea than the average English person, instance DS_TEA with hasLikeliness=Very High). The PS-ABox also contains sentences, associated with each topic, which were explicitly encoded/added to the system by the caregivers (hasSentence="You can never get a cup of tea large enough or a book long enough to suit me"). During the first encounter between the robot and a user, many instances of the Ontology will not contain Person-Specific knowledge: the robot will acquire awareness about the user's attitude at run-time, either from its perceptual system or through verbal interaction, e.g., asking questions. A Dialogue Tree (DT) (Figure 6 ), used by the conversation system to chit-chat with the user, is built starting from the Ontology structure: each concept of the TBox and the corresponding instances of the ABox are mapped into a conversation topic, i.e., a node of the tree. The Object Property hasTopic and the hierarchical relationships among concepts and instances are analyzed to define the branches of the DT. In the example of Figure 6 , the instance of Tea for the English culture is connected in the DT to its child node GreenTea (which is a subclass of Tea in the Ontology) and its sibling MilkTea (since EN_MILK is a filler of EN_TEA for the Object Property hasTopic). Each conversation topic has chunks of culturally appropriate sentences associated with it, that are automatically composed and used during the conversation. Such sentences can be of different types (i.e., positive assertions, negative assertions, or different kinds of questions) and may contain variables that are instantiated when creating the tree. For instance, a hypothetical sentence "Do you like $hasName?" encoded in the concept Coffee might be used to automatically produce both "Do you like Coffee?" and "Do you like Espresso?" (being Espresso a subclass of Coffee). 
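A minimal sketch of this template-instantiation step is shown below; the $hasName template is the one used as an example above, while the one-branch taxonomy and the data structures are purely illustrative, not the Ontology encoding actually used by CAIR.

```python
from string import Template

# Tiny illustration of how a sentence template attached to an Ontology concept
# can be instantiated for the concept itself and for its subclasses while the
# Dialogue Tree is built.
SUBCLASSES = {"Coffee": ["Espresso"]}
TEMPLATES = {"Coffee": ["Do you like $hasName?"]}

def sentences_for(concept):
    """Compose the concrete sentences generated from the concept's templates."""
    sentences = []
    for node in [concept] + SUBCLASSES.get(concept, []):
        for template in TEMPLATES[concept]:
            sentences.append(Template(template).substitute(hasName=node))
    return sentences

print(sentences_for("Coffee"))
# ['Do you like Coffee?', 'Do you like Espresso?']
```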
In the basic version used for testing, the taxonomy of the Ontology allowed us to easily produce a DT with about 3, 000 topics of conversation and more than 20, 000 sentences, with random variations made in run-time. The reason for using a "safe" Ontology-based mechanism rather than a generative model has already been explained in Section II. However, in non-sensitive contexts where there is no need to have full control of what the robot says, an alternative solution might be pursued. To implement the knowledge-based conversation mechanism, the Dialogue Manager service requires as input the client sentence, the client state, and the KBplan. As described more in detail in Section III-E, the client state contains the current conversation topic, the type of the previous sentence chosen by the system (e.g., yes/no question, open question, assertion, etc.), the type of the following sentences that the system should use for that topic, the likeliness of all topics that have been mentioned in the conversation up to that moment (about which the user may have expressed a preference), and the sentences that have already been used by the system to talk about the covered topics (to avoid, as much as possible, unnecessary repetitions). Based on the Dialogue Tree, and referring to Algorithm 1, the key ideas for knowledge-driven conversation can be briefly summarized as follows (the whole process [5] is more complex). Firstly, recall that the Hub queries the Plan Manager before the Dialogue Manager. Hence, when the Dialogue Manager is queried, it receives the client sentence, the client state, and the KBplan returned by the Plan Manager (the latter can be empty if no intent was matched). Each time the Dialogue Manager receives the required information from the Hub, it checks if the KBplan contains at least an element (action). Currently, the KBplan can either contain both the "setlikeliness" and the "jump" action (as depicted in the Appreciation Intent in Figure 5 ) or it can be empty: the possibility of adding new actions in the future versions of the system has been expressed in Algorithm 1 by adding an empty else if statement at line 16. The "setlikeliness" action shall be interpreted as an explicit attempt to declare the attitude of the user towards a topic, i.e., when the user says "I love music" the action tells the Dialogue Manager to update the likeliness of the concept "music" (if it exists in the Ontology). Note that the likeliness is modified only for that specific client/user and that the server does not store any information about clients/users. The likeliness of the topic is updated only in the client state (sent back and forth between the client and the server), and it is set to the value contained in the corresponding field of the action (line 7). The "jump" action requires the Dialogue Manager to change the conversation topic to the one specified in the "topic" field of the action (if it exists in the Ontology) and to choose a sentence associated with the new topic. Since there are different sentences of different types for each topic (e.g., yes/no question, open question, assertion, etc.), the type of the sentence to be selected is specified in the "startsentence" field of the action. 
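The sketch below condenses the KBplan branch just described into a single function, under assumed data structures (the paper does not publish the code of Algorithm 1): the "setlikeliness" action updates the likeliness stored in the client state, the "jump" action changes the topic and the requested sentence type, and an unused sentence of that type is then picked.

```python
import random

# Toy Dialogue Tree standing in for the real one; field names and values are
# assumptions, not the encoding actually used by CAIR.
DT = {"Music": {"p": ["Music is good for your health!"], "q": ["Do you like Music?"]}}

def apply_kbplan(kbplan, state):
    """Sketch of the KBplan branch of Algorithm 1: apply "setlikeliness" and
    "jump" actions, then pick an unused sentence of the requested type."""
    for action in kbplan:
        if action["action"] == "setlikeliness":
            # stored only in the client state, never on the server
            state["likeliness"][action["topic"]] = action["value"]
        elif action["action"] == "jump" and action["topic"] in DT:
            state["topic"] = action["topic"]
            state["sentence_type"] = action["startsentence"]   # e.g. "p" = positive assertion
        # additional action types would be handled here (empty branch at line 16)
    pool = DT[state["topic"]][state["sentence_type"]]
    unused = [s for s in pool if s not in state["used_sentences"]] or pool
    sentence = random.choice(unused)
    state["used_sentences"].append(sentence)
    return sentence, state

state = {"topic": "Music", "sentence_type": "q", "likeliness": {}, "used_sentences": []}
kbplan = [{"action": "setlikeliness", "topic": "Music", "value": "high"},
          {"action": "jump", "topic": "Music", "startsentence": "p"}]
print(apply_kbplan(kbplan, state))
# ('Music is good for your health!', {...updated state...})
```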
As an example, in the Appreciation Intent shown in Figure 5, the "startsentence" parameter of the "jump" action is "p", which means that the system will choose one among the positive (affirmative) sentences associated with the topic extracted by the Plan Manager, starting from the client sentence (e.g., "Music is good for your health!"). After updating the client state (line 13) with the new topic and all the related information (i.e., likeliness, sentences that have already been used by the system, and the type of the following sentences to be used for that topic), the algorithm returns the chosen sentence and the updated client state (line 14). The client state will be stored on the client side and sent again to the server during the next interaction.

In case the KBplan does not contain any element (line 21 and below), the Dialogue Manager algorithm checks if the client sentence contains keywords encoded in the Ontology in a corresponding Data Property. To match a topic, at least two keywords associated with that topic should be detected in the sentence pronounced by the user, using wildcards to enable more versatility in keyword matching. As for intent matching, a more sophisticated mechanism for sentence matching might be adopted, merging syntax matching and data-driven example-based approaches [38]. The use of multiple keywords allows the system to differentiate between semantically close topics (i.e., Green Tea rather than the more general concept of Tea).
• If the user sentence contains the two keywords corresponding to at least one topic in the Ontology (line 23), the algorithm chooses one among the matching topics (randomly or based on their likeliness) to proceed with the dialogue. Specifically, every time the conversation topic changes, the system starts talking about the new topic by randomly picking a sentence of a given type (line 25). The client state is updated and returned along with the chosen sentence (lines 26 and 27).
• Otherwise:
  - If there are no more relevant questions to be asked or assertions to be made about the current conversation topic (line 29), the next topic is chosen either by exploring the DT along its branches (i.e., favouring semantically close topics, if any) or by choosing a completely different topic close to the root of the DT. As in the previous case, the system will start talking about the new topic by randomly picking a sentence of a given type (line 31). The client state and a dialogue sentence are returned (lines 32 and 33).
  - If there are still relevant questions to be asked or assertions to be made about the current conversation topic (line 34), the system stays on that topic. While in the same topic, a new sentence is randomly chosen at each iteration among a portfolio of different types of sentences (e.g., yes/no question, open question, assertion, etc.). The client state and a dialogue sentence are returned (lines 36 and 37).

As shown in Figure 4, the Dialogue Manager can also directly perform a request to the Plan Manager service. For the sake of simplicity, this eventuality has not been reported in Algorithm 1. However, during the conversation, the system can also propose activities to the user, e.g., "Do you want me to play some music?" (proposals for activities are among the sentence types that can be chosen while exploring a topic). In case the user answers affirmatively, the Dialogue Manager sends to the Plan Manager a trigger sentence associated with the proposed activity, and stores the response (i.e., the plan and the plan sentence).
This information, if present, will be then returned as a response to the Hub along with the dialogue sentence (i.e., the actual continuation of the dialogue), and the updated client state. If a client has never interacted with the server, the first thing that it should do is to perform a request to the cloud, in particular to the Hub service, to obtain the initial client state. The state will be stored locally and retrieved before all the following requests. Please remember that, thanks to this mechanism, we store all relevant information on the client, therefore creating a server that is "stateless" in all possible senses. Together with each request, the client communicates its current state, which will be updated by the server and transmitted back, together with other data. In addition to being beneficial in terms of privacy, since no personal data is stored on the server, this makes the whole RESTful architecture very efficient to debug. All CAIR services, possibly residing on different machines for load-balancing purposes, can be called at any time without any need for synchronization. Along with the client state, the first request also returns a sentence to begin the conversation with. After the initial request, every time the client interacts with the system it should provide its client state, containing information about: • The current conversation topic; • The type of the last sentence chosen by the Dialogue Manager (e.g., yes/no question, open question, assertion, etc.), which will be used to move the conversation forward; • A list containing the types of the following sentences that the server may use while in the current conversation topic; • The set of the "likeliness" values that have been modified during previous conversations with a given user, encoding their personal preferences about covered topics. These values may override the likeliness values that were set based on the user's cultural background; • The set of sentences that have already been suggested by the Dialogue Manager, to limit repetitions of the same sentences. Along with the initial client state, the client should acquire the user sentence (e.g., through a text-or speech-based interface, depending on its physical capabilities), and send them to the Hub. As soon as the client receives a reply, it shall manage the response appropriately. This includes storing the updated client state, communicating the plan sentence to the user (e.g., through a text-based or audio-based interface), performing the actions contained in the plan field of the response, and eventually continuing the dialogue by communicating the dialogue sentence. If the client is not able to execute certain actions, it can ignore them and consider only the dialogue reply (e.g., the Pepper robot could perform the action "Go to the kitchen", while the Pillo robot could not; yet, Pillo can dispense pills). It is worth restating that the client state does not contain personal information about the user (e.g., name, gender, phone number, etc.). Whenever needed, this information can be stored locally on the client device and substituted to the placeholders contained in the dialogue sentence. For instance, the placeholder $name in the sentence "Hello $name, how are you?" should be substituted on the fly with the locally-stored name of the user, before saying/visualizing the sentence. An example of a simple client for PC is available 14 . 
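The following fragment illustrates a plausible shape for such a client state, together with the client-side substitution of placeholders; the field names and the serialization are assumptions, as the paper does not specify them.

```python
# Plausible shape of the client state described above (field names assumed).
client_state = {
    "topic": "Tea",                              # current conversation topic
    "last_sentence_type": "q",                   # e.g. a yes/no question
    "next_sentence_types": ["p", "o"],           # sentence types still available for this topic
    "likeliness": {"Tea": "very high"},          # preferences expressed so far (override cultural defaults)
    "used_sentences": ["Do you like Tea?"],      # already-used sentences, to limit repetitions
}

def personalize(sentence, local_profile):
    """Replace placeholders such as $name with data stored only on the client device."""
    for key, value in local_profile.items():
        sentence = sentence.replace("$" + key, value)
    return sentence

print(personalize("Hello $name, how are you?", {"name": "Dorothy"}))
# Hello Dorothy, how are you?
```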
The code of the simple PC client is provided along with a guide that explains in detail how it works and all the plans that the server can return to the client, based on the intent that has been matched. Another documented example of a full client for the SoftBank Robotics robots Pepper and NAO, which manages all the plans returned by the server, is available as well 15. A video showing some extracts of the interaction is available on YouTube 16. A ROS2 wrapper for the client has also been developed 17. The system described in this paper is currently being used in our laboratory for research in Human-Robot Interaction (Figure 7).

14 https://github.com/lucregrassi/CAIRclient_example
15 https://github.com/lucregrassi/CAIRclient_SoftBank
16 https://www.youtube.com/watch?v=hgsFGDvvIww
17 https://github.com/lucregrassi/CAIRclient_ROS2

Experiments to evaluate the system's capabilities in terms of quality of the interaction and usability, as well as the impact on quality of life, were performed with care home residents in the UK and Japan [7]. The goal of the experiments described in this article is to analyze the system performance in terms of response speed when multiple clients connect to the cloud in different configurations. A system with good performance paves the way to the possibility of making it available as a cloud service for robotic scientists and companies around the world.

Some initial considerations are necessary. From the experiments conducted in [50], it emerges that users have the highest level of satisfaction with a maximum delay of two seconds during the conversation. As a confirmation of this, the empirical study in [51] about the response time of a communication robot revealed that users' evaluation of the robot's reply worsened after a response delay of two seconds. Also, we assume that, when interacting with our system, a user might perform, on average, six requests per minute: this takes into account the time required by the person to talk, the time required by the cloud server to reply, and the time needed by the robot to talk. In the Literature there are various performance tests for cloud systems [52]:
• Baseline test: examines how a system performs under expected or normal load and creates a baseline with which other types of tests can be compared. It aims to find metrics for system performance under normal load;
• Scalability test: checks whether the system scales appropriately to the changing load;
• Load test: finds metrics for system performance under high load;
• Stress test: finds the load volume where the system breaks or is close to breaking;
• Soak test: finds system instabilities that occur over time and makes sure no unwanted behaviour emerges over a long period.

Typically, the main factors that can affect the performance of a web service are the payloads (request and response bodies) and the number of requests within a certain amount of time. Therefore, we first focused on executing the Baseline test to evaluate the performance of our system in terms of average response time (i.e., the difference between the time when the request was sent by the client and the time when the response was fully received). We performed this test considering four different payloads (each containing the client sentence and a client state of different size) to understand the impact of the request data size on the response time. Then, we executed the Scalability test to assess how the average response time increases with a growing number of requests.
This aspect is fundamental because it ensures that the system is sustainable (in terms of response speed) even when many devices are using the system. The other performance tests were not performed as our system is still in its development stage, and our current aim is not to assess how the system behaves under high loads or to find instabilities over a long activity period.

The web services composing the CAIR server were located on a machine hosted on a cloud service. The machine, running Ubuntu 20.04, was equipped with 16 GB of RAM and 2 vCPUs of an Intel Cascade Lake 8260 @ 2.4 GHz. The clients were running on an M1 MacBook Pro with 16 GB of RAM, connected to the Wi-Fi network of our laboratory with download and upload speeds of approximately 80 Mbps, and a 10 ms ping. The experiments were performed using Wi-Fi as we assume that most devices would use this type of connection rather than an Ethernet cable. To ensure that the results were not biased by any network overload, we carried out the same experiments using the Ethernet, and we verified that the difference in response speed was always negligible.

The HTTP requests that clients would perform to the server were simulated with the Apache JMeter application. JMeter [53] is a Java application designed to load-test functional behaviour and measure performance. Its full multi-threading framework allows concurrent sampling by many threads and simultaneous sampling of different functions by separate thread groups. Being open-source, simple to use, platform-independent, and suitable to perform any kind of performance test, JMeter is frequently mentioned as a viable solution in the Literature [54, 55]. This application allows simulating a group of threads (i.e., different clients/users) sending requests to a target server. A parameter called "ramp-up period" tells JMeter how long it takes to "ramp up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running: each thread will start 10 (100/10) seconds after the previous one has started. Once the simulation ends, JMeter returns statistics that show the performance of the target server. For each thread group, in addition to measuring the average response time (provided by JMeter), we also kept track of the average server processing time: that is, the time the Hub takes to respond to the client, measured from the first to the last instruction executed by the server (i.e., without considering the time required by the server for the management of multiple requests in the queue, the creation of threads to handle incoming requests, etc.). The difference between the average response time and the average server processing time provides fundamental information on how the network influences the response time, and on how concurrent requests are managed on the server side.

The number of clients concurrently accessing the web services is not the only element that may affect the performance. As already described in Section III-E, the client state contains fundamental information about the conversation with a specific client. The amount of information encoded in the state grows as the dialogue proceeds. Among the elements that contribute most to the growth of the client state, every time the user expresses a preference about a topic, the corresponding likeliness is modified and its value is stored in the client state.
Also, every time a new topic is mentioned, the state must keep a record of the sentences that have already been used by the system when talking about that topic. Given that the client state can grow in subsequent iterations, it was crucial to test how the increasing size of the request payload affects the average response time. According to this rationale, the Baseline test was performed by considering four scenarios:
1) The client has just started interacting with the system. No likeliness value has yet been modified and no sentence has yet been used by the system. In this case, the payload size was 369 bytes;
2) The client has expressed a preference about 1/3 of the topics present in the Dialogue Tree (i.e., 926 topics) and information about the sentences associated with those topics has been encoded in the client state, increasing the payload size to 6384 bytes;
3) The conversation has gone further, and 2/3 of the topics have been covered (i.e., 1853 topics and 12213 bytes);
4) The client has expressed a preference about all the topics present in the Dialogue Tree (i.e., 2780 topics) and information about all the sentences present in the Ontology has been encoded in the client state, reaching the maximum payload size of 18166 bytes.

For each of the previous scenarios, 30 requests spaced five seconds apart were performed by a single thread/client. We recorded the response times and the server processing times of each request and then computed the average of the two sets of times to determine to what extent they were impacted by the payload size.

The Scalability test was carried out in two scenarios. The first one simulated an increasing number N of users performing requests simultaneously, while the second one considered an increasing number N of users performing evenly distributed requests over a 10-second period. For the first Scalability test, the aforementioned ramp-up period was set to zero, meaning that JMeter started all the threads almost at the same time, while for the second Scalability test, the ramp-up period was set to 10 seconds. To cover the worst-case scenario, both tests were performed using the greatest request payload (i.e., simulating clients that have expressed a preference about all the topics in the DT). For each scenario, we measured the response times and server processing times of the group of N threads 30 times, with the N threads distributed over the respective ramp-up period. Each group of N threads started approximately five seconds after the previous group had ended. Then, we computed the average response time and the average processing time versus the number N of users in the two configurations (i.e., simultaneously or over 10 seconds). For these experiments, we established a threshold of one second as the maximum acceptable average response time. Please notice that the threshold is below the delay reported in experiments about people's perception during a dialogue with a conversational system [50, 51]. Setting the threshold to one second allows for variations that can be due to the load, the network performance, or the additional time required to perform the speech-to-text transcription. This ensures a higher satisfaction during the conversation.

V. RESULTS

Figure 8 shows the results of the Baseline test. On the x-axis, we report the payload size of the request in the four considered scenarios. The y-axis shows the time in ms, computed as the average of 30 independent requests (i.e., with no overlapping time).
The blue line represents the average response time, while the orange line depicts the average server processing time. Figure 9 depicts the results of the Scalability test carried out with N simultaneous requests performed by N threads, while Figure 10 reports the results of the Scalability test performed by distributing the N requests over a 10-second period. In both cases, the x-axis reports the number N of requests performed during the respective ramp-up period, while the y-axis shows the time in ms, computed as the average of 30 independent tests (each with N users). As in the previous graph, the blue line reports the results for the average response time, while the orange line depicts the results for the average server processing time. The dotted line highlights the threshold for the average response time. Note that, during all the experiments, the bandwidth occupation was always far below the maximum capacity of the used connection (measured with a speed test 18).

By looking at Figure 8, it can be noticed that the standard deviations of the response times are always higher than those of the processing times: this may be due to fluctuations in the speed of the network. Also, when performing the Baseline test with the greatest payload, the average response time reached a value of 189 ms, while the average server processing time was 107.4 ms. This difference is mainly due to the network latency, which strongly influences the response time. However, this test suggests that, even in the worst-case scenario (i.e., with the highest payload size), the performance of the system under a normal load (i.e., non-overlapping requests) guarantees a high level of satisfaction of the user interacting with the system.

From Figure 9, it is immediately apparent that, as in the previous test, the standard deviations of the average response times are always higher than those of the average server processing times. When considering the average response time, the threshold of one second is reached with approximately 20 simultaneous requests, while the average server processing time reaches the threshold with around 50 concurrent requests. The server processing time can be interpreted as a limit case showing how the system would behave if no network delay were present (simulating the case of a local server). Observing the results in Figure 10, in addition to the large difference between the standard deviations, it can be seen that the average response time reaches the threshold with around 250 requests distributed over a 10-second period, while the average server processing time tends to stabilize at around 450 ms. This is because the server can manage up to a certain number of concurrent requests: those arriving afterwards are queued and have to wait some time before being processed. The server processing time, measured from the first to the last instruction executed by the server, tends to be constant. Delays are mostly due to the network and to the queuing mechanism for managing a large number of concurrent requests.

Given these results, it is interesting to make some considerations about the maximum number of users that the server can handle. We assumed that, when interacting with the system, a user can perform, on average, six requests per minute. However, it is also reasonable to assume that, among all the M users that subscribed to the system and can virtually use it, only a ratio R will be concurrently accessing its services at a given time of the day.
Given these results, it is interesting to make some considerations about the maximum number of users that the server can handle. We assumed that, when interacting with the system, a user performs, on average, six requests per minute. However, it is also reasonable to assume that, among all the M users who have subscribed to the system and can virtually use it, only a ratio R will be concurrently accessing its services at a given time of the day. This yields N = RM users that may concurrently make a request. Computing R is not the purpose of this work: it should be based on several considerations, including the users' geographical distribution, the times of the day when they are more likely to connect to the system, how long they use the system on average, and so on. Suppose that R = 0.2: if 200 users are subscribed to the cloud services, it turns out that N = 40 users may concurrently connect to the services. Knowing R plays a key role in cloud sizing, as it allows us to determine the number of users M who can subscribe to the system given the number of users N that can concurrently connect to the cloud without compromising the quality of service. If we know the maximum number N of concurrent users (in our tests, N = 20 in the unrealistic case of simultaneous requests, and N = 250 in case the requests are uniformly distributed), we can estimate the maximum number of users that can subscribe to the system as M = N/R. In our case, M = 100 if we want to ensure a one-second delay under any possible condition, while M = 1250 if we aim at average performance.
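A minimal sketch of this sizing rule, using the same illustrative ratio R = 0.2 as in the text, reproduces the figures above:

```python
def max_subscribers(n_concurrent: int, concurrency_ratio: float) -> int:
    """Estimate the number of subscribers M the cloud can accept, given the
    maximum number N of concurrent requests it sustains and the expected
    fraction R of subscribers active at the same time (M = N / R)."""
    return int(n_concurrent / concurrency_ratio)

R = 0.2
print(max_subscribers(20, R))   # simultaneous requests (worst case) -> 100
print(max_subscribers(250, R))  # requests spread over 10 seconds    -> 1250
```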
As a final consideration, it may be observed that the proposed approach, despite the benefits of not storing the conversation state on the server, has obvious limitations. If the Ontology is expanded and more conversation topics are added, the maximum size of the client state grows, leading to an increase in the response time. However, our previous experiments with care home residents [56] revealed that the current Ontology, with about 3,000 topics of conversation, is comprehensive enough to allow users to have an engaging conversation with the system. The situation in which all 3,000 topics have been explored, which would produce a client state reaching its maximum theoretical upper bound, is quite unlikely to occur: people tend to repeatedly explore the topics they like the most, and this is especially true for older people. The system allows users to express their own opinions and tell their own stories by showing interest in what they say, a possibility that people appreciate more than listening to what the robot has to say. Should these considerations not be confirmed in the future, moving part of the client state to the cloud is a possibility to be explored.

VI. CONCLUSION
This work presented the architecture of a cloud system designed to allow interaction with social robots and other conversational agents. Our proposal is based on the consideration that a huge number of low-cost smart devices for social interaction are expected to hit the market soon and will not have the onboard computational capabilities to perform the complex operations required for interacting with humans. The server includes an easily expandable portfolio of web services that comply with REST rules. A Hub service collects the requests and redirects them to the Plan Manager and the Dialogue Manager services. The former is in charge of recognizing the user's intention to execute an action, while the latter manages the dialogue. To provide appropriate plans and answers, these services exploit the information encoded into an Ontology, specially designed to take into account the cultural background of the user. The cultural aspect of the Ontology is not discussed in detail in this article (see [6, 5]). The proposed system was designed to be used with several devices equipped with Internet connectivity and capable of interacting with the users. The client devices perform requests to the Hub service and interpret the replies received from the cloud, as sketched below. The capability of interpreting the replies to properly interact with humans and the environment depends on the physical capabilities of the devices in terms of sensors and actuators.
The experiments carried out aimed at assessing the performance of the system in terms of response speed. The Baseline test was meant to investigate how the system performs under a typical load: its results showed that, even with the maximum payload, the average response time remains within 200 ms. The Scalability tests had the objective of investigating whether the system scales appropriately with an increasing load: the first one revealed that the system can support up to 20 simultaneous requests, while the second one showed that about 250 users can perform evenly distributed requests over 10 seconds without exceeding an average response time of one second. These findings will provide us with the basis to size the system, paving the way to a sustainable solution for verbal and non-verbal interaction with low-cost robots and other smart devices.
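To make the client side concrete, the loop below sketches how a minimal device could use the system: it sends the recognized user sentence together with the locally stored client state to a hypothetical Hub endpoint, speaks the answer, and executes the returned plan. The URL, the request and reply fields, and the say()/execute() helpers are assumptions for illustration; they do not reproduce the actual API of the ROS, Pepper, or NAO clients.

```python
import requests

HUB_URL = "https://example.org/cair/hub"  # placeholder Hub endpoint
client_state = {}                         # kept on the device and sent with every request

def say(text: str) -> None:
    print("ROBOT:", text)                 # stand-in for the device's text-to-speech

def execute(plan: list) -> None:
    for action in plan:                   # ground each abstract action into the robot's capabilities
        print("ACTION:", action)

while True:
    user_sentence = input("USER: ")       # stand-in for speech-to-text
    reply = requests.post(HUB_URL, json={
        "client_sentence": user_sentence,
        "client_state": client_state,
    }).json()
    client_state = reply.get("client_state", client_state)  # updated state is stored locally
    say(reply.get("answer", ""))
    execute(reply.get("plan", []))
```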
REFERENCES
Cloud robotics: Current status and open issues
Robots come to rescue: How to reduce perceived risk of infectious disease in COVID-19-stricken consumers?
REST API design rulebook: Designing consistent RESTful web service interfaces
Formal ontology and information systems
A feasibility study of culture-aware cloud services for conversational robots
Knowledge representation for culturally competent personal robots: Requirements, design principles, implementation, and assessment
Knowledge-grounded dialogue flow management for social robots and conversational agents
Knowledge triggering, extraction and storage via human-robot verbal interaction
A survey of research on cloud robotics and automation
A comprehensive survey of recent trends in cloud robotics architectures and applications
Context-aware cloud robotics for material handling in cognitive industrial Internet of Things
Waymo - formerly the Google self-driving car project
Cloud-based collaborative 3D mapping in real-time with low-cost robots
A study of robotic cooperation in cloud robotics: Architecture and challenges
Industrial cloud robotics towards sustainable manufacturing
Cooperative cloud robotics architecture for the coordination of multi-AGV systems in industrial warehouses
Cloud robotics in smart manufacturing environments: Challenges and countermeasures
Building a chatbot with serverless computing
SPeCECA: A smart pervasive chatbot for emergency case assistance based on cloud computing
Path planning as a service PPaaS: Cloud-based robotic path planning
A real-time autonomous robot navigation framework for human-like high-level interaction and task planning in global dynamic environment
Managing a fleet of autonomous mobile robots (AMR) using cloud robotics platform
Cloud-based multi-robot path planning in complex and crowded environment with multi-criteria decision making using full consistency method
The exchange of knowledge using cloud robotics
Ubiquitous robotics: Recent challenges and future trends
A study of effective social cues within ubiquitous robotics
A cloud robotics solution to improve social assistive robots for active and healthy aging
Cloud-Vision: Real-time face recognition using a mobile-cloudlet-cloud acceleration architecture
Face recognition system (FRS) on cloud computing for user authentication
Cloud-assisted speech and face recognition framework for health monitoring
Automatic speech recognition using interlaced derivative pattern for cloud-based healthcare system
A review of verbal and non-verbal human-robot interactive communication
Serverless computing: Design, implementation, and performance
The radicalization risks of GPT-3 and advanced neural language models
Cloud robotics: Architecture
OWL 2 Web Ontology Language structural specification and functional-style syntax
Neural belief tracker: Data-driven dialogue state tracking
PDDL2.1: An extension to PDDL for expressing temporal planning domains
HTN planning: Complexity and expressivity
Autonomously constructing hierarchical task networks for planning and human-robot collaboration
Answer set programming and plan generation
Answer set programming for collaborative housekeeping robotics: Representation, reasoning, and execution
Planning and acting in partially observable stochastic domains
Online planning algorithms for POMDPs
Human-robot proxemics: Physical and psychological distancing in human-robot interaction
Human-robot proxemics: Physical and psychological distancing in human-robot interaction
The CARESSES study protocol: Testing and evaluating culturally competent socially assistive robots among older adults residing in long term care homes through a controlled experimental trial
Ontology is just another word for culture: Motion tabled at the 2008 meeting of the Group for Debates in Anthropological Theory
Understanding user perceptions of robot's delay, voice quality-speed trade-off and GUI during conversation
How quickly should communication robots respond?
API testing strategy
Comparative analysis of web applications using JMeter
Research on performance automation testing technology based on JMeter
The CARESSES randomised controlled trial: Exploring the health-related impact of culturally competent artificial intelligence embedded into socially assistive robots and tested in older adult care homes

Lucrezia Grassi received her master's degree in Robotics Engineering in 2020 with the thesis "A Knowledge-Based Conversation System for Robots and Smart Assistants", and she is currently pursuing her Ph.D. on cloud services for social robots and multiparty interaction between humans and artificial agents.

The authors declare that they have no conflict of interest.