Personalized Chatbot Trustworthiness Ratings
Biplav Srivastava, Francesca Rossi, Sheema Usmani, and Mariana Bernagozzi

Conversation agents, commonly referred to as chatbots, are increasingly deployed in many domains to allow people to have a natural interaction while trying to solve a specific problem. Given their widespread use, it is important to provide their users with methods and tools to increase their awareness of various properties of the chatbots, including non-functional properties that users may consider important in order to trust a specific chatbot. For example, users may want to use chatbots that are not biased, that do not use abusive language, that do not leak information to other users, and that respond in a style appropriate for the user's cognitive level. In this paper, we address the setting where a chatbot cannot be modified, its training data cannot be accessed, and yet a neutral party wants to assess and communicate its trustworthiness to a user, tailored to the user's priorities over the various trust issues. Such a rating can help users choose among alternative chatbots, developers test their systems, business leaders price their offering, and regulators set policies. We envision a personalized rating methodology for chatbots that relies on separate rating modules for each issue, and on users' detected priority orderings among the relevant trust issues, to generate an aggregate personalized rating for the trustworthiness of a chatbot. The method is independent of the specific trust issues and is parametric to the aggregation procedure, thereby allowing for seamless generalization. We illustrate its general use, integrate it with a live chatbot, and evaluate it on four dialog datasets and representative user profiles, validated with user surveys.

Conversation is a hallmark of intelligence and a major way in which humans communicate. This is why businesses wanting to build AI-based systems to increase productivity and improve customer experience are interested in automated conversation systems, dialog systems, or chatbots. There are many platforms available to create chatbots (Accenture 2016; McTear, Callejas, and Griol 2016). However, such systems can be fraught with ethical risks. An extreme example is the Tay Twitter chatbot (Neff and Nagy 2017), released by Microsoft in 2016, which was designed to engage with people on open topics and learn from feedback, but ended up being manipulated by users into exhibiting unacceptable behavior via its extreme responses. Another example is shown in Figure 1, where a chatbot that answers train delay queries may raise unexpected trust issues (Mishra, Gaurav, and Srivastava 2018). It can become abusive, leak user travel information to other users and developers, and use an incomprehensible conversational style (see more examples in the Illustration section later). More generally, several potential ethical issues can be identified in dialogue systems built using learning methods, such as showing implicit biases from data, being prone to adversarial examples, being vulnerable to privacy violations, the need to maintain the safety of people, and concerns about the explainability and reproducibility of responses (Henderson et al. 2017).
After a chatbot is built, it is usually deployed as a shared service available to users via Internet-connected devices (such as computers or mobiles) or embodied systems (such as robots, Amazon Alexa, or Google Home). In the process of interaction, data is passed between the user and the chatbot service provider. Many interested parties could be concerned about the chatbot's behavior. They include at least the following ones.

Users: Users are concerned with the value they derive by interacting with the chatbot and expect it to follow social and business norms similar to those they expect from human assistants. One example is that the user's information, whether sensitive or otherwise, should not be revealed to other users without their permission. The repercussion of such a breach is not only loss of trust; it can also be illegal, depending on the type of user (such as child, adult, patient, celebrity, or government official) and the context of usage. This is further complicated by the fact that, over time, a chatbot may get personalized to a user's needs, but the person may not want to share this personalized information with the developers. Another user concern is the possible use of abusive language by the chatbot. Yet another issue could be the use of language that the user does not understand, because it is too complex or not appropriate for their knowledge of the subject.

Figure 1: Sample interaction of a user with a train assistance chatbot for Indian railways (Mishra, Gaurav, and Srivastava 2018).

Developers: Developers want the chatbot to perform as designed and may worry that it could say something which gives an unintended perception to human users. Examples of developers' concerns relate to biased behavior (that is, the chatbot should not be prone to erratic responses in the presence of protected variables like gender or race) and language usage (that is, the chatbot should not respond with hateful or abusive language).

Data providers: They provide the data used by chatbot developers during the training phase, ranging from the domain of discourse (e.g., financial data) and encyclopedic information (e.g., Wikipedia) to language resources (e.g., word embeddings). Data providers want to make sure their data is of the best quality feasible.

In this paper, we address the setting where a chatbot cannot be modified and its training data cannot be accessed, and a neutral party wants to assess and communicate the trustworthiness of chatbots in the context of a user's priorities over the issues. Such a rating can help users choose among alternative chatbots, developers test their systems, business leaders price their offering, and regulators set policies. We envision a personalized rating methodology for chatbots that relies on separate rating modules for each issue, and on users' detected priority orderings among the issues, to generate an aggregate personalized rating for the trustworthiness of a chatbot for a certain user profile. We focus on 4 issues: Fairness and Bias (B), Information Leakage (IL), Hate and Abusive Language (AL), and Conversation Complexity (CC). Table 1 illustrates these issues on some dialog datasets. However, our framework and methodology are general and can be extended to other issues.
We make the following contributions: (a) we introduce the notion of a contextualized rating of the trustworthiness of a chatbot; (b) we propose a method to compute the rating by using relative importance rankings over issues, provided by users; (c) we present an architecture to implement the rating approach as a service; (d) we integrate the method with a live chatbot; and (e) we evaluate our approach on four dialog datasets and representative user profiles, which we validate in a user survey.

A dialog is made up of a series of turns, where each turn is a series of utterances by one or more participants playing one or more roles. For example, in the customer support setting, the roles are the customer and the support chatbot. The core problem in building chatbots is that of dialog management (DM), i.e., creating useful dialog responses to the user's utterances (McTear, Callejas, and Griol 2016). There are many approaches to tackle DM in the literature, including finite-space, frame-based, inference-based, and statistical learning-based (Crook 2018; Clark, Fox, and Lappin 2010; Inouye 2004; Young et al. 2013), of which finite-space and frame-based are the most popular with mainstream developers. In a representative invocation, the user's utterance is analyzed to detect her intent and a policy for the response is selected. This policy may call for querying a database, and the result of the query execution is then used by the response generator to create a response, usually based on templates. The system can dynamically create one or more queries, which involves selecting tables and attributes, filtering values, and testing for conditions, and it may assume defaults for missing values. It may also decide not to answer a request if it is unsure of the correctness of a query's result. Note that the DM module may use one or more domain-specific databases as well as one or more domain-independent sources like language models and word embeddings. The latter have been found to be a possible source of human bias (Caliskan, Bryson, and Narayanan 2017).

Trust is a very important factor in AI development, deployment, and usage. Users and stakeholders should be able to have justified trust in the AI systems they use, otherwise they will not adopt them in their everyday life. In general, there are various dimensions of trust to be considered, ranging from robustness to reliability, and from transparency to explainability and fairness. We focus on a subset of issues whose checkers are available and robust.

Abusive Language: An important issue in the usage of a chatbot is the possibility of hate and abusive speech. This can make the chatbot unacceptable or inappropriate to some users, harm people in unintended ways, and expose service providers to unknown risks and costs. There is a growing body of work on detecting hate speech (Davidson et al. 2017) and abusive language (Wang et al. 2014) online using words and phrases which people have annotated. The authors of the former paper define hate speech as language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group. Their checker, which we use in our work, relies on logistic regression with L2 regularization to automatically detect hate speech and offensive language.
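To make this concrete, the following is a minimal sketch of a classifier in the same spirit as the checker just described: TF-IDF features with an L2-regularized logistic regression. It is not the released checker of (Davidson et al. 2017); the toy training utterances, feature settings, and scoring weights are illustrative assumptions.

    # Minimal sketch of a hate/abusive-language checker in the spirit of
    # Davidson et al. (2017): TF-IDF features + L2-regularized logistic regression.
    # The training examples, labels, and scoring weights below are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Hypothetical annotated utterances: 0 = hate speech, 1 = offensive, 2 = neither
    train_texts = ["you are wonderful", "this answer is garbage", "I hate you people"]
    train_labels = [2, 1, 0]

    checker = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
        ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
    ])
    checker.fit(train_texts, train_labels)

    def abusive_language_score(utterances):
        """Return a [0, 1] risk score for a list of chatbot utterances:
        weighted share of utterances flagged as hate speech or offensive."""
        preds = checker.predict(utterances)
        weights = {0: 1.0, 1: 0.5, 2: 0.0}  # hate speech weighted more than offensive
        return sum(weights[p] for p in preds) / max(len(preds), 1)

In practice, such a checker would be trained on a large annotated corpus rather than the toy examples above; the point here is only the shape of the pipeline and of the per-dialog risk score it produces.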
Bias: Another issue with AI services is the presence of bias. Bias can result in unfair treatment of certain groups compared to others, which is undesirable and often illegal. There are many definitions of fairness, each one suitable for certain scenarios. (Github 2018) introduces notation and a few definitions of fairness, while (Henderson et al. 2017) discusses bias in dialog systems. Also, (Galhotra, Brun, and Meliou 2017) introduces a software testing framework for bias based on the notions of group and causal bias, and in (Bellamy et al. 2018) the authors describe a tool to explore fairness using a sample of data, methods, and criteria. In this paper, we use the bias checker implemented by (Hutto, Folds, and Appling 2015).

Information Leakage: This issue involves ensuring that information given by users to a chatbot is not released, even inadvertently, to other users of the same chatbot or the same platform. This is further complicated by the fact that, over time, a chatbot may get personalized to a user's needs, but the person may not want to share this personalized information with the developers. Moreover, shared information may spread further as other users interact (Kempe, Kleinberg, and Tardos 2003) over time. We use the framework discussed in (Henderson et al. 2017) to check this issue.

Conversation Style and Complexity: This issue has to do with making sure that AI services interact with users in the most useful and seamless way. If a chatbot responds to a user's questions with terminology that the user is not familiar with, she will not get the required information and will not be able to solve the problem at hand. In (Liao, Srivastava, and Kapanipathi 2017), the authors propose a measure of dialog complexity that characterizes how participants in a conversation use words to express themselves (utterances), switch roles and talk iteratively to create turns, and span the dialog. They measure the complexity of service dialogs at the levels of utterances, turns, and overall dialogs. The method takes into consideration the concentration of domain-specific terms as a reflection of user request specificity, as well as the structure of the dialogs as a reflection of the user's demand for (service) actions. We use their checker in our implementation.

We consider a setting where a dialog system is rated for its behavior on a configurable list of k issues, such as bias (B), abusive language (AL), conversation complexity (CC), and information leakage (IL). The system is conceptually illustrated in Figure 2. Its inputs are the issues to be considered, the details of the chatbot to be rated, a user profile, and the (query) datasets to use for the test; its output is a rating for the chatbot, conveying its level of trustworthiness for a specific user or user profile. We now describe the modules and steps of the proposed rating methodology.

1. Get individual ratings from issue checkers. We assume that we have one checker available for each issue, which can rate the behavior of the dialog system on that issue on a 3-level trust risk scale: [Low, Medium, High] (High meaning that the chatbot is not behaving well regarding that issue). For issues with raw scores in a continuous [0-1] range, we bin them into the 3-level scale. For k issues, we therefore get a list with k elements in [Low, Medium, High].
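As an illustration of step 1, here is a minimal sketch of the binning of raw scores into the 3-level scale. The thresholds of 1/3 and 2/3 are our own illustrative assumptions; the exact cutoffs are an implementation choice not fixed by the method.

    # Bin a raw checker score in [0, 1] into the 3-level trust risk scale.
    # The thresholds 1/3 and 2/3 are illustrative assumptions, not prescribed cutoffs.
    def bin_score(raw_score: float) -> str:
        if raw_score < 1 / 3:
            return "L"   # Low risk: chatbot behaves well on this issue
        elif raw_score < 2 / 3:
            return "M"   # Medium risk
        else:
            return "H"   # High risk: chatbot behaves poorly on this issue

    # Example: hypothetical raw scores for the k = 4 issues B, AL, CC, IL
    raw = {"B": 0.10, "AL": 0.45, "CC": 0.52, "IL": 0.90}
    levels = {issue: bin_score(s) for issue, s in raw.items()}
    # levels == {"B": "L", "AL": "M", "CC": "M", "IL": "H"}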
2. Elicit/learn users' importance orders. The second step involves aggregating the elements of this list into a single element from the same scale, in order to get a single rating for the trustworthiness of the chatbot. We propose to do that by asking users about the relative importance of the various issues. Preference elicitation can be done by asking individual users about the relative importance of the various issues (individual-level modeling), or by capturing the preferences of people as groups and validating them via surveys (profile-level modeling). Individual-level models are accurate but hard to build, manage (due to privacy considerations), and generalize. Profile-level models are representative of the people who identify with them and are easier to implement. Regardless of the granularity, we believe that trustworthiness is not an absolute property, but rather relative to each user (or user profile) of a chatbot. We build profile-level user models, validate these models using a survey, and test them on dialog datasets. New profiles can be added and existing ones updated based on survey responses to capture the preferences of the user base.

3. If elicitation is done at the user level, aggregate importance orders from similar users. We would like a single preference order, not many, so that we can combine the rating levels of the various issues according to this single order and therefore get a single trustworthiness rating for the chatbot. However, we do not want to aggregate over all users, but only over similar users, according to some notion of similarity. In this way, the rating will be personalized for each user group, which includes users that are similar to each other. The task is therefore to aggregate several ranked orders. To do this, one can use a voting rule, such as Plurality, Borda, Approval, or Copeland, as defined in voting theory (Levin and Nalebuff 1995). Since we worked with user profiles, we skipped this step.

4. Combine the collective importance order with the individual issue ratings. We now combine the single importance order obtained in step 3 with the ratings of the individual checkers on the issues, obtained in step 1. A very simple combination method could use the importance levels as weights for the individual ratings, and then take the level (among Low, Medium, and High) which appears the most. For example, if our 4 issues (B, AL, CC, IL) are rated respectively L, M, M, H, and their collective importance rank is 1 (highest) for B (written Imp(B)), 2 for AL, 3 for CC, and 4 for IL, we count L three times (since 4 − Imp(B) = 4 − 1 = 3), M three times (since (4 − Imp(AL)) + (4 − Imp(CC)) = 2 + 1 = 3), and H zero times (since 4 − Imp(IL) = 4 − 4 = 0). We also need a tie-breaking rule to choose among levels with the same score. For example, we could use an optimistic approach and choose the lowest level among those in a tie, or we could be pessimistic and choose the highest level. If we adopt a pessimistic approach, as we do in our implementation, in the above example we would select M (between L and M, which are in a tie) as the final rating for the chatbot trustworthiness.
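A minimal sketch of this combination rule with pessimistic tie-breaking follows; the function and variable names are ours, and the example simply reproduces the worked example above.

    # Combine per-issue ratings (L/M/H) with an importance order into one rating,
    # using weight = k - importance and pessimistic tie-breaking (higher risk wins).
    def aggregate_rating(issue_levels: dict, importance: dict) -> str:
        k = len(importance)
        counts = {"L": 0, "M": 0, "H": 0}
        for issue, level in issue_levels.items():
            counts[level] += k - importance[issue]
        best = max(counts.values())
        # Pessimistic tie-break: among tied levels, pick the highest risk.
        for level in ("H", "M", "L"):
            if counts[level] == best:
                return level

    # Worked example from the text: B=L, AL=M, CC=M, IL=H with importance B > AL > CC > IL
    levels = {"B": "L", "AL": "M", "CC": "M", "IL": "H"}
    importance = {"B": 1, "AL": 2, "CC": 3, "IL": 4}
    print(aggregate_rating(levels, importance))  # -> "M" (L and M tie at 3; H gets 0)

An optimistic variant would simply reverse the tie-breaking order, iterating over ("L", "M", "H") instead.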
5. Perform sensitivity testing. The overall chatbot rating, obtained via the 4-step procedure just outlined, could be sensitive to models, data, users, or any combination thereof. To take this into account, we propose to check whether the system has access to alternative learning models or training data to configure the chatbot, or to additional users. If so, each combination of them is used to rerun the procedure in order to obtain a new rating and check whether the rating varies.

The output can thus also assign a type of rating, conveying a: Trustworthy agent (Type-1), which starts out trusted with some score (L, M or H) and remains so even after considering all variants of models, data, and users; Model-sensitive trustworthy agent (Type-2), which can be swayed by the selection of a model to exhibit biased behavior while generating its responses; Data-sensitive trustworthy agent (Type-3), which can be swayed by changing training data to exhibit biased behavior; User-sensitive trustworthy agent (Type-4), which can be swayed by interaction with (human) users over time to exhibit biased behavior; or a Sensitive agent (Type-N), which can be swayed by a combination of factors.

Instantiating the method. While describing our methodology, the reader may have noticed that there are several dimensions along which we can make choices:
• the scale of the individual trust issue ratings (e.g., L, M, H);
• the elicitation or learning method used to collect the importance orders from the users;
• the granularity of user modeling and, for user-level modeling, the similarity measure that defines the user classes and the choice of the voting rule (e.g., Plurality, Borda);
• the final aggregation method (e.g., linear combination and tie-breaking rule).
Moreover, a quantitative approach for the importance orders could allow for higher precision in the final rating, and a less concise aggregation result may help in terms of explainability of the rating itself.

We illustrate our method with two common types of conversation systems, one for general chitchat and one that is task-oriented. We discuss their trust issues, apply our method to these systems, and discuss the output.

Eliza: This is a well-studied general conversation system created in the 1960s (Weizenbaum 1966; 1976) to model a patient's interaction with a Rogerian therapist. It uses cues from the user's input to generate a response using pre-canned rules, without deeper understanding of the text or the context of the conversation (Manifestation 2006). Since Eliza uses pattern recognition on the user's input, it can be easily manipulated via such text to become abusive (AL) and to exhibit bias (B). Since the chatbot uses the input text and scripted rules to create its response, it preserves the conversation style of the input, thus behaving well in terms of language complexity (CC). Finally, since it retains no context of a conversation, two users giving the same inputs will get the same responses, leading to no information leakage (IL). The output of the rating method for an Eliza implementation will be an aggregated trustworthiness score (L, M or H) and an explanation of how it was calculated from the raw issue scores. Since this chatbot can be configured with alternative users, the system can check the chatbot for rating sensitivity and include the result in the output.
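For readers unfamiliar with how such scripted responses are produced, here is a minimal, hypothetical sketch of Eliza-style pattern matching; the rules shown are ours for illustration and are not Weizenbaum's original script.

    import re

    # A tiny, hypothetical Eliza-style rule set: each rule pairs an input pattern
    # with a response template that reuses words captured from the user's input.
    RULES = [
        (re.compile(r"\bI need (.+)", re.I), "Why do you need {0}?"),
        (re.compile(r"\bI am (.+)", re.I), "How long have you been {0}?"),
        (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
    ]

    def eliza_respond(user_input: str) -> str:
        """Return a scripted response; the reply mirrors the user's own wording,
        which is why Eliza preserves the conversation style of the input."""
        for pattern, template in RULES:
            match = pattern.search(user_input)
            if match:
                return template.format(*match.groups())
        return "Please tell me more."

    print(eliza_respond("I am worried about my exams"))
    # -> "How long have you been worried about my exams?"

Because the response is built directly from the user's words, adversarial inputs can steer such a system toward abusive or biased output, which is exactly the AL and B concern noted above.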
Train Delay Assistant: This is a prototype chatbot meant to help travelers gather knowledge about train delays and their impact on travel in India (Mishra, Gaurav, and Srivastava 2018). The Indian train system, the fourth largest transport network in the world and carrying over 8 billion passengers per year, has endemic delays. Hence, it is important for users to be aware of such delays in order to better plan their trips. The chatbot allows users to gain temporal and journey insights for trains of interest at any time in the future. It detects the intent in a user's input to find the train, time, and stations of interest, estimates the delay using pre-learned models, and finally produces a response. Given the nature of the domain, this chatbot is expected not to use language that a user may consider inappropriate (AL). It is also expected to produce output that does not exhibit bias towards a protected variable like the gender of the user (B). The chatbot can exhibit a range of conversation styles for station names, train numbers, and times, which the user may perceive as simple or complex. For example, a train station can be referred to by its station code (e.g., HWH), its complete name (Howrah Junction), or even a colloquial name (Howrah). Similarly, a train can be referred to by its code (e.g., 12312) or its name (e.g., Kalka Mail), and time variants (e.g., exact minutes or coarser time units) create a further variety of choices. Information leakage (IL) is also an important consideration, since users may not want to reveal their travel plans, especially when they are using the delay information to make reservations on trains whose seats are in high demand. Just like for Eliza, the output of the rating method for the train chatbot will be an aggregated trustworthiness score (L, M or H) and an explanation of how it was calculated from the raw issue scores. Sensitivity analysis can be done by configuring the train chatbot with various learned models of train delays, various training data about trains, and various users.

In the previous section, we demonstrated our rating method on conversation systems. We now describe an implementation of our trust rating approach using trust issue checkers that are publicly available. We conduct rating experiments with this implementation and report the insights gained. To model users, we define user profiles of people that share a common ordering of issue importance and validate them with a user survey. For sensitivity testing, we test our chatbot rating approach over different user profiles. We have integrated our approach with a typical chatbot that retrieves information from a database matching the user's query. For the experiments, we use public dialog corpora as proxies for large chatbot conversations. We will show that the proposed approach can reveal issues with chatbots and help with their wider adoption.

We use four datasets spanning conversations in service domains where chatbots are deployed. Three are publicly available while one is proprietary.

Public - Ubuntu technical support (# = 3,318): This corpus is taken from the Ubuntu online support IRC channel, where users post questions about using Ubuntu. We obtained the original dataset from (Lowe et al. 2015) and selected 2 months of chatroom logs. We extracted 'helping sessions' from the log data, where one person posted a question and other user(s) provided help. The corpus contains both dyadic and multi-party dialogs.

Public - Insurance QA (# = 25,499): This corpus contains questions from insurance customers and answers provided by insurance professionals. The conversations are in strict Question-Answer (QA) format (with one turn). The corpus is publicly available (Feng et al. 2015).

Proprietary - Human Resource bot (# = 3,600): This corpus is collected from a company's internal deployment of an HR bot, a virtual assistant on an instant messenger tool that provides support for new hires. Although the bot does not engage in continuous conversations (i.e., carrying memory), it is designed to carry out more natural interactions beyond question-and-answer.
For example, it can actively engage users in some social small talk.

Public - Restaurant reservation support (# = 2,118): This corpus contains conversations between human users and a simulated automated agent that helps users find restaurants and make reservations. The corpus was released for the Dialog State Tracking Challenge 2 (Henderson, Thomson, and Williams 2014).

For users, instead of collecting importance-level orderings for issues from individuals and then aggregating them, we consider user profiles, which represent rankings of issues for typical users. To define the profiles, we proposed issue rankings for each profile and then validated them via a crowd-sourcing approach. The profiles we considered are:

Conversation style oriented users (P_CU): They represent users experienced in people-to-people conversation, but less so with chatbots or with English, like seniors or non-native English speakers, for whom we presume that conversation style is important. The importance-level ordering is defined as (high to low): CC, AL, B, IL.

Fairness-oriented users (P_FU): As the name suggests, they represent users concerned mostly about the equal treatment of people. We define their issue ranking as: B, CC, AL, IL.

Privacy-oriented users (P_PU): They represent users predominantly concerned with information leakage. We define their issue ranking as: IL, AL, B, CC.

Abusive language oriented users (P_AU): They represent users who have limited experience with conversations and are vulnerable, like children, and for whom abusive language and conversation style are important for adopting technology. We define their issue ranking as: AL, CC, B, IL.

We validated the 4 user profiles described above by surveying 20 people, of whom 5 are chatbot/NLP researchers, 2 are regular chatbot users, 12 are casual chatbot users, and 1 is an NLP practitioner (as declared by them). We asked each person to write their importance order over the 4 issues, to validate the 4 profiles (by confirming the proposed order or by writing a counter-proposal), and to tell us about possible additional issues or profiles to be considered. For all four profiles, we then combined the results from the respondents using the Borda count voting method (sketched below). The results aligned with our proposed orders for each of the user profiles, thus validating our assumption. The percentage of people who agreed with the proposed orders is shown in the accompanying chart. Additional profiles that were mentioned are technology-savvy young people, online shoppers, and non-native English speakers. The preference orders for these profiles were already captured by the above four profiles, so no new profiles were created. Many respondents also suggested considering chatbot accuracy and usefulness as additional trust issues. One can extend this work by conducting more extensive surveys based on the above insights.
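A minimal sketch of the Borda-count aggregation used for the profile validation follows; the example ballots are hypothetical survey responses, not our actual data.

    # Combine several respondents' importance orderings over the 4 issues with the
    # Borda count: with k candidates, a 1st place earns k-1 points, 2nd earns k-2, etc.
    # The three example ballots below are hypothetical, for illustration only.
    from collections import defaultdict

    def borda(ballots):
        """ballots: list of rankings (most to least important). Returns the issues
        sorted by total Borda score, highest first."""
        k = len(ballots[0])
        scores = defaultdict(int)
        for ranking in ballots:
            for position, issue in enumerate(ranking):
                scores[issue] += k - 1 - position
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical responses from three fairness-oriented respondents
    ballots = [
        ["B", "CC", "AL", "IL"],
        ["B", "AL", "CC", "IL"],
        ["CC", "B", "AL", "IL"],
    ]
    print(borda(ballots))  # -> ['B', 'CC', 'AL', 'IL'], matching the proposed P_FU order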
For bias detection, we used the sentence-level bias detection framework discussed in (Hutto, Folds, and Appling 2015). In this implementation, given a sentence as input, the bias checker extracts structural and linguistic features, such as sentiment, subjectivity, modality, the use of factive verbs, and hedge phrases, and computes the perceived bias based on a regression model trained on these features. The model was trained on news data, and its output is on a scale of 0 to 3, where 0 denotes unbiased and 3 denotes extremely biased text. We scaled the output to the range 0 to 1 to conform with the scale of the outputs from the other checkers. Table 1 illustrates this and Table 2 (left) reports the bias scores of the datasets in aggregate. We see that the scores are low, but the datasets are not free of bias.

For the detection of abusive language, we used the method proposed in (Davidson et al. 2017). The checker gives a 3-value output (Hate Speech, Offensive Language, and Neither), whose values are summed with weights to arrive at the final score. Table 1 illustrates this and Table 2 (middle) shows the distribution of the scores for each dataset. Note that no hate speech was found in the Restaurant corpus.

For information leakage, we use the privacy checker framework discussed in (Henderson et al. 2017). We augment the data with 10 input-output pairs (keypairs) that represent sensitive data, which the model should keep secret. We then train a simple seq2seq dialogue model (Vinyals and Le 2015) on the data and measure the number of epochs at which the model achieves more than 0.5 accuracy in eliciting the secret information. We tested two cases: (1) when both the input and the output contained sensitive information, and (2) when the output contained sensitive information and the input contained the datatype of the sensitive information. The model achieved similar results in both cases; we used case (1) for the prototype implementation. We mapped the number of epochs to scores as follows: 0 to 15 epochs to 0, 15 to 30 to 0.5, and above 30 to 1. For the Ubuntu dataset we could not run the experiment, as it involves multi-party communication and major assumptions would be needed to form input-output pairs. For that data, we adopted a pessimistic approach and took the privacy issue rating to be 0.5.

For dialog complexity, we use the complexity checker implemented by (Liao, Srivastava, and Kapanipathi 2017). Table 1 illustrates this and Table 2 shows the complexity scores. For each corpus and profile, the raw scores for the four trust issues are aggregated according to the user profile's importance order. Table 3 shows the final result.

From Table 2, we see that the four considered datasets are not biased (L) or abusive (L), but can be conversationally complex and can leak information (they have M or H values). From Table 3, we see that the issue ratings for a dialog corpus vary with the user profile. Profiles that consider fairness and abuse as most important see no difference in ratings (P_FU and P_AU). For conversation-oriented users (P_CU), conversation complexity is important, and the insurance and restaurant domains have an M (medium) rating for them. For privacy-oriented users (P_PU), the insurance domain gives the least cause for concern, while HR and Restaurant can be problematic, since they get H ratings. Since the overall ratings change with user profiles, the datasets, as proxies for the corresponding chatbots, show that the agents are User-sensitive trustworthy (Type-4).

We have integrated our implementation with a chatbot that recommends hospitals given a user's query about medical services and location, using open data. Figure 3 shows a snapshot of the interaction. The chatbot also exposes a REST interface, through which we have integrated our prototype rating implementation. The rating user, using a command-line interface, can input their preferences over issues or select a user profile.

Figure 3: User interface of a chatbot for exploring hospitals. Our approach is integrated as a command-line utility using its REST interface.
As the conversation between the rating user and the chatbot progresses, issue checkers can compute partial results and generate aggregated ratings over utterances (AL, CC), and they also provide final ratings at the end of a dialog or session (AL, CC, B, IL). Figure 4 shows a snapshot of an interaction for the AL issue. If the rating user does not want to review ratings per conversation, they can instead select a data generator for specific issues. For AL, we use the labeled abuse data from (Davidson et al. 2017) in the data generator; we do not condone offensive language, which is used here only for illustration.

The brittleness of machine learning models is well known, and chatbots represent a specific usage of such models. For NLP models, (Ribeiro, Singh, and Guestrin 2018) presented a method to generate semantically equivalent test cases that can flip the predictions of learning models. This is a cause of concern for application developers for usability reasons, and addressing it can prevent the system from being exploited by an adversary. In the context of chatbots, the concern is that the output can be manipulated by changing the input, and this is what the sensitivity test can detect. The authors in (Henderson et al. 2017) systematically survey a number of potential ethical issues in dialogue systems built using learning methods. However, they do not consider a method to communicate a trustworthiness rating based on the analysis of such issues.

There is a rich body of work studying issues influencing online services and AI methods. In information spreading, the seminal work is described in (Kempe, Kleinberg, and Tardos 2003), where the authors looked at the spread of information in social networks and how to maximize it by engaging the most effective influencer nodes. In studying abusive language online, the authors in (Wang et al. 2014) explore the prevalence of cursing on Twitter, which serves as a platform for utterances and conversation. They found that people curse more online than in physical environments, more among the same gender, when they are angry or sad, increasingly as their activities rise during the day, and when in relaxed or formal environments. But users may not want the chatbots they are interacting with to exhibit the same behavior, especially when the users are children. The closest prior work is on rating AI services (Srivastava and Rossi 2018). There, the authors propose a 2-step bias rating procedure for invocable one-shot AI services (such as a translation service), as well as a composition method to build sequences of such services. However, that work does not consider: (a) multiple issues and users, (b) personalized ratings based on users' rankings of issues, (c) the dialog setting of a series of interactions, rather than just a single invocation, and (d) conversations leading to the completion of tasks.

We considered the problem of rating chatbots for trustworthiness based on their behavior regarding ethical issues and user-provided trust issue rankings. We defined a general approach to build such a rating system and implemented a prototype using four issues (abusive language, bias, information leakage, and conversation style). We illustrated it with two chatbot examples and experimented with four dialog datasets. We built user profiles to elicit user preferences about important trust issues and validated them with surveys. The experiments show that the rating approach can reveal insights about chatbots customized to users' trust needs.
We believe that this work is a stepping stone towards general, modular, and flexible trust rating approaches for conversation systems. It is only by building justified trust that users, developers, and data providers can benefit from, and contribute to, the use of chatbots for improved and more informed decisions.

References
Accenture. 2016. Chatbots in Customer Service.
Bellamy et al. 2018. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias.
Caliskan, Bryson, and Narayanan. 2017. Semantics Derived Automatically from Language Corpora Contain Human-like Biases.
Clark, Fox, and Lappin. 2010. Handbook of Computational Linguistics and Natural Language Processing.
Crook. 2018. Statistical Machine Learning for Dialog Management: Its History and Future Promise.
Davidson et al. 2017. Automated Hate Speech Detection and the Problem of Offensive Language.
Feng et al. 2015. Applying Deep Learning to Answer Selection: A Study and an Open Task.
Galhotra, Brun, and Meliou. 2017. Fairness Testing: Testing Software for Discrimination.
Github. 2018. Fairness: Notation, Definitions, Data.
Henderson et al. 2017. Ethical Challenges in Data-Driven Dialogue Systems.
Hutto, Folds, and Appling. 2015. Computationally Detecting and Quantifying the Degree of Bias in Sentence-Level Text of News Stories.
Inouye. 2004. Minimizing the Length of Non-Mixed Initiative Dialogs. Association for Computational Linguistics.
Kempe, Kleinberg, and Tardos. 2003. Maximizing the Spread of Influence through a Social Network.
Levin and Nalebuff. 1995. An Introduction to Vote-Counting Schemes.
Liao, Srivastava, and Kapanipathi. 2017. A Measure for Dialog Complexity and Its Application in Streamlining Service Operations.
Lowe et al. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems.
McTear, Callejas, and Griol. 2016. Conversational Interfaces: Past and Present.
Mishra, Gaurav, and Srivastava. 2018. A Train Status Assistant for Indian Railways.
Neff and Nagy. 2017. Automation, Algorithms, and Politics. Talking to Bots: Symbiotic Agency and the Case of Tay.
Ribeiro, Singh, and Guestrin. 2018. Semantically Equivalent Adversarial Rules for Debugging NLP Models.
Srivastava and Rossi. 2018. Towards Composable Bias Rating of AI Systems. In AAAI/ACM Conference on AI, Ethics, and Society (AIES 2018).
Wang et al. 2014. Cursing in English on Twitter.
Weizenbaum. 1966. ELIZA: A Computer Program for the Study of Natural Language Communication between Man and Machine.
Weizenbaum. 1976. Computer Power and Human Reason: From Judgment to Calculation.
Young et al. 2013. POMDP-Based Statistical Spoken Dialog Systems: A Review.