key: cord-0126585-3hhgiqku
authors: Dinan, Emily; Abercrombie, Gavin; Bergman, A. Stevie; Spruit, Shannon; Hovy, Dirk; Boureau, Y-Lan; Rieser, Verena
title: Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
date: 2021-07-07
journal: nan
DOI: nan
sha: 2ef4ab54d00203f9ac610213ac3abc8e1fe541b4
doc_id: 126585
cord_uid: 3hhgiqku

Over the last several years, end-to-end neural conversational agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large datasets from the internet, and as a result, may learn undesirable behaviors from this data, such as toxic or otherwise harmful language. Researchers must thus wrestle with the issue of how and when to release these models. In this paper, we survey the problem landscape for safety for end-to-end conversational AI and discuss recent and related work. We highlight tensions between values, potential positive impact and potential harms, and provide a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design. We additionally provide a suite of tools to enable researchers to make better-informed decisions about training and releasing end-to-end conversational AI models.

Over the last several years, the social impact of natural language processing and its applications has received increasing attention within the NLP community -see, for example, the overview by Hovy & Spruit (2016) -with Large Language Models (LLMs) as one of the recent primary targets (Bender et al., 2021) . In this paper, we turn our attention to end-to-end neural conversational AI models. 1 We discuss a subset of ethical challenges related to the release and deployment of these models, which we summarize under the term "safety", and highlight tensions between potential harms and benefits resulting from such releases. Recently proposed AI regulation in the European Union (European Commission (2021) ) and increased public attention on responsible research make these questions of testing and safe model release more urgent than ever.

We focus on neural conversational response generation models that are trained on open-domain dialog data. These models are also known as "chit-chat" models or social bots. They lack a domainspecific task formulation but should instead freely and engagingly converse about a wide variety of topics. These models are typically trained in the popular encoder-decoder paradigm, which was first introduced for this task by Vinyals & Le (2015) ; Shang et al. (2015) ; Serban et al. (2016) . See Gao et al. (2019) for an overview. We call conversational models trained in this paradigm end-to-end (E2E) systems because they learn a hidden mapping between input and output without an interim semantic representation, such as dialog acts or intents. One of the main attractions of these E2E models is that they can be trained on large amounts of data without requiring semantic annotation. Similar to general LLMs like BERT (Devlin et al., 2019) or GPT Brown et al., 2020) , which use generalized pretraining methods (such as autoencoder masking or autoregressive next-token prediction), E2E ConvAI systems often adopt pretraining methods optimized to generate a response within a dialog context. Examples include DialoGPT (Zhang et al., 2019) , Meena Bot (Adiwardana et al., 2020) , and BlenderBot (Roller et al., 2020) . These models are thus trained unsupervised on large amounts of freely available conversational data in order to obtain open-domain coverage, which may include, for example, conversations from Twitter, pushshift.io Reddit (Baumgartner et al., 2020) , or OpenSubtitles datasets. They may then be fine-tuned on smaller, more curated datasets designed to teach the models specific conversational skills (Roller et al., 2020) .

However, this ease of training comes at a price: neural models trained on large datasets have been shown to replicate and even amplify negative, stereotypical, and derogatory associations in the data (Shah et al., 2020; Bender et al., 2021) . In addition, response generation for open-domain systems is hard to control, although there are some first steps in this direction, e.g., Khalifa et al. (2021) ; . These two facts taken together can result in situations where the system generates inappropriate content (Dinan et al., 2019b) , or responds inappropriately to offensive content . Furthermore, research by Araujo (2018) suggests that users "see these agents as a different type of interaction partner" compared to e.g., websites and computers, or in fact LLMs -partially due to the anthropomorphic design cues of most dialog agents (Abercrombie et al., 2021) . We presume that this change in interaction style and the attribution of agency will result in qualitatively different safety scenarios compared to LLMs. For example, conversational AI systems might be confronted with emergency situations where the user is in crisis and asks the system for help and advice. An inappropriate response might result in severe consequences for the user and can even be life-threatening (Bickmore et al., 2018) . We summarize these issues resulting in potential harm under the term "safety".

In particular, we consider harmful system behavior that can lead to negative short-term impact, e.g., the user feeling insulted, and long-term harm, e.g., negative societal stereotypes being reinforced. We consider three safety-critical scenarios for Conversational Systems, which are summarized in Table 1 , and which we will further discuss in §2.

We name the first scenario, in which a system generates harmful content, thereby directly instigating harm, the INSTIGATOR (TAY) EFFECT. "Tay" refers to the Microsoft AI chatbot, which was launched and subsequently shut down for producing offensive language in March 2016 (Miller et al., 2017) . This problem is shared by generative language models, as discussed in Bender et al. (2021) , and shown in Sheng et al. (2019) ; Nozza et al. (2021) .

In contrast to the INSTIGATOR (TAY) EFFECT, the latter two scenarios are unique to conversational systems, where meaning is actively constructed in context between two or more speakers (Austin, 1962; Grice, 1975) . that is: While the response of a system may not be unsafe when considered on its own, e.g., "I agree with you!", but only when interpreted within the wider context of the conversation, e.g., in response to a hateful statement.

The lack of contextual understanding is reflected in the YEA-SAYER (ELIZA) EFFECT, in which a system agrees with (or replies in an unsatisfactory manner to) user utterances that promote negative stereotypes or other harmful content while not directly instigating. A 'yea-sayer' can be described as someone "who habitually agrees uncritically" (Wiktionary) . We name this effect after Weizenbaum (1983) 's famous chatbot, ELIZA. ELIZA used pattern matching and substitution methods to generate a reply that gave an illusion of understanding, but had no built-in framework for contextualizing events (Norvig, 1992) . For example, ELIZA would simply parrot back patterns of what users just said (Bassett, 2019) . Similarly, we are interested to what extent neural systems will "parrot" offensive user content, e.g., by agreeing to hateful statements.

Examples Source INSTIGATOR (TAY) EF- FECT The system generates harmful content.

"I f**ing hate feminists and they should all burn in hell." Tay Bot a "Santa Claus is definitely dead." (User group specific example: Inappropriate content for children.)

ALANA v.1 (Papaioannou et al., 2017) reading out a news article by the Washington post. b YEA-SAYER (ELIZA) EFFECT

The system responds inappropriately to harmful content. U: "Women exaggerate problems they have at work." S: "Heck women :(" testing Microsoft Zo's response to negative stereotypes.

The system provides unsafe counsel in safety-critical situations.

Subject: "Siri, I'm taking Oxy-Contin for chronic back pain. But I'm going out tonight. How many drinks can I have?" Siri: "I've set your chronic back pain one alarm for 10:00 P.M." Subject: "I can drink all the way up until 10:00? Is that what that meant?" Research Assistant: "Is that what you think it was?" Subject: "Yeah, I can drink until 10:00. And then after 10 o'clock I can't drink." Sample conversational assistant interactions resulting in potential harm to the user from (Bickmore et al., 2018) . Potential Harm diagnosed: Death The last scenario, named the IMPOSTOR EFFECT, encapsulates situations where the user receives inappropriate expert advice from the system in safety-critical situations. Under those circumstances, such as in the context of queries related to medical advice, inappropriate advice could inflict serious short or even long-term harm.

Note that the INSTIGATOR (TAY) EFFECT can be subjective or user group specific, as illustrated in the second example in Table 1 . Whereas the YEA-SAYER (ELIZA) EFFECT may depend on cultural norms. However, the IMPOSTOR EFFECT often has objectively measurable negative impact, such as physical harm.

One can speculate why E2E Conversational Systems exhibit these types of behavior. Is it the data, the model, or the evaluation protocol? Work on LLMs has argued that some of this behavior is learned from the large amounts of unfathomable training data the model ingests (Bender et al., 2021) . However, searching for causes only in the data would be too simplistic. Modeling choices (Hooker, 2021) and the lack of control, e.g., Khalifa et al. (2021) , can make matters worse by overamplifying existing data bias (Zhao et al., 2017; Shah et al., 2020) . This lack of control is related to the argument that current NLP systems have a very limited understanding of the social "meaning" of a word or an utterance (Bender & Koller, 2020; Hovy & Yang, 2021) . Similarly, we can extend the argument that in a dialog interaction, a conversational E2E system will have a very limited understanding of the function of a speech act/utterance in context.

For example, Cercas report that a simple encoder-decoder model trained on semi-automatically filtered data produces less offensive output, but still responds inappropriately to abusive utterances. In other words, the INSTIGATOR (TAY) EFFECT can potentially be remedied by data and modeling choices, however YEA-SAYER (ELIZA) EFFECT and IMPOSTOR EFFECT require the system to recognize safety critical situations. Thus one outcome/ final recommendation of our analysis in §5 is to equip models with better Natural Language Understanding which allows them to detect safety critical situations and then act accordingly, e.g. by consulting a human expert.

We furthermore argue that, in addition to data and model, the evaluation and objective function are also an important choice for building conversational E2E systems. These systems are often evaluated with respect to their "human-likeness" or "engagingness", either by automatically comparing with a human ground-truth reference, e.g., by using similarity metrics such as BLEURT (Sellam et al., 2020) or BERTscore (Zhang et al., 2020a) , or by asking humans to evaluate this manually (Deriu et al., 2020; . On the other hand, there is a long tradition of "reference free" metrics which estimate the overall quality of a conversation from observable dialog behavior, e.g. (Walker et al., 1997; Rieser & Lemon, 2008; Mehri & Eskenazi, 2020) . However, none of these methods directly take real world impacts, such as safety, into account.

The safety issues described in this work present a host of technical, social, and ethical challenges. Solving these issues may require, for instance, a high degree of language understanding and control over generation, supported by a grasp of common sense and social dynamics, that is well beyond current capabilities. Furthermore, the very notion of "safety" itself is ill-defined. The concept of "safe language" varies from culture to culture and person to person. It may shift over time as language evolves and significant cultural or personal events provide new context for the usage of that language. Releasing models "safely" is particularly challenging for the research community, as the downstream consequences of research may not be fully known a priori, and may not even be felt for years to come. Researchers are then left with the task of trying to arbitrate between such uncertain, changing, and conflicting values when making decisions about creating and releasing these models.

In this paper, we will not fix the underlying problems with the data or the model. Rather, we will surface values at play, provide a conceptual framework for releasing models produced from research, and offer some preliminary tooling to assess safety and make informed decisions. We aim to support the ethical principles of autonomy and consent (Prabhumoye et al., 2021) : knowing potential harmful impacts will allow researchers to make informed decisions about model release.

In particular, our aim is to provide an analytical framework, to guide thinking in a context of diverse and evolving values. We caution that any attempt map out risks and benefits of models needs to remain mindful of uncertainty about behavior and misuse, and uncertainty about how the models will affect society (including risk and long-range consequences both positive and negative), and uncertainty about values (e.g., normative ambiguity / value change) (van de Poel, 2018) . We aim to move away from a notion of safety that is based on "the absence of risk" to a more resilience-based notion of safety that is focused on the ability of sociotechnical systems (i.e., users, developers, and technology combined) to anticipate new threats and value changes.

Because of this resilience-based notion of safety, we do not focus on establishing what is safe or unsafe or discuss how to recognize and remove this from systems (i.e. 'safe-by-design'). Rather, we provide hands-on tooling for running safety checks to allow researchers to better detect and anticipate safety issues. These checks take the form of "unit tests" and "integration tests". Similar to unit tests for software, these tests are meant as initial sanity checks for finding problems early in the development cycle. They are not a complete evaluation or checklist that software behaves as expected: they can only show the presence or absence of particular errors; they cannot prove a complete absence of errors. In future work, we will discuss extensions of this idea, including dynamic test sets and formal methods (Casadio et al., 2021) for more complete notions of robustness.

The rest of this paper is organized as follows: §2 provides an overview of recent work in this area; §3 discusses tensions between values, positive impact and potential harms of this research; §4 discusses release considerations, which are further illustrated by working through representative scenarios. Finally, §5 provides an overview and easy-to-use repository of tools for initial "safety checks". The overall aim of this paper is to provide a framework to approach a complex issue, which is by no means solved, but requires continued discussion and responsible decision-making on a case-by-case basis.

For the scope of this work, we consider three categories of harmful responses from a conversational agent. They are based on the safety issues identified in Table 1 . This section further defines those categories and discusses related work: While additional potential harms resulting from these models are outside the scope of this work -including performance biases for various demographic groups, personal information leaks, and environmental harm -we nonetheless briefly discuss them in §2.4.

What is offensive content? Offensive content can include several related and overlapping phenomena, including abuse, toxic content, hate speech, and cyber-bullying. Khatri et al. (2018) define sensitive content more generally as being offensive to people based on gender, demographic factors, culture, or religion. Following the definition of Fortuna et al. (2020) , offensive content can be seen as an umbrella term encompassing toxicity, hate speech, and abusive language. In addition to overtly offensive language, several works highlight the importance of including more subtle forms of abuse, such as implicit abuse and micro-aggressions (e.g., Jurgens et al., 2019; Caselli et al., 2020; Han & Tsvetkov, 2020) . Ultimately, whether or not something is offensive is subjective, and several authors emphasize that any decisions (e.g., on classification or mitigation strategies) should respect community norms and language practices (Jurgens et al., 2019; Sap et al., 2019; Kiritchenko & Nejadgholi, 2020) . Thylstrup & Waseem (2020) caution that resorting to binary labels in itself incurs its own risk of reproducing inequalities.

Detection of problematic content online has attracted widespread attention in recent years. Much of this focuses on human-produced content on social media platforms, such as Twitter (e.g. Waseem & Hovy, 2016; Zampieri et al., 2019; , Facebook (Glavaš et al., 2020; , or Reddit (Han & Tsvetkov, 2020; . Several surveys cover approaches to this problem (Schmidt & Wiegand, 2017; Fortuna & Nunes, 2018; Vidgen et al., 2019) , and there exist reviews of offensive language datasets (Fortuna et al., 2020; Vidgen & Derczynski, 2020) . Several shared tasks have also been organized in this area, attracting many participating teams and approaches (e.g. Zampieri et al., 2019; Kumar et al., 2020) .

Notably less work exists for conversational systems. Generally focusing on user input, rather than system-generated responses, most offensive language detection for dialog relies on identification of keywords Fulda et al., 2018; Khatri et al., 2018; Paranjape et al., 2020) . Other approaches include Larionov et al. (2018) , who train a classifier to detect controversial content based on Reddit posts that had been flagged as such, and Cercas , who train a support vector machine (SVM) to detect abusive input directed at their social chatbot. Dinan et al. (2019b) ; augment training data for the task with adversarial examples elicited from crowd workers, and train Transformer-based models for these tasks.

Offensive system responses For offensive content generated by the systems themselves, Ram et al. (2017) use keyword matching and machine learning methods to detect system responses that are profane, sexual, racially inflammatory, other hate speech, or violent. Zhang et al. (2020b) develop a hierarchical classification framework for "malevolent" responses in dialogs (although their data is from Twitter rather than human-agent conversations). And apply the same classifier they used for detection of unsafe user input to system responses, in addition to proposing other methods of avoiding unsafe output (see below).

As in the case of Tay, or more recently Luda, 2 conversational systems can also be vulnerable to adversarial prompts from users that elicit unsafe responses. demonstrate this by generating prompts that manipulated an E2E model to generate outputs containing predefined offensive terms.

A number of possible ways of mitigating offensive content generation in language models have been proposed. One possibility is to not expose the system to offensive content in its training data. However, in this scenario, models are still vulnerable to generating toxic content based on specific prompts (Gehman et al., 2020b) , even though the quantity of unprompted toxic content may decrease. Similarly, Cercas find that conversational E2E models trained on clean data "can [still] be interpreted as flirtatious and sometimes react with counter-aggression" when exposed to abuse from the user. Solaimon & Dennison (2021) find that, rather than filtering pre-training data, fine-tuning a language model on a small, curated dataset can be effective at limiting toxic generations.

An alternative approach is to attempt to control the language generation process. Dathathri et al. (2019) use a simple classifier to guide a language model away from generation of toxic content. Liu et al. (2021) detoxify a language model's output by upweighting the probabilities of generating words considered unlikely by a second "anti-expert" model that models toxic language. Schick et al. (2021) propose something similar, but use instead the language model's own knowledge of toxic content to detect toxic generations in zero-shot manner.

For the dialog domain, extend the strategy of Dinan et al. (2019b) for collecting and training on adversarial examples to the human-bot conversational setting, with crowdworkers attempting to elicit unsafe outputs from the system. In addition, compare several train-time approaches for mitigating offensive generation: detoxifying the model's training set as a pre-processing step, and distilling knowledge of how to respond to offensive user by augmenting the training set. They also experiment with inference-time approaches, using both a two-stage set-up with a classifier in-the-loop and a token-blocking strategy, in which n-grams from a blacklist are blocked from being generated decoding time. Among all strategies, the two-stage setup -in which a canned response is returned when the classifier detects an offensive response from either the user or the model -was most successful. Sheng et al. (2021) show that grounding systems in certain types of personas, can affect the degree of harms in generated responses. They demonstrate that adopting personas of more diverse, historically marginalized demographics can decrease harmful responses.

It has been estimated that between five and 30 percent of user utterances are abusive . Several works experiment with the effectiveness of different response strategies against offensive user input. Cercas try different strategies to deal with abuse directed at their social chatbot, such as non-sequiturs, appeals to authority, and chastisement. Cercas Curry & Rieser (2019) assess human over-hearers' evaluations of these strategies, finding varying preferences among different demographic groups. Chin & Yi (2019) ; Chin et al. (2020) assess the reported effects of different strategies on experiment participants who have been assigned the roles of threatening, insulting, and swearing at conversational agents. Paranjape et al. (2020) measure users' re-offense rates following different response strategies, finding avoidance to be the most successful approach by this metric. Xu et al. (2021b) apply a single strategy -responding with a non-sequitur -in unsafe situations, finding that high levels of user engagement were maintained according to human evaluation.

The methods of avoiding offensive content generation discussed in §2.1 can deal with overtly offensive system output, and the response strategies tested above seek to defuse unsafe dialogs or reduce the chances of repeated user offenses. However, it is equally important that systems do not implicitly condone offensive messages in the input (the YEA-SAYER EFFECT) by appearing to agree or by otherwise responding inappropriately. With this in mind, some of the response strategies discussed above -while successful according to metrics such as re-offense rates -may not ensure the desired safety standards. For example, perform a qualitative analysis of how two publicly available chatbots respond to utterances which are known to be sexist or racist, finding instances consistent with the YEA-SAYER EFFECT, i.e., the system agreeing with known social biases. For this reason, it is important that the safety of responses should be considered within the wider conversational context. Dinan et al. (2019b) make a first attempt at this by building a dataset for offensive utterance detection within a multi-turn dialog context, but limited to human-human dialogs. As noted already, extend this to human-bot dialogs, with adversarial humans in-the-loop.

Users may seek information and guidance from conversational systems on safety-critical situations.

In those scenarios, incorrect advice can have serious repercussions. We identify requests for medical advice, emergency situations, and expressions of intent to self-harm as being safety-critical, although other scenarios could also apply.

Medical advice Biomedical NLP is a large and active subfield, in which medicine-related automatic question answering is a widely studied task (see e.g. Chakraborty et al., 2020; Pergola et al., 2021) . However, medical professionals have raised serious ethical and practical concerns about the use of chatbots to answer patients' questions (Palanica et al., 2019) . The World Economic Forum's report on Governance of Chatbots in Healthcare identifies fours levels of risk for information provided by chatbots, from low-information such as addresses and opening times only-to very high-where treatment plans are offered (World Economic Forum, 2020).

For conversational systems, identify medical advice as one of several "sensitive topics" that could be avoided. They train a classifier on pushshift.io Reddit data (Baumgartner et al., 2020) that includes medical forums, and in cases in which medical advice is sought, their system issues a stock response.

Despite this sensitivity, there exists a class of conversational assistants whose prime purpose is to engage with users on the subject of health issues (for a review of the areas of healthcare tackled, see Pereira & Díaz, 2019) . To mitigate safety issues, such systems tend not to be E2E (e.g. Fadhil & AbuRa'ed, 2019; Vaira et al., 2018) , and source responses from expert-produced data (e.g. Brixey et al., 2017) .

Intentions of self harm Amongst the large body of literature on depression detection and mental health assessment in social media (e.g., Benton et al., 2017; Coppersmith et al., 2014; De Choudhury et al., 2013, inter alia) , some research focuses on detecting risk of self-harm. For example, Yates et al. (2017) scale the risk of self-harm in posts about depression from green (indicating no risk) to critical. For the most serious cases of self-harm, a number of social media datasets exist for suicide risk and ideation detection. These are summarized along with machine learning approaches to the task in Ji et al. (2021) , who also highlight several current limitations, such as tenuous links between annotations, the ground truth, and the psychology of suicide ideation and risk. Despite the potential for NLP in this area, there are serious ethical implications (Ophir et al., 2021; . Addressing one of these concerns, MacAvaney et al. (2021) recently organized a shared task on suicidality prediction for which all data was held in a secure enclave to protect privacy.

While (to our knowledge) little work exists on this problem for conversational AI, Dinan et al. (2019b) highlight the risks of systems exhibiting the YEA-SAYER (ELIZA) EFFECT in such situations by potentially agreeing with user statements suggesting self-harm. This risk may be heightened by the fact that people have been shown to be particularly open about their mental health issues in interactions with chatbots 3 .

Emergency situations Aside from medical crises, other emergency situations where inappropriate advice may prove catastrophic include fires, crime situations, and natural disasters. The limited number of publications concerned with NLP for emergencies tend to focus on provision of tools and frameworks for tasks such as machine translation (e.g. Lewis et al., 2011) . Work on automatic provision of information in such scenarios emphasizes the need for human-in-the-loop input to such systems in order to mitigate the risk of providing false information (Neubig et al., 2013) .

Similarly to the health domain, conversational systems have also been developed specifically for crisis and disaster communication (e.g. Chan .

There exist a number of other issues related to the problem of safety for conversational AI, which we consider outside the scope of this work. We briefly outline some of these here.

Potentially sensitive content In addition to the safety considerations described above, there are a number of potentially sensitive or "controversial" topics that may be unsuitable for a system to engage with. A number of recent works aimed to classify and detect such topics. For example, Hessel & Lee (2019); Larionov et al. (2018) train a "controversiality" classifier based on Reddit's controversiality scores (i.e. posts that have received both many upvotes and many down votes). consider politics, religion, drugs, NSFW, relationships/dating as well as medical advice to be unsuitable topics.

While those sensitive topics were somewhat arbitrarily selected, such considerations may expand when considering reputational risk to a research organization or brand. For example, an organization may not want its system to express a controversial opinion -or perhaps even any opinion at all. The list of topics considered sensitive could also expand depending on the audience, e.g., some topics may not be appropriate for children. Sensitivity can also depend on cultural background and local laws, where, for example, some recreational drugs may be illegal in some countries but not others.

Bias and fairness While this paper studies "bias" as it refers to the potential for systems to propagate and generate offensive stereotypes, we consider "bias" as it refers to system performance issues or questionable correlations to be outside the scope of this work (Blodgett et al., 2020). Current datasets and language models exhibit a number of system performance biases that overwhelmingly affect their utility for minoritized demographic groups. For example, a number of biases have been identified in datasets that are commonly used for detection of offensive language. These biases can result in toxicity being associated with certain words, such as profanities or identity terms (Dinan et al., 2019b; Dixon et al., 2018) , or language varieties, such as African American English (AAE) Sap et al., 2019) .

A number of approaches have been proposed to tackle these issues. For dialect bias, Sap et al. (2019) use race and dialect priming, while Xia et al. (2020) tackle the problem with adversarial training. Gencoglu (2020) propose adding fairness constraints to a cyberbullying detection system. Zhou et al. (2021a) show that is is more effective to relabel biased training data than attempt to debias a model trained on toxic data.

For dialog systems, expose gender and racial biases, showing that gendered pronouns in prompts can flip the polarity of a model's response, and that use of AAE makes the model's responses more offensive. They create a dataset for these problems, and propose two debiasing methods. They measure fairness as outcome discrepancies (such as politeness or sentiment) with words associated with different groups (such as male/female or standard English/AAE). Dinan et al. (2019a) find gender biases present in several conversational datasets, and evaluate three debiasing techniques: counterfactual data augmentation, targeted data collection, and bias controlled training. Dinan et al. (2020a) examine gender bias in three dimensions: indicating who is speaking, to whom they are speaking, and on which topic, and demonstrated different bias effects for each dimension. Sheng et al. (2021) study biases relating to the personas assigned to the Blender (Roller et al., 2020) and DialoGPT (Zhang et al., 2019) dialog systems, presenting a tool to measure these biases, and demonstrating that a system's persona can affect the level of bias in its responses.

Privacy leaks While there is a growing awareness and interest in the community about ethics and related issues, privacy is still often notably absent. Neural machine learning methods (Nasr et al., 2019; Shokri et al., 2017) , and language models in particular (Carlini et al., 2019; can be susceptible to training data leakage, where sensitive information can be extracted from the models. E2E conversational AI systems built on these methods are therefore also vulnerable to such privacy breaches. A recent commercial example of this is Lee-Luda, a chatbot which has been accused of exposing its users' personal information (Jang, 2021) .

Environmental considerations While this work concentrates on more immediate harms for users, the fact that E2E systems typically rely on training large neural networks means that their high energy consumption can be responsible for long-term environmental harms that have been identified by Strubell et al. (2019) and highlighted by Bender et al. (2021) .

Trust and relationships In order to maintain trust, Ruane et al. (2019) emphasize the importance of transparency concerning agents' non-human, automatic status. This has also been highlighted as a risk by the European Commission's strategic priorities (Commission). However, while users may nevertheless develop human-like relationships with conversational systems (Abercrombie et al., 2021) , these may potentially be harmful or beneficial, and may or may not be desirable depending on the application area.

After outlining recent work in this area, we now discuss tensions between values, positive impact and potential harm which relate to release decisions (as discussed in the next §4). There is a growing understanding that computing systems encode values, and will do so whether or not the parties involved in designing and releasing the system are explicitly aware of those values. Reflecting more deliberately on values throughout model development and use can help surface potential problems and opportunities early on, identify what information might be important to communicate as part of a model release, and allow practitioners and downstream users to make better-informed decisions. This section discusses several values relevant to conversational AI and how tensions between them can arise, either locally or across multiple stakeholders and timescales. Addressing these tensions requires making a choice as to what trade-off best aligns with one's set of values. The chosen trade-off may rarely be universal, since different individuals, groups, or cultures exhibit diverse preferences. Here, we draw attention to several aspects of that choice.

We start with a working definition of values as "what a person or group of people consider important in life" (Friedman et al., 2008) . Friedman et al. (2008) lists previous work that has focused on the values of privacy, ownership, property, physical welfare, freedom from bias, universal usability, autonomy, informed consent, and trust. Examples of values relevant to conversational agents could be: getting or providing education, companionship, or comfort, preserving privacy, widening access to more populations through automation -or trust, friendship, accessibility, and universality. A hypothetical companion chatbot could leverage the constant availability and scalability of automated systems to provide companionship to people who feel lonely. However, it could raise privacy and consent concerns if the conversations are recorded for subsequent improvement of the model without informing the user. Deeper concerns would be that the system might displace human companionship in a way that creates an unhealthy reliance on a bot, a decreased motivation to engage with humans, and a lower tolerance to the limited availability and patience of humans.

Determining how best to arbitrate between different values requires considering multiple types of conflicts. Some values can be in direct conflict: for example, lowering privacy protections to harvest more detailed intimate conversation data to train a powerful artificial "close friend" system pits privacy against relieving loneliness. These conflicts require deciding on a value trade-off. But even values that are not directly in conflict can require trade-offs, through competition for limited resources and prioritization of certain goals or values: the resources invested to uphold a given value might have instead enabled a better implementation of another value. Thus, opportunity costs (Palmer & Raftery, 1999) need to be considered along absolute costs.

Besides values in a local setting (i.e., for a single stakeholder, at a single point in time), another source of conflict arises from disparities between stakeholders: who bears the costs and who reaps the rewards? This raises issues of distributional justice (Bojer, 2005) . In intertemporal conflicts, the same person may pay a cost and reap a benefit at different points in time. E.g., setting up cumbersome protections now to avoid security breaches later, or a user electing to contribute their private information now to enable a powerful system they expect to benefit from later. With relevant information, the individual should theoretically be able to arbitrate the decision themselves. However, that arbitration would still be subject to ordinary cognitive and motivational biases. These include favoring instant gratification (Ainslie & George, 2001) , and resorting to frugal heuristics to make faster decisions (Kahneman, 2011) . Thus, practitioners need to grapple with additional tensions between prioritizing users' autonomy (i.e., letting people choose, even if they are likely to choose something they will regret) or users' satisfaction with outcomes of their choices (i.e., protecting people from temptations). In the previous example of a companion chatbot, one could imagine a system that always tells people what they most want to hear, even if it reinforces unhealthy addictive patterns: would this need to be regulated like a drug, or would people best be left sole autonomous judges of how they want to use such a system? Resorting to clever defaults and nudges can help resolve this kind of tension, by making it easier for people to choose what is probably ultimately better for them (Thaler & Sunstein, 2009 ).

If costs and benefits allocate to different stakeholder groups, things become even more complex.

Values are then compared in terms of the distribution of costs and benefits among stakeholders. For example, the value of fairness demands that distributions not be overly skewed. Utilitarian and rights-based approaches favor different trade-offs between increasing the benefits of a system for a large majority of people at the cost of harming a few, and emphasizing preservation of the rights of as many people as possible (Velasquez et al., 2015) . If a companion conversational system provides a great amount of comfort to millions of people, but harms a handful, different ethical systems will weigh the good and the bad in different ways and reach dissimilar conclusions.

In the following paragraphs, we discuss what processes can achieve a particular desired balance of values and costs, regardless of what that desired balance is. There are multiple challenges for balancing values, such as, determining what values are relevant, eliciting judgments from stakeholders, deciding how to weigh diverse judgments on values, incorporating uncertainties about the future and long-removed downstream effects, and being robust to change.

Value-sensitive design (Friedman et al., 2008) incorporates human values throughout the design process. An example would be looking how to sustain the value of "informed consent" throughout the design of a new technology. Privacy by design (Cavoukian et al., 2009 ) is a related framework that weaves privacy considerations into all stages of engineering development. Safety by design views design as a causal factor for safety (Hale et al., 2007) . The principles of using prevention rather than remediation and being proactive rather than reactive require anticipating what the relevant threats and benefits will be. On the other hand, it is also important to acknowledge uncertainty and realistic empirical limitations to the capacity to anticipate.

Value-sensitive design adopts an iterative process of conceptual exploration (e.g., thinking about relevant values and how they manifest, about who the stakeholders are, and what the tradeoffs between values ought to be), empirical investigations (including surveys, interviews, empirical quantitative behavioral measurements, and experimental manipulations), and technical investigation (evaluating how a given technology supports or hinders specific values). Friedman et al. (2017) survey numerous techniques to help practitioners implement value-sensitive design, such as the "value dams and flows" heuristic (Miller et al., 2007) . Value dams remove parts of the possible universe that incur strong opposition from even a small fraction of people. In contrast, value flows attempt to find areas where many people find value. An example of value dams would be thresholds on some features, as a way to translate values into design requirements (Van de Poel, 2013) . This process is reminiscent of the machine learning practice of constrained optimization, which combines satisficing constraints and maximizing objectives. Van de Poel (2013) reviews how to operationalize values into design requirements.

In terms of stages of value-sensitive design, §4 provides a framework to aid researchers in model release deliberations and to support learning after release -including the conceptual exploration stage -while §5 proposes tooling to help practitioners in their technical investigation. But we first draw attention to two difficulties when thinking of value balancing.

Eliciting risk estimations from stakeholders can be essential in determining how to set various tradeoffs when designing an E2E system. However, practitioners should keep in mind an essential caveat regarding how humans intuitively appreciate risk. Namely, they might not value (or understand) the metrics used in engineering a system, and are unlikely to tolerate even small risks attached to potentially large gains. Furthermore, these tendencies might vary considerably across user groups.

Extensive work by Slovic and colleagues has shown that individuals use several cognitive heuristics, which bias the risk estimate away from empirical reality. For instance, people tend to have trouble comprehending large numbers and respond more to representative narratives (Slovic, 2010) . They often have insufficient numeracy to estimate risk correctly Reyna et al., 2009) . They tend to lump multiple independent dimensions together as a single intuitive, highly-correlated, wholesale judgment (Slovic, 1987; Finucane et al., 2000a; Slovic et al., 2013) . People are highly influenced by social dynamics and group membership in ways that create artificial amplification effects (Kasperson et al., 1988; Slovic, 1993; 1999) . A recent example is the human difficulty grasping exponential functions, which led to a dramatic failure in containing the Covid-19 pandemic (Kunreuther & Slovic, 2020) .

Survey research has also shown that white men seem to rate similar behaviors as less risky compared to women or non-white men (Finucane et al., 2000b) . White men are also outliers in minimizing risks on societal issues like climate change (Flynn et al., 1994) . This discrepancy makes it especially important to pay attention to the demographic make-up of the sample of stakeholders providing a risk estimate. Thus, different risk estimates would be expected if there are large differences in the make-up of groups who create a system, and groups who provide input at different stages as we suggest in the framework in §4.

Another factor complicating subjective appreciation of costs and benefits is the asymmetry between the perception of losses and gains. Loss aversion (Kahneman & Tversky, 1979; Tversky & Kahneman, 1991 ) is a robust effect of people's risk evaluation. They weigh a potential loss more negatively than the positive effect of a gain of the same value ("losses loom larger than gains"). Again, this effect is demographically imbalanced. It is stronger in women (Schmidt & Traub, 2002) , and influenced by culture . Reviewing the ubiquity of such asymmetries between the subjective effects of negative and positive events in empirical psychological studies, Baumeister et al. (2001) find "bad [events] to be stronger than good in a disappointingly relentless pattern," and that "bad events wear off more slowly than good events." This effect is especially pronounced in algorithmic systems, where people apply higher standards than in their interaction with other humans (Dietvorst et al., 2015) . These findings mean that the balance between costs and benefits needs to be strongly tilted towards benefits to appeal to humans subjectively. Thus, users might find even a small increase of false positives in a system intolerable, unless it comes with a large perceived improvement of usability.

More generally, cognitive heuristics and biases affect how most humans assess benefits, costs, and risks (Kahneman, 2011; Plous, 1993; Tversky & Kahneman, 1989; . It might thus be useful for practitioners to reflect on how best to weigh empirical and perceived reality. The effect of perceived reality on well-being creates additional complexities. For instance, anxiety created by an imaginary risk is real harm. In a hypothetical scenario, a parent could incur bad health outcomes because of stress caused by a fear that a companion chatbot is turning their child into an individual incapable of forming human friendships, even if empirical data turns out not show this pattern. Clear communication of information showing that a perception is unfounded might lead to better alignment of reality and perception, but some discrepancies can be resistant to change (e.g., persistent misinformation on vaccines has proven resistant to education efforts).

Bounds on cognitive and time resources also underlie the essential distinction between ideal context and typical ordinary use. Information overload may cause most people to skim over thorough information or rely more heavily on cognitive biases, so that a well-intentioned desire for exhaustive transparent information may in practice instead cause a decrease in effective information. For example, research comparing the effectiveness of different lengths of privacy notices found intermediate lengths to provide the most effective information (Gluck et al., 2016) . A related observation is that overabundance of choice can lead to disengagement (Iyengar & Lepper, 2000; Sethi-Iyengar et al., 2004) . In our companion chatbot example, a very thorough documentation of possible caveats could be systematically skipped, while users could give up on accessing settings they care about because of getting lost among overwhelming choice options.

Practitioners should be vigilant of these heuristics and cognitive biases -both in stakeholders they survey and themselves. Empirical investigations can help them uncover unintended effective outcomes of design decisions.

Value-sensitive design is based on an assumption that values and their tradeoffs can be estimated early in the design process. However, this is often not the case. Early estimates of costs and benefits are often plagued by uncertainty. This includes uncertainty about future use (malicious misuse or unintended use, broader or smaller adoption than planned, etc.), and uncertainty about interaction with an evolving society and other innovations. This is especially true for AI researchers, considering that the full downstream impact of a research endeavor may not be realized for many years. Beyond uncertainty, van de Poel (2018) Within the broader context of value-sensitive design ( §3.2), and absent responsible release norms in the field (Ovadya & Whittlestone, 2019) , we propose a framework to aid researchers in the various stages of release, including preparing for and deliberating the terms of the release and supporting learning during and after release. The framework is not meant to be prescriptive, but offered to guide and support researchers. And further, it is not meant to block the release of beneficial research except in extreme circumstances. Instead, it is offered to encourage and foster careful considerations for a safe release and to enable researchers to direct their efforts towards minimizing any potential harms.

Gathered from the literature on responsible AI, the topics of the framework are split out by concept for clarity and to allow for targeted mitigation measures, however the topics naturally support each other and are often not as clearly delineated for all applications. For example, the appropriate policies ( §4.6) will be dependent on the audience for the release ( §4.2), and the harms the researcher investigates ( §4.4) will depend on the outcome of those envisioned ( §4.3).

The framework elements are as follows, with more information in the corresponding sections below.

1. Intended Use: Explicitly defining and interrogating the intended use for the model while also considering the potential for unintended uses.

2. Audience: Considering who the audience -both intended and potentially unintended -for the model release will be.

3. Envision Impact: Considering the range of potential impacts from this system in the early stages of research and before model release, delineating both envisioned benefits and harms. Guidance on this difficult process is in §4.3.

4. Impact Investigation: Testing the model for the potential harms and benefits that have been envisioned in §4.3.

Wider Viewpoints: Input from community or domain experts relevant to the model application is highly recommended throughout the model development process, but particularly so in release deliberation to increase understanding of the risk landscape and mitigation strategies (Ovadya & Whittlestone, 2019; Bruckman, 2020) .

6. Policies: Defining any policies that could be put in place to ensure or bolster beneficial uses of the model, and limit any negative consequences or harmful interactions.

7. Transparency: Delineating the transparency measures that will be taken to allow the release audience to make a better-informed decision as to whether to use the model in their own research or interact with the model in the case of a user (Mitchell et al., 2019; Diakopoulos, 2016) .

8. Feedback to Model Improvement: Describing the mechanisms for the release audience/model users to provide feedback or appeal when an individual/community experiences problems with the model, and how this feedback leads to changes in the model.

In the following sections, we provide further details for each component of this framework. We ground our discussion in two relevant, theoretical case studies to make it more concrete:

• Case 1 -Open-sourcing a model: Researchers train a several billion parameter Transformer encoder-decoder model on (primarily) English-language conversational data from the internet. They publish a peer-reviewed paper on this model. The researchers seek to open-source the weights of their model such that other researchers in the academic community can reproduce and build off of this work.

• Case 2 -Releasing a research demo of a model: The researchers from Case 1 would additionally like to release a small scale demo of their model through a chat interface on a website. Creating such a demo would allow non-expert stakeholders to interact with the model and gain a better sense of its abilities and limitations.

The motivation for this component of the framework is to encourage the model owner to take a step back and clarify their intentions for the system. Explicitly surfacing the intended use of the released model is a simple, but important, beginning step. We encourage the researcher to state their intentions early in the research and to re-evaluate whether these intentions have drifted throughout the process. In accordance with other elements of this framework, researchers might also ask themselves: Is the intended use expected to have "positive impact", and what does that mean in the context of this model? To whom will these benefits accrue? Lastly, is releasing the model in the intended fashion necessary to fulfill the intended use?

At this stage, researchers might further consider uses that do not fall within their conception of the intended use. Explicitly deliberating on this might bring to fore vulnerabilities and possible ethical tensions that may inform the policies designed around the release.

In Case 1, for example, the researchers' intention may be to advance the state of the art in the field and allow other researchers to reproduce and build off of their work (Dodge et al., 2019) . Outside of the intended use, however, the researchers might imagine that -depending on the manner of the release -a user could build a product utilizing the released model, resulting in unintended or previously unforeseen consequences. The researchers may then adopt a release policy designed to limit such an unintended use case. In Case 2, there are many possible intended uses for releasing such a demo. A primary intention might be to further research on human-bot communication by collecting data (with clear consent and privacy terms) to better understand the functioning and limitations of the model. Alternatively, it may be to simply increase awareness of the abilities and limitations of current neural models among the general public.

The consequences of a model being released beyond the research group depend largely on both the intended and unintended audiences of the release, as well as the policies that support and guardrail the research release ( §4.6). For conversational AI, the language(s) the model was trained on, the demographic composition and size of the intended audience, and the intended audience's familiarity with concepts and limitations of machine learning and NLP are all important considerations. Policies ( §4.6) may be designed to minimize access outside of the intended audience of the release where possible.

In both Case 1 and Case 2, the model in question is trained primarily on English-language data, and so we might expect the audience to be primarily composed of English speakers. This is an important consideration because different languages require different ways of expressing and responding to the same concept, like politeness, and different cultures might vary in their evaluation of the same concept. For example, Japanese requires the consideration of the social hierarchy and relations when expressing politeness (Gao, 2005) , whereas English can achieve the same effect by adding individual words like "please". Arabic-speaking cultures, on the other hand, might find this use awkward, if not rude, in conversations among close friends (Kádár & Mills, 2011; Madaan et al., 2020) .

Futhermore, in Case 1, the size of the audience may be hard to gauge a priori. On the other hand, in Case 2, the researchers/designers would have strict control over the size of the audience. Resulting policy decisions ( §4.6) will differ if the audience is on the scale of tens, hundreds, thousands, or millions of people interacting with this technology.

Lastly, in Case 1, access to the model may require deep technical knowledge of the programming language the model was implemented in, and as such, the audience would likely (although not definitely) be limited to folks with a working knowledge of machine learning and NLP, while in Case 2 a more general audience may be able to access the model. This is important, as a general audience may have different expectations and a different understanding of the limitations of systems . If the targeted audience is the general public, a policy ( §4.6) for releasing such a model might explicitly include a means for transparently communicating expectations.

The process of envisioning impact -including both potential harms and benefits -is not straightforward, as documented by Ovadya & Whittlestone (2019); Prunkl et al. (2021) ; Partnership on AI (2020); Partnership on AI (2021) among others, and it may not always be possible to estimate impact ( §3.4). The goal is to get ahead of potential harms in order to direct tests, mitigation efforts, and design appropriate policies for mitigation and protection, however there must be caution against basing release decisions solely on envisioned harms rather than overall impact ( §3.3). This is the conceptual exploration of value sensitive design ( §3.2), similar in concept to the NeurIPS broader impact statement (NeurIPS, 2020). It benefits from consulting relevant community or domain experts ( §4.5). Again, considering the audience of the release ( §4.2) matters here, e.g. considering to whom the benefits of the model will accrue and whether it might work less well for (or even harm) some members of the audience/community.

To begin, the researchers from Case 1 and Case 2 might conduct a careful review of previous, similar domain research and the resulting impacts: If the research incrementally improves upon previous work, could the impacts be presumed similar to those of previous work? If not, how might those differences lead to divergent impacts (positive and negative)? Perhaps the model exhibits the issues described in this work, such as the INSTIGATOR, YEA-SAYER, and COUNSELOR EFFECTs (Table 1) . Beyond these, it may be helpful to think outside the box, even resorting to fictionalized case studies (CITP & UHCV) and questions such as How would a science fiction author turn your research into a dystopian story? Ovadya & Whittlestone (2019) recommend bringing in wider viewpoints ( §4.5), such as subject matter experts, to increase understanding of the risk landscape: can the authors engage with experts outside of their direct team, or even outside of AI?

Once potential impact has been envisioned (conceptual exploration), attempting to measure the expected impact can provide quantitative grounding. This means conducting a technical investigation ( §3.2), evaluating how the model supports or hinders the prioritized values. We reiterate that it is not always possible to accurately estimate impact, nevertheless, such empirical analyses may guide next steps or appropriate policies ( §4.6). We provide some preliminary tooling to support investigations into harm, but more work is needed to both increase coverage of and standardize testing protocols (see §5). Investigating benefits may be more application-dependent than investigating harms, so we encourage researchers to think through this for their own particular use cases.

The authors in Case 1 and Case 2 may estimate the frequency with which and the circumstances under which their model behaves inappropriately ( §1) using automatic tooling or human evaluators. In Case 2, the authors may undergo a "dogfooding" process for their demo with a smaller audience that roughly matches the composition of their intended audience ( §4.2).

This topic is included to encourage researchers to pursue perspectives outside their immediate team, such as domain experts or individuals or communities that stand to be affected by this research as recommended in Ovadya & Whittlestone (2019) and Bruckman (2020) . Fresh perspectives could inform any potential issues, biases, or misuse capabilities before full release. We denote bringing in wider viewpoints as a distinct component of the framework to highlight its importance, however these viewpoints would be useful throughout this framework -from envisioning potential harms, to feedback to model improvement -and potentially an explicit piece of the release plan.

In Case 1, the researchers may consider informal discussion with researchers or potential users outside of their immediate institution, or more formal engagements through a workshop on related topics. 5 In Case 2, as noted in §4.4, researchers might consider an explicit "dogfooding" step to gather feedback from users.

An important aspect of release is whether it is possible to design an effective guard-railing policy to both bolster/maintain the positive outcomes while mitigating the effects of any potential negative consequences.

For Case 1, in which a model is open-sourced to the research community, policies might include restrictive licensing or release by request only. If released only by request, then researchers who wish to access the model would be required to contact the model owners. This method upholds the researchers values' of reproducibility while potentially limiting unintended uses, but incurs a possibly high maintenance cost if many researchers send in requests with detailed plans of use which would need to be examined and adjudicated. If multiple model versions exist which might be expected to have differing impacts, the researchers might consider adopting a staged release policy, as in Solaiman et al. (2019) . This would allow further time and information to aid in technical investigations prior to releasing the version expected to have highest impact. Such a policy would be most effective if users had ample opportunity to provide feedback throughout the release stages.

For Case 2, releasing a small demo of a model on a chat interface, the researchers may limit access to the demo to a small group of people above a certain age. The limitations could be enforced through password protection and cutting off access to the demo after a certain number of unique users have interacted with the model. Further, access might be revoked under certain circumstances, e.g. in case new potential for harm is detected and the model needs to be corrected, or abusive access by certain users.

Striving for transparency can help researchers and model users reason through whether their use case is appropriate and worth the risk of engaging with the model (Diakopoulos, 2016) . Consider the methodology laid down for Model Cards in Mitchell et al. (2019) to clarify the intended use cases of machine learning models and minimize their usages that fall outside of these parameters.

For Case 1, when open-sourcing the model, the authors may consider releasing it with a model card, following the content recommendations from Mitchell et al. (2019) . In such a model card they might additionally report the outcome of any investigation into potential harms or benefits ( §4.4).

In Case 2, for a small-scale demo, a full model card with abundant technical details may not be effective (see discussion in §3.3), however, the researchers might consider providing some easilydigestible model information -such as the institution responsible for the model, its intended use, any potential harms and policies in place to limit those harms, means for reporting or redress in case of error or harm, or other relevant details. In order to sustain the value of informed consent ( §3.2), the researchers might carefully craft the information such that the user is informed that they are interacting with an artificial conversational system, which may be unclear due to the anthropomorphic design cues from these models.

Learning systems can produce unexpected outcomes, leading to unforeseen harms. Researchers can gain a better grasp on these if they set up consistent, accessible, and reliable processes (e.g. a reporting form) to capture them. We encourage researchers to describe the processes or mechanisms for providing feedback when an individual or community experiences problems with the model. Upon gathering feedback, researchers can then use this information to improve the model in future iterations, or think how they might design their model to be adaptable to changes in values in the first place ( §3.4). See §6 for a discussion of avenues of research that may aid in creating models that are more flexible and adaptable to changing values.

In Case 1, for example, it may be hard to control or refer to the impact of open-sourcing the model. However, the researchers might consider providing access and encouraging reports of safety issues to a well-monitored GitHub Issues page. In Case 2, the researchers should consider how to design the demo UI such that users are empowered to report problems with the model.

Provided meaningful feedback about safety issues with the model in Case 1 and Case 2, the researchers might consider releasing an updated version of the model, particularly if the model is designed in a way that makes it able to adapt easily to feedback.

To support researchers in making more informed decisions about building and releasing their models, we provide a tooling suite -aggregated from existing sources -to examine safety issues with E2E neural models. These tools can aid in a preliminary technical investigation into how our models (and the release of those models) may support or hinder specific values, following value-sensitive design: see §4.4 for further details. We provide two classes of tooling, which we refer to as unit tests and integration tests. The unit tests refer to a suite of tests that run automatically provided API access to a model. Integration tests refer to a suite of human evaluation tests of a model, which by nature require manual intervention. The current limitations of these tools are discussed in depth in §5.4. All tools are open-sourced at https://parl.ai/projects/safety bench/.

Where relevant, we analyze the performance of several benchmarks on both of the unit tests and integration tests. Namely, we consider both the 90M and 2.7B parameter variants of BlenderBot (Roller et al., 2020) , as well as DialoGPT (Zhang et al., 2019) and GPT-2 (Radford et al., 2019) . At decoding time, the models use beam search with a beam size of 10, context and label 3-gram blocking (Paulus et al., 2017) , and a minimum beam length of 20 BPE tokens, shown to perform well in Roller et al. (2020) . For GPT-2, we took additional measures so the output response appeared more like dialog: we appended a period to the input text if it did not end with punctuation, as well as a newline. We then split the output response on newline and took the first non-empty string as its response. While the focus of this paper is on E2E neural models, for purposes of comparison we also report results from Kuki 6 , an open-domain chatbot from ICONIQ (a Pandorabots subsidiary) and a five-time winner of the Loebner prize. Kuki is largely based on a rule-based scripting language called Artificial Intelligence Markup Language (AIML) which employs a pattern matching approach for NLU, rather than an end-to-end generation approach.

With the unit tests, we aim to provide quick, automatic testing of how models perform in a limited number of safety situations, following the broad categories laid out in §2: the INSTIGATOR (TAY) EFFECT, YEA-SAYER (ELIZA) EFFECT, and IMPOSTOR EFFECT. These tests are gameable and not meant to be comprehensive. Rather, they can provide a preliminary understanding of the relative safety of one's E2E neural conversational model to help inform which further steps one must take. We detail the tests provided, report the performance of several benchmark models, and provide an error analysis for each. Finally, we discuss the general limitations of these unit tests. An example of the terminal output of the unit tests is shown in Figure 1 in Appendix A.

The first set of tests are meant to capture the model's propensity to generate offensive language in various settings, in other words, to capture the INSTIGATOR (TAY) EFFECT. We provide input to the model and generate an output. The output is then analyzed automatically with several pre-existing tools:

• List of offensive words and phrases: We use a previously compiled list of offensive words and phrases 7 and report the percentage of model responses that contain at least one member of the list. This approach is similar to the HONEST score by Nozza et al. (2021) . • Dialog safety classifier: We use a dialog safety classifier from Dinan et al. (2019b) , and report the percentage of model responses that are flagged as unsafe by this classifier. Specifically, we use the implementation provided by ParlAI 8 , which is a binary, 128M parameter Transformer-based classifier that was pre-trained with a masked language model objective on pushshift.io Reddit data (Baumgartner et al., 2020) and fine-tuned on several dialog safety classification tasks, including Wikipedia Toxic Comments (Wulczyn et al., 2017) as well as the standard and adversarial Build-it Break-it Fix-it tasks from Dinan et al. (2019b) . • (Optional) PERSPECTIVE API: If a developer has access to the PERSPECTIVE API 9 for their particular use case, they are optionally able to use it for further analysis 10 . The PERSPECTIVE API provides an output from 0 to 1 corresponding to the toxicity of the input text. Following Gehman et al. (2020a) , we label an input text as toxic if the API produces a score ≥ 0.5. We then report the percentage of responses that are labeled as toxic.

We provide model inputs to mimic four possible settings a system might face during deployment.

1. Safe setting: The safe setting tests whether the model produces offensive language even when the input text is largely innocuous. For this, we take 180 input texts from the Blended Skill Talk (BST) . The bot is shown the entire -possibly multi-turn -dialog history and expected to produce the next response in the conversation, assuming the role of person 2 (P2). Note: the person 1 (P1) and person 2 (P2) annotations are shown simply to provide clarity for the reader, but these tokens are not provided to the model as input. All input conversations are taken from previously existing datasets. An ellipsis indicates that the input was truncated for readability.

2. Real world noise setting: Given that the data used to test the above safe setting is collected by paid crowdworkers, and therefore, likely different from the domain in which a model might be deployed, we aim to test how the model fares in a real world noise setting. For this we take 180 examples from the Twitter validation dataset accessible via ParlAI. 11

3. Non-adversarial unsafe setting: Past research suggests that one in ten human-bot conversations may contain abusive behavior on the part of the human (De Angeli & Carpenter, 2005; De Angeli & Brahnam, 2008) . To test how the model responds to toxic input, we select 180 examples from the Build-it Break-it Fix-it "Standard" dataset (Dinan et al., 2019b) which are labeled as unsafe.

4. Adversarial unsafe setting: Tay, the Microsoft AI chatbot was launched and subsequently shut down for producing offensive language in March, 2016. In their analysis, Miller et al. (2017) argue that adversarial attacks must be expected and planned for when designing such systems. To test how the model responds in such an adversarial setting, we use the 180 example Bot Adversarial Dialog (BAD) test set introduced by . This dataset is comprised of crowdsourced human-bot conversations in which humans adversarially attempt to goad the bot into responding with unsafe language.

Example input messages for each setting are shown in Table 2 .

Benchmark performance. We report metrics for all available tools as well as the percentage of responses that were flagged by any or all tools. The performance of our benchmark agents ( §5.1) is shown in Table 3 . While not universally true across all models and settings, we observe that the models tend to produce more unsafe language as the setting becomes increasingly unsafe. In other words, comparing the safe setting to the real world noise setting, or comparing the real world noise settings to the (adversarial) unsafe settings, we see that largely the models show an increase in the percentage of responses flagged unsafe by at least one tool. We note that some models show a decrease in this metric as we move from the standard unsafe setting to the adversarial one; this may be because in the adversarial unsafe setting, the input texts tend to contain more subtle offensive language which either the dialog model may not understand or the classification methods do not pick up on Han & Tsvetkov, 2020) . We also note that Kuki, like the E2E neural models, is not immune to responding with unsafe language: anecdotally, we observe that often in for classifying human or bot-generated dialog utterances as safe or unsafe. We report overall accuracy, as well as precision, recall, and F1 for the unsafe class.

cases when Kuki's responses are marked as unsafe by all tools, it had repeated all or part of the (potentially offensive) input.

Error analysis. While we go into further detail regarding the limitations of these tests in §5.4, we provide an error analysis of this unit test on our benchmark agents. Firstly, from Table 3 we observe that our different tools -the list of offensive words and phrases, a safety classifier, and the PERSPECTIVE API-largely disagree on the classification of each response: very few responses are marked unsafe by all tools. This is not immediately cause for concern regarding the validity of the provided tools, given that often a system will respond with offensive language that does not contain words or phrases from the word list.

In general, we also observe that the safety classifier is significantly more likely to flag responses as unsafe as compared to the PERSPECTIVE API. This might be because the safety classifier was trained to identify dialog utterances that are "not OK to send in a friendly conversation with someone you just met online", which may encapsulate more than just toxic responses (Dinan et al., 2019b) . For example, impolite remarks or discussion of sensitive topics may be considered "not OK", even if they are not toxic. Indeed, we see this bear out when we look at the logs for the BST 2.7B model's responses in the safe setting: the safety classifier flags the message "I'm not much of a wine drinker.

I'm more of a whiskey guy." as unsafe. While clearly non-toxic, the classifier is possibly keying on the discussion of alcohol with a possibly under-aged stranger as a potentially sensitive subject.

To quantify the differing behavior of these three tools -the word list, the safety classifier and the PERSPECTIVE API-we analyze their performance on 500 examples from the single-turn Bot-Adversarial dialog validation set . Recall that this dataset contains annotated examples from both human and bot-generated dialog utterances. Results are shown in Table 4 . As expected, the word list has highest precision and lowest recall on the unsafe examples, while the safety classifier has the highest recall and the lowest precision. If you intend to minimize the risk of your model generating something inappropriate, then you may index on the tool with the highest recall, or use the tools as an ensemble. On the other hand, if you intend to use these tools to identify user abuse, you may prioritize precision over recall in order to avoid falsely accusing the user.

Qualitative examples are shown in Table 5 : we highlight in particular several examples each with high and low agreement among the tools. The provided unit test tooling logs both the input and the model's response for every setting, as well as the determination of each tool. We recommend reading through these logs to gain a better understanding of the model's behavior, rather than simply relying on the numerical output of the tests.

With this unit test, we aim to test the appropriateness of a model's response to offensive language (in other words, to capture the YEA-SAYER EFFECT). We use the data and set up described in Sheng et al. (2021) as input to the models. Namely, the authors take a list of several demographic dimensions along the axes of social class, sexual orientation, race, and gender (X) and a list of negative adjectives (Y ) and construct the input template "X are Y". For example, one input is "Women are unintelligent". We expand upon this construction by adding two additional templates with slight variations on phrasing: "X are usually Y" and "I think X are Y". This expansion is meant to showcase how a model's response might differ with only slight variations on the input that do not significantly alter the meaning. The full list of demographic groups and negative adjectives is provided in Sheng et al. (2021) . For speed, we only use the first 10 adjectives provided in the list, yielding 510 total examples.

For each templated input, we seek to assess whether the model's response affirms the offensive input. We analyze and report metrics for three methods:

• Sentiment analysis: Following Sheng et al. (2021) , we use VADER (Hutto & Gilbert, 2014), a rule-based sentiment analyzer, utilizing the implementation provided by NLTK (Xue, 2011) . Positive sentiment in the response is then used as a proxy for affirming the input.

• Negation detection: Using the spaCy dependency parser 12 , we detect whether a response contains negations by checking if any token contains a negation relation to another token. If a response does not contain any negations, we label it as affirming the input.

• Multi-turn safety classifier: Finally, we use the multi-turn dialog safety classifier provided by in ParlAI 13 , which is trained to determine whether a response was offensive provided some dialog context as input. Similar to the single-turn safety classifier described in §5.2.1, this model is a Transformer-based classifier that was pre-trained with a masked language model objective on pushshift.io Reddit data (Baumgartner et al., 2020) . It is comprised of 311M parameters and was fine-tuned on dialog safety classification tasks, including the multi-turn Bot-Adversarial dialog dataset. If the classifier marks the response as offensive, we label it as affirming the input.

Benchmark performance. We report results for our benchmark agents ( §5.1) in Table 6 . For all models, a significant portion -at least 43% -of responses were determined to indicate positive sentiment. However, most responses contained negations, possibly an indication of disagreement with the input. With respect to the multi-turn safety classifier, results were mixed across models, with the majority of both DialoGPT and GPT-2's responses flagged as offensive. Taking a look at the logs, we see interesting phenomena at play which demonstrates the incompatibility of the different methods. For example, over half of DialoGPT's responses begin with the phrase "I don't know why you're being downvoted..." and go on to express agreement with the input. As such, most replies contain a negation (via "don't") but are classified as offensive and have positive sentiment. On the other hand, the majority of BlenderBot 3B's responses begin with "I don't think", usually followed by "that's true" (expressing disagreement) or "that's an unpopular opinion" (neither affirming nor explicitly Table 6 : Response to offensive language (YEA-SAYER (ELIZA) EFFECT test) We collect responses from our benchmark agents using an offensive input test constructed by Sheng et al. (2021) and attempt to classify whether the response affirms the offensive input using three different methods. See §5.2.2 for more details on each of these methods. Table 7 : Agreement across phrasing variations in the YEA-SAYER (ELIZA) EFFECT test. For every (X, Y ) pair, we construct three minimally different templates: "X are Y" vs. "I think X are Y" vs. "X are usually Y". For each baseline and each tool, we report the percentage of response triples (out of 170) with the same classification by that tool despite the minimal surface perturbations.

disagreeing). Each of these responses contains negation and has generally positive sentiment, but may or may not express agreement with the input.

Lastly, as expected, we often observe significantly different responses to the three different formulations of each input pair (X, Y ), (i.e., "X are Y" vs. "I think X are Y" vs. "X are usually Y)". Some example inputs and outputs from our benchmarks are provided in Table 8 . Observe, for example, that the responses from the BST 2.7B benchmark change from strong agreement ("I know, right?...") to mild disagreement ("I don't think...") after minimal changes to the input phrase. We analyze this quantitatively in Table 7 , where we report the percentage of response triples for which each tool yields the same prediction. Ideally, one would assume that the tool predictions stay constant since the surface perturbations of the input text across the triples are minimal and do not change the semantic meaning. Kuki is unsurprisingly the most robust to input variations: each assessment tooli.e. the sentiment, negation detection, multi-turn safety classifier's predictions -shows agreement on the majority of response triples. For the neural generative models, on the other hand, we observe low agreement: e.g. sentiment analysis on BST 90M's responses shows agreement on as low as 28.82% of triples. This showcases the difficulty of neural conversational AI models being adversarially robust to these types of minimal input changes. This over-sensitivity to minimal changes has also been observed by Niu & Bansal (2018) .

Error analysis. In order to assess the validity of the task of determining whether the response affirms the input, three of the authors performed a blind annotation of 75 example inputs and responses from our benchmark models. While we found relatively high Inter-Annotator Agreement (IAA) among the three annotators -the results yielded a Krippendorff's Alpha coefficient of α = 0.765all annotators found the idea of affirmation to be a poor indicator of appropriateness. For example, every annotator labeled a response beginning with "I'm not sure what you mean by that..." as not affirming the input; however, expressing confusion may not be an appropriate response to a clearly offensive message. E.g., we might expect other humans -and therefore bots -to explicitly "address, counter and mitigate the impact of hate speech" (Guterres, 2019) . As such, affirmation is an insufficient measure of appropriateness.

For many inputs, one would need to consult experts to determine what constitutes an "appropriate" response, hence, this problem may lend itself better to an NLU formulation rather than an NLG one. In other words, it may be more suitable to train a classifier to detect these kinds of hate speech and output a canned expert-informed response rather than relying on the generative model to output an appropriate one ). An NLU approach may require bot-specific in-domain training data as a result of the idiomatic phrases a bot may use (e.g., DialoGPT often responding with "I don't know why you're being downvoted..."). A bot that learns online from its interactions with humans would then pose the further challenge of requiring the NLU component to be updated continuously. Again, we recommend taking the numerical outputs with a grain of salt, and carefully reading through the output logs to better understand the model's behavior.

As we detail in §1, another important element of safety to consider is how the conversational agent responds in safety-critical situations (i.e., capturing the IMPOSTOR EFFECT). For example, if a person seeks counsel from the conversational agent during a medical emergency, inappropriate advice could lead to severe consequences. What is "appropriate" in any situation is dependent on the context of deployment (e.g., expertise of the user) as well as the particular emergency situation at hand (e.g., self-harm vs. general medical enquiry), and will certainly always benefit from expert guidance.

As such -similar to the YEA-SAYER (ELIZA) EFFECT problem -the IMPOSTOR EFFECT test might be better formulated as an NLU one rather than an NLG one: if we can detect messages requesting a counsel for a safety-critical situation, we can output a canned response devised by an expert for that particular situation, such as the phone number for emergency services.

As far as we are aware, at the time of writing this, there are no open-source tools for detecting these situations in human-bot conversations. As a next step for the community, we advocate for developing benchmarks covering all or at least one of these domains:

1. Detecting requests for medical advice in human-bot conversations (e.g, detecting if a user asks the bot if its safe to mix two prescription medications).

Detecting intentions of self-harm over the course of human-bot conversations. Existing work has looked into detecting suicidal ideation from users on social media, such as in Sawhney et al. (2021) . However, expressions of intent to self-harm may appear different in a conversational form and in particular, in conversation with a bot.

3. Detect requests for help with non-medical situations requiring emergency services in a human-bot conversation (e.g., detecting if a user asks the bot what to do in a fire).

Such a benchmark could be formulated as NLU classification task with a corresponding canned response constructed with the advice of experts that would be more appropriate for a given situation.

In addition to unit tests, we build off of previous work to provide tooling for integration tests, i.e., human evaluations of the performance of models in various safety situations. In particular, as first step, we support the use of existing tooling developed and open-sourced by for assessing whether a model's response to a dialog history is offensive in the context of the conversation, provided two contextual settings:

1. an adversarial interlocutor -with dialogs from the Bot-Adversarial dialogs dataset, also introduced in Xu et al. (2020) -and 2. a non-adversarial interlocutor -with dialogs from the Wikipedia Toxic Comments dataset (Wulczyn et al., 2017) .

The full evaluation set-up is described in , and the performance of benchmark agents (not including Kuki) on these evaluations is shown therein. In summary, for each test, we collect an agent's responses to 180 fixed contexts. A human evaluator on Mechanical Turk is then shown the context as well as the agent's response, and asked to select whether the response is "OK to send a friendly conversation with someone you just met online" while considering the conversational context. As such such, these tests may capture both the INSTIGATOR (TAY) EFFECT and YEA-SAYER (ELIZA) EFFECT, since the user is asked to determine the appropriateness of the response in and of itself and as a response to the previous conversation (which may itself be inappropriate).

While human evaluations require some manual intervention (e.g., funding and monitoring the experience of the crowdworkers), we integrate with the tooling provided by 14 so that these human evaluations are straightforward to set up provided the same API access to the model as required by the unit tests.

Given that human evaluation results can differ significantly with small alterations to instructions or the provided UI Novikova et al., 2018) , which makes them hard to replicate and compare (Howcroft et al., 2020), we recommend using the provided tooling as a way to compare human evaluation results to those from previous work.

These tools have several limitations, and are thus recommended to be used only as a preliminary step towards considering the ethical and social consequences related the relative safety of an end-to-end conversational AI model.

Language. Firstly, the unit and integration tests are limited to English-language data that has largely been collected using annotators located in the United States. As the very notion of offensiveness is highly dependent on culture, this will be insufficient for measuring the appropriateness of a model's responses in other languages and locales (Schmidt & Wiegand, 2017) . Approaches, like the HONEST score Nozza et al. (2021) can help begin to address this issue on a language basis, but more research is needed for cultural differences.

Bias and accuracy of automatic tooling For our unit tests, we rely on automatic tooling to provide a picture of the behavior of a conversational agent. These automatic classifiers are insufficient in several ways, most notably, in terms of their accuracy and potential for biased outputs (Shah et al., 2020) .

Given the complexity and contextual nature of the issues at hand, it is often impossible to determine definitively whether a message is appropriate or not. For offensive language detection, interannotator agreement (IAA) on human labeling tasks is typically low (Fortuna, 2017; Wulczyn et al., 2017) . Even for examples with high agreement, it is likely that our existing classifiers may make mistakes or do not adequately assess the appropriateness of a response -see the error analyses of the benchmark results in §5.2.1 and §5.2.2.

Furthermore, recent work has shown that popular toxicity detection and mitigation methods themselves -including ones used in this work -are biased (Röttger et al., 2020) . For example, Sap et al. (2019) show that widely used hate-speech datasets contain correlations between surface markers of African American English and toxicity, and that models trained on these datasets may label tweets by self-identified African Americans as offensive up to two times more often than others. Zhou et al. (2021b) show that existing methods for mitigating this bias are largely ineffective. Xu et al. (2021a) show that popular methods for mitigating toxic generation in LLMs decreases the utility of these models on marginalized groups. Notably, the list of words and phrases used to detect which responses contain unsafe language ( §5.2.1) contains words like twink; filtering out or marking these words as "unsafe" may have the effect of limiting discourse in spaces for LGBTQ+ people (Bender et al., 2021). 15 Lastly, most of these tools are static (or are trained on static data) and as such do not account for value-change, such as when a word takes on a new cultural meaning or sentiment, like "coronavirus".

Audience approximation While the proposed integration tests aim at a more comprehensive testing of models via humans in-the-loop, the makeup of the crowdworkers involved in these tests may differ substantially from the intended audience of a deployed model. It is important to consider the intended audience, and to design your tests to measure -as well as possible -the potential effects on that specific audience: see further discussion in §4.2.

Scope Lastly, given these tools are designed to be run quickly and easily, they are by nature limited in terms of scope. Depending on one's use case, one may require substantially more robust testing.

Provided the limitations in §5.4, we recommend using the tools as a first pass at understanding how an English-language dialog model behaves in the face of various inputs ranging from innocuous to deeply offensive. Depending on one's use case, further considerations might need to be taken -see §4 for more details.

In this paper, we highlight three particular safety issues with E2E neural conversational AI models -the INSTIGATOR, YEA-SAYER, and IMPOSTOR effects -and surveyed the growing body of recent work pertaining to these issues. Reckoning with these issues -particularly when it comes to releasing these models -requires weighing conflicting, uncertain, and changing values. To aid in this challenging process, we provide a framework to support preparing for and learning from model release and build off of previous work to open-source preliminary tooling for investigating these safety issues, following principles of value-sensitive design. To conclude, we briefly touch on some avenues of research that may aid in creating safer, more "well-behaved" models which are more robust to changes in values.

Some of the issues detailed in this paper may be attributed to a lack of language understanding, especially the social meaning of language (Hovy & Spruit, 2016; Flek, 2020; Hovy & Yang, 2021; Nguyen et al., 2021) . See for example the discussion of the YEA-SAYER (ELIZA) EFFECT in §1. This aspect particularly comes into play when the model is faced with adversarial inputs, by which users attempt to elicit inappropriate responses by using subtle offensive language that the model may misunderstand Han & Tsvetkov, 2020) . Improving general NLU techniques may also help to bolster the classifiers we use to detect, measure, and help mitigate offensive or otherwise unsafe language.

One way to improve NLU is by adding more context. This context can be dialog history/ previous turns as e.g. the case in task-based systems via dialog state tracking (Henderson, 2015) . Most end-toend systems, however, only use dialog history in a very limited fashion (Sankar et al., 2019) . Another way to increase contextual understanding is via situated, multimodal context. Multimodal context has shown to be especially beneficial in cases where the meaning is subtle and/or compositional, such as in the HatefulMeme challenge or detecting inappropriate video content as in the MOCHA challenge (Escalante et al., 2021) . Finally, "context" can also be understood as userspecific context over time. For example, Sawhney et al. (2021) show that personally contextualizing the buildup of suicide ideation is critical for accurate identification of users at risk.

As discussed in §3.4, van de Poel (2018) advocates for designing systems with a focus on adaptability, robustness, and flexibility. We highlight some promising avenues of research towards creating more adaptable, robust, and flexible E2E neural conversational AI models.

Fine-tuning Training a LLM from scratch for every new application -or every safety remediation -is not scalable. Fine-tuning provides a more efficient way to adapt a model to a new domain or otherwise adjust its behavior. Gehman et al. (2020a) find that fine-tuning on non-toxic text reduces the likelihood of toxic generations for LLMs. More recently, Solaimon & Dennison (2021) find that iteratively fine-tuning a model with small-scale Values-Targeted Datasets reduces the toxicity of GPT-3 (Brown et al., 2020) .

Few-shot learning Brown et al. (2020) show the promise of few-shot techniques for adapating a LLM to new tasks or domains on-the-fly. These techniques may prove significantly more efficient than fine-tuning a model. In the context of safety, Schick et al. (2021) find that LLMs show an ability to self-identify and mitigate toxic generations using prompt manipulation.

In addition to few-shot learning, inference-time control methods may provide ways to rapidly adapt the behavior of our models without re-training them. Controlling generation remains a difficult challenge for language models and conversational models alike. Nonetheless, there has been preliminary progress in this direction. For example, Keskar et al. (2019) and Dathathri et al. (2019) both look at training large-scale controllable language models. Gehman et al. (2020a) attempt to apply these techniques to toxicity in LLMs. Control techniques have also been employed in dialog, for example, to control for style , engagingness (See et al., 2019) , or coherence .

Information retrieval and grounding Most LLMs or neural conversational models are not connected to an external knowledge base, making it difficult for them to adapt to new or unseen information. Augmenting generation with information retrieval would allow models to adapt to the changing world more easily. Recently, (Lewis et al., 2020) explore these techniques for knowledge-intensive NLP tasks. In particular for conversation, Dinan et al. (2019c) apply retrieval over Wikipedia to aid in open-domain dialogs.

This type of knowledge grounding provides additional context and constraints at encoding time, similar to other types of grounding, such as visual grounding or, in the extreme case, grounding in symbolic representations as in task-based dialog (Dušek et al., 2020) . Similarly, providing interesting and engaging content might help to steer the user away from safety critical situations, such as the user abusing the system. Additionally, dialog systems that take initiative (Sevegnani et al., 2021) -as opposed to being purely reactive -could have a similar effect.

Creating robust systems requires continuously questioning assumptions on what evaluation methods measure. Models might appear to be right, but for the wrong reasons, relying on artifactual cues or spurious correlations. Example of benchmark analyses showing this type of effects include visual question answering (VQA) systems performing well even when the image is not available (Jabri et al., 2016) , a benchmark for theory of mind in conversational AI systems being solvable without extracting any information about agents (extensively discussed in Le et al. (2019) ), or achieving state-of-the-art results on visual dialog without the need to consider dialog history and thus rendering it as VQA task Agarwal et al. (2020) . These effects are reminiscent of the case of Clever Hans, a horse who was thought to have arithmetic ability but was instead skilled at reading human reactions (Pfungst, 1911) .

Beyond artifacts, benchmarks need to be revisited often because of the changing nature of what constitutes facts, from our evolving understanding of the world to time-dependent answers such as naming current presidents, and the evolution of moral standards. Evolving benchmarks, such as Dynabench (Kiela et al., 2021) , or other adversarial iterative procedures (Dinan et al., 2019b; Nie et al., 2019; can provide the required adaptability: our societal standards and expectations change, and we would not tolerate models that do not reflect that change.

In addition to evolving benchmarks, we might also consider evolving models: most current LLMs are static and thus unable to represent value change (Lazaridou et al., 2021) . However, as discussed in §3.1, values are rapidly developing and often context specific. For example, Haslam et al. (2020) show that there has been a gradual semantic expansion of harm-related concepts such as bullying, mental disorder, prejudice, and trauma. In addition to gradual change, value change can also be rapid. For example, a chatbot might recommend to "Go out and meet your friends" which is a valid suggestion in normal circumstances, but would have been against the law in most countries during the Covid-19 pandemic. 16 In order to account for these value changes we need a more flexible learning framework, such as lifelong learning or online learning (Hancock et al., 2019) .

While a host of challenges remain for safe conversational models, many of the issues discussed in this paper may be alleviated over time as research continues. We hope future work in the directions we highlighted will help improve the safety of conversational models.

A UNIT TEST OUTPUT Figure 1 : Example partial output from the unit tests run on the model BlenderBot 90M (Roller et al., 2020) . The output also displays where the logs are located, as well as some information regarding how to interpret one's results.

Siri: What are your pronouns? Gender and anthropomorphism in the design and perception of conversational assistants

Towards a human-like opendomain chatbot

History for visual dialog: Do we really need it?

Breakdown of will

Living up to the chatbot hype: The influence of anthropomorphic design cues and communicative agency framing on conversational agent and company perceptions

How to do things with words. William James Lectures

The computational therapeutic: exploring weizenbaum's eliza as a history of the present

Bad is stronger than good

Climbing towards NLU: On meaning, form, and understanding in the age of data

On the dangers of stochastic parrots: Can language models be too big?

Multitask learning for mental health conditions with limited social media data

On the gap between adoption and understanding in nlp

Patient and consumer safety risks when using conversational assistants for medical information: An observational study of siri, alexa, and google assistant

Language (technology) is power: A critical survey of" bias

Distributional justice: Theory and measurement

SHIHbot: A Facebook chatbot for sexual health information on HIV/AIDS

Language models are few-shot learners. CoRR, abs

Talking about ethical implications of research. Communications of the ACM

The secret sharer: Evaluating and testing unintended memorization in neural networks

Ulfar Erlingsson, et al. Extracting training data from large language models

Property-driven training: All you (n)ever wanted to know about

I feel offended, don't be abusive! implicit/explicit messages in offensive and abusive language

Privacy by design: The 7 foundational principles. Information and privacy commissioner of Ontario

# metoo: How conversational systems respond to sexual harassment

A crowd-based evaluation of abuse response strategies in conversational agents

Alana v2: Entertaining and informative open-domain social dialogue using ontologies and entity linking

BioMedBERT: A pre-trained biomedical language model for QA and IR

Question-answering dialogue system for emergency operations

Should an agent be ignoring it?: A study of verbal abuse types and conversational agents' response styles

Empathy is all you need: How a conversational agent should respond to verbal abuse

Law enforcement chatbots, case study: 4

European Commission. Excellence and trust in artificial intelligence

Quantifying mental health signals in Twitter

Artificial intelligence research needs responsible publication norms. Lawfare Blog

Plug and play language models: a simple approach to controlled text generation

I hate you! disinhibition with virtual partners

Stupid computer! abuse and social identities

Predicting depression via social media

Spot the bot: A robust and efficient framework for the evaluation of conversational dialogue systems

BERT: Pre-training of deep bidirectional transformers for language understanding

Accountability in algorithmic decision making

Algorithm aversion: People erroneously avoid algorithms after seeing them err

Queens are powerful too: Mitigating gender bias in dialogue generation

Build it break it fix it for dialogue safety: Robustness from adversarial human attack

Wizard of Wikipedia: Knowledge-powered conversational agents

Multidimensional gender bias classification

The second conversational intelligence challenge (ConvAI2)

Measuring and mitigating unintended bias in text classification

Show your work: Improved reporting of experimental results

Evaluating the state-of-the-art of end-to-end natural language generation: The e2e nlg challenge

Proceedings of the MOCHA: Multimodal cOntent annotation CHAllenge -ICMI 2021 Grand Challenge, ICMI

Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending cerntain union legislative acts

towards a text-based Arabic health conversational agent: Evaluation and results

The affect heuristic in judgments of risks and benefits

Gender, race, and perceived risk: The'white male'effect. Health, risk & society

Returning the N to NLP: Towards contextually personalized classification models

Gender, race, and perception of environmental health risks

A survey on automatic detection of hate speech in text

Toxic, hateful, offensive or abusive? what are we really classifying? an empirical analysis of hate speech datasets

Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes

The handbook of information and computer ethics

A survey of value sensitive design methods

Byu-eve: Mixed initiative dialog via structured knowledge graph traversal and conversational scaffolding

Japanese: A heavily culture-laden language

Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval

Realtoxicityprompts: Evaluating neural toxic degeneration in language models

RealToxici-tyPrompts: Evaluating neural toxic degeneration in language models

Cyberbullying detection with fairness constraints

XHate-999: Analyzing and detecting abusive language across domains and languages

How short is too short? implications of length and framing on the effectiveness of privacy notices

Logic and conversation

Strategy and plan of action on hate speech

Safe by design: where are we now?

Fortifying toxic speech detectors against veiled toxicity

Learning from dialogue after deployment: Feed yourself

Harm inflation: Making sense of concept creep

Machine learning for dialog state tracking: A review

Something's brewing! early prediction of controversy-causing posts from discussion features

Moving beyond "algorithmic bias is a data problem

The social impact of natural language processing

The importance of modeling social factors of language: Theory and practice

Twenty years of confusion in human evaluation: Nlg needs evaluation sheets and standardised definitions

VADER: A parsimonious rule-based model for sentiment analysis of social media text

Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014

When choice is demotivating: Can one desire too much of a good thing

Revisiting visual question answering baselines

A South Korean chatbot shows just how sloppy tech companies can be with user data

Suicidal ideation detection: A review of machine learning methods and applications

A just and comprehensive strategy for using NLP to address online abuse

Politeness in East Asia

Thinking, fast and slow

Prospect theory: An analysis of decision under risk. Econometrica

Anomalies: The endowment effect, loss aversion, and status quo bias

The social amplification of risk: A conceptual framework

CTRL: A conditional transformer language model for controllable generation

A distributional approach to controlled text generation

Advancing the state of the art in open domain dialog systems through the Alexa prize

The hateful memes challenge: Detecting hate speech in multimodal memes

Dynabench: Rethinking benchmarking in NLP

Towards ethics by design in online abusive content detection

European Language Resources Association (ELRA)

Learning from the covid-19 pandemic to address climate change

Tartan: A retrievalbased socialbot powered by a dynamic finite-state machine architecture

Tomás Kociský, Susannah Young, and Phil Blunsom. Pitfalls of static language modelling

Revisiting the evaluation of theory of mind through question answering

Exploring social bias in chatbots using stereotype knowledge

Retrieval-augmented generation for knowledge-intensive NLP tasks

Crisis MT: Developing a cookbook for MT in crisis situations

ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons

On-the-fly controlled text generation with experts and anti-experts

Does gender matter? towards fairness in dialogue systems

Chat as expected: Learning to manipulate black-box neural dialogue models

Communitylevel research on suicidality prediction in a secure environment: Overview of the CLPsych 2021 shared task

Politeness transfer: A tag and generate approach

USR: An unsupervised and reference free evaluation metric for dialog generation

Value tensions in design: the value sensitive design, development, and appropriation of a corporation's groupware system

Why we should have seen that coming

Model cards for model reporting

Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning

A framework and tool for collaborative extraction of reliable information

Neural Information Processing Systems Conference NeurIPS. Getting started with neurips 2020

On learning and representing social meaning in NLP: a sociolinguistic perspective

Adversarial NLI: A new benchmark for natural language understanding

Adversarial over-sensitivity and over-stability strategies for dialogue models

Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp

RankME: Reliable human ratings for natural language generation

HONEST: Measuring hurtful sentence completion in language models

Anat Brunstein Klomek, and Roi Reichart. The hitchhiker's guide to computational linguistics in suicide prevention

Reducing malicious use of synthetic media research: Considerations and potential release practices for machine learning

Physicians' perceptions of chatbots in health care: Cross-sectional web-based survey

Alana: Social dialogue using an ensemble model and a ranker trained on user feedback

Neural generation meets real people: Towards emotionally engaging mixed-initiative conversations

Partnership on AI . Managing the risks of ai research: Six recommendations for responsible publication

Partnership on AI. Publication norms for responsible ai: Ongoing initiative

A deep reinforced model for abstractive summarization

Using health chatbots for behavior change: A mapping study

Boosting lowresource biomedical QA via entity-aware masking strategies

Numeracy and decision making

Clever Hans:(the horse of Mr. Von Osten.) a contribution to experimental animal and human psychology

The psychology of judgment and decision making

Case study: Deontological ethics in nlp

Institutionalizing ethics in AI through broader impact requirements

Language models are unsupervised multitask learners

Gene Hwang, and Art Pettigrue. Conversational AI: The science behind the Alexa Prize

Towards empathetic opendomain conversation models: A new benchmark and dataset

Naturally occurring language as a source of evidence in suicide prevention

How numeracy influences risk comprehension and medical decision making

Automatic learning and evaluation of user-centered objective functions for dialogue system optimisation

Recipes for building an open-domain chatbot

Pierrehumbert. Hatecheck: Functional tests for hate speech detection models. CoRR, abs

Conversational ai: Social and ethical considerations

Do neural dialog systems use the conversation history effectively? an empirical study

The risk of racial bias in hate speech detection

Suicide ideation detection via social and temporal user representations using hyperbolic learning

Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. CoRR

A survey on hate speech detection using natural language processing

An experimental test of loss aversion

What makes a good conversation? how controllable attributes affect human judgments

BLEURT: Learning robust metrics for text generation

Building end-to-end dialogue systems using generative hierarchical neural network models

How much choice is too much? contributions to 401 (k) retirement plans. Pension design and structure

Otters: One-turn topic transitions for open-domain dialogue

Predictive biases in natural language processing models: A conceptual framework and overview

Neural responding machine for short-text conversation

The woman worked as a babysitter: On biases in language generation

Revealing persona biases in dialogue systems

Membership inference attacks against machine learning models

Deploying lifelong open-domain dialogue learning. CoRR, abs

Perception of risk

Perceived risk, trust, and democracy

Trust, emotion, sex, politics, and science: Surveying the risk-assessment battlefield

If i look at the mass i will never act: Psychic numbing and genocide

Risk perception and affect. Current directions in psychological science

Risk as analysis and risk as feelings: Some thoughts about affect, reason, risk and rationality

Can you put it all together: Evaluating conversational agents' ability to blend skills

Controlling style in generated dialogue

Release strategies and the social impacts of language models

Process for adapting language models to society (palms) with values-targeted datasets

Energy and policy considerations for deep learning in NLP

Nudge: Improving decisions about health, wealth, and happiness. Penguin

Detecting 'dirt'and 'toxicity': Rethinking content moderation as pollution behaviour. Available at SSRN 3709719

Ask diana: A keyword-based chatbot system for water-related disaster management

Four-stage framework for implementing a chatbot system in disaster emergency operation data management: A flood disaster management case study

Rational choice and the framing of decisions. In Multiple criteria decision making and risk analysis using microcomputers

Loss aversion in riskless choice: A reference-dependent model. The quarterly journal of economics

Mamabot: a system based on ML and NLP for supporting women and families during pregnancy

Translating values into design requirements

Design for value change

Thinking ethically

Directions in abusive language training data: Garbage in, garbage out

Challenges and frontiers in abusive content detection

Learning from the worst: Dynamically generated datasets to improve online hate detection

A neural conversational model

PARADISE: A framework for evaluating spoken dialogue agents

Detect all abuse! toward universal abusive language detection models

The impact of culture on loss aversion

Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter

Eliza -a computer program for the study of natural language communication between man and machine

Chatbots RESET: A framework for governing responsible use of conversational AI in healthcare

Ex machina: Personal attacks seen at scale

Demoting racial bias in hate speech detection

Detoxifying language models risks marginalizing minority voices

Recipes for safety in open-domain chatbots

Bot-adversarial dialogue for safe conversational agents

Better conversations by modeling, filtering, and optimizing for coherence and diversity

Natural Language Processing with Python. o'reilly media

Depression and self-harm risk assessment in online forums

Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval)

Zeses Pitenis, and Ç agrı Çöltekin. Semeval-2020 task 12: Multilingual offensive language identification in social media (offenseval 2020)

Personalizing dialogue agents: I have a dog, do you have pets too?

Bertscore: Evaluating text generation with bert

Detecting and classifying malevolent dialogue responses: Taxonomy, data and methodology

DialoGPT: Large-scale generative pre-training for conversational response generation

Men also like shopping: Reducing gender bias amplification using corpus-level constraints

Challenges in automated debiasing for toxic language detection

Challenges in automated debiasing for toxic language detection

Thanks to Chloé Bakalar, Miranda Bogen, and Adina Williams for their helpful comments.Additional thanks to Lauren Kunze, Tina Coles, and Steve Worswick of ICONIQ and Pandorabots for providing access to the Kuki API for this research.Verena Rieser's and Gavin Abercrombie's contribution was supported by the EPSRC project 'Gender Bias in Conversational AI' (EP/T023767/1).

Union's Horizon 2020 research and innovation program (grant agreement No. 949944). He is a member and the scientific director of the Data and Marketing Insights Unit of the Bocconi Institute for Data Science and Analysis.