What Would You Ask the Machine Learning Model? Identification of User Needs for Model Explanations Based on Human-Model Conversations
Michał Kuźba, Przemysław Biecek
Date: 2020-02-07. DOI: 10.1007/978-3-030-65965-3_30

Abstract. Recently we have seen a rising number of methods in the field of eXplainable Artificial Intelligence. To our surprise, their development is driven by model developers rather than by a study of the needs of human end users. The analysis of needs, if done at all, takes the form of an A/B test rather than a study of open questions. To answer the question "What would a human operator like to ask the ML model?" we propose a conversational system explaining decisions of a predictive model. In this experiment, we developed a chatbot called dr_ant to talk about a machine learning model trained to predict survival odds on the Titanic. People can talk with dr_ant about different aspects of the model to understand the rationale behind its predictions. Having collected a corpus of 1000+ dialogues, we analyse the most common types of questions that users would like to ask. To our knowledge, this is the first study which uses a conversational system to collect the needs of human operators from interactive and iterative dialogue explorations of a predictive model.

Machine Learning models are widely adopted in all areas of human life. As they often become critical parts of automated systems, there is an increasing need for understanding their decisions and for the ability to interact with such systems. Hence, we are currently seeing the growth of the area of eXplainable Artificial Intelligence (XAI). For instance, Scantamburlo et al. [28] raise the issue of understanding machine decisions and their consequences on the example of computer-made decisions in criminal justice. This example touches upon such features as fairness, equality, transparency and accountability. Ribera & Lapedriza [26] identify the following motivations for designing and using explanations: system verification, including bias detection; improvement of the system (debugging); learning from the system's distilled knowledge; compliance with legislation, e.g. the "Right to explanation" set by the EU; and informing people affected by AI decisions.

We see a rising number of explanation methods, such as LIME [25] and SHAP [15], and XAI frameworks such as AIX360 [2], InterpretML [22], DALEX [4], modelStudio [3], exBERT [10] and many others. These systems require a systematic quality evaluation [8, 21, 13]. For instance, Tan et al. [32] describe the uncertainty of explanations and Molnar et al. [20] describe a way to quantify the interpretability of a model.

These methods and toolboxes are focused on the model developer's perspective. The most popular methods, like Partial Dependence Plots, LIME or SHAP, are tools for post-hoc model diagnostics rather than tools linked with the needs of end users. But it is important to design an explanation system for its addressee (the explainee). Both the form and the content of the system should be adjusted to the end user. And while explainees might not have AI expertise, explanations are often constructed by engineers and researchers for themselves [19], limiting their usefulness for other audiences [17]. Also, both the form and the content of the explanations should differ depending on the explainee's background and role in the model lifecycle.
Ribera & Lapedriza [26] describe three types of explainees: AI researchers and developers, domain experts and the lay audience. Tomsett et al. [33] introduce six groups: creators, operators, executors, decision-subjects, data-subjects and examiners. These roles are positioned differently in the machine learning pipeline. Users differ in their background and in their goal of using the explanation system. They vary in technical skills and in the language they use. Finally, explanations should have a comprehensible form: textual, visual or multimodal.

Explanation is a cognitive process and a social interaction [7]. Moreover, interactive exploration of the model makes it possible to personalize the explanations presented to the explainee [31]. Arya et al. identify a space for interactive explanations in a tree-shaped taxonomy of XAI techniques [2]. However, the AIX360 framework presented in that work implements only static explanations. Similarly, most of the other toolkits and methods focus entirely on the static branch of the explanation taxonomy. Sokol & Flach [29] propose a conversation using class-contrastive counterfactual statements. This idea is implemented as a conversational system for the lay audience of credit scoring systems [30]. Pecune et al. describe a conversational movie recommendation agent explaining its recommendations [23]. A rule-based, interactive and conversational agent for explainable AI is also proposed by Werner [35]. Madumal et al. propose an interaction protocol and identify the components of an explanation dialogue [16]. Finally, Miller [18] claims that truly explainable agents will use interactivity and communication.

To address these problems, we create an open-ended, dialogue-based explanation system. We develop a chatbot that allows the explainee to interact with a predictive model and its explanations. We implement this particular system for a random forest model trained on the Titanic dataset [1, 5]. However, any model trained on this dataset can be plugged into the system. Also, this approach can be applied successfully to other datasets, and many of the components can be reused.

Our goal is twofold. Firstly, we create a working prototype of a conversational system for XAI. Secondly, we want to discover what questions people ask to understand the model. This exploration is enabled by the open-ended nature of the chatbot: the user may ask any question, even if the system is unable to give a satisfying answer to each of them. There are engineering challenges in building a dialogue agent, and the "Wizard of Oz" proxy approach might be used as an alternative [31, 11]. In this work, however, we decide to build such a system. With this approach we obtain a working prototype and a scalable dialogue-collection process. As a result, we gain a better understanding of how to answer the explanatory needs of a human operator. With this knowledge, we will be able to create explanation systems tailored to the explainee's needs by addressing their questions. This is in contrast to developing new methods blindly or according to the judgement of their developers.

We outline the scope and capabilities of the dialogue agent (Section 2). In Section 3, we illustrate the architecture of the entire system and describe each of its components. We also demonstrate the agent's work on examples. Finally, in Section 4, we describe the experiment and analyze the collected dialogues.

The dialogue system is a multi-turn chatbot with user initiative.
It offers a conversation about the underlying random forest model trained on the well-known Titanic dataset. We deliberately select a black-box model with no direct interpretation, together with a dataset and a problem that can be easily imagined by a wider audience. The dialogue system was built to understand and respond to several groups of queries:

- Supplying data about the passenger, e.g. specifying age or gender. This step might be omitted by impersonating one of two predefined passengers with different model predictions.
- Inference - telling users what their chances of survival are. The model imputes missing variables.
- Visual explanations from the Explanatory Model Analysis toolbox [5]: Ceteris Paribus profiles [12] (addressing "what-if" questions) and Break Down plots [9] (presenting feature contributions). Note that these are offered as a warm start into the system, answering some of the anticipated queries. However, the principal purpose is to explore what other types of questions might be asked.
- Dialogue support queries, such as listing and describing the available variables or restarting the conversation.

The system was first trained with an initial set of training sentences and intents. After the deployment of the chatbot, it was iteratively retrained on the collected conversations. Those were used in two ways: 1) to add new intents, and 2) to extend the training set with actual user queries, especially those that were misclassified. The final version of the dialogue agent, used in the experiment in Section 4, consists of 40 intents and 874 training sentences.

Fig. 1. Overview of the system architecture. The explainee uses the system to talk about the black-box model. They interact with the system using one of the interfaces. The conversation is managed by the dialogue agent, which is created and trained by the chatbot admin. To create a response, the system queries the black-box model for its predictions and the explainers for visual explanations.

A top-level chatbot architecture is depicted in Figure 1. The system consists of several components:

- Human operator (explainee) - the addressee of the system. They chat about the black-box model and its predictions.
- Conversational platforms - the dialogue agent might be deployed to various conversational platforms, independently of the backend and of each other. The only exception to that is the rendering of some of the graphical, rich messages. We used a custom web integration as the major surface. It communicates with the dialogue agent's engine, sending requests with user queries and receiving text and graphical content. The frontend of the chatbot uses Vue.js and is based on the dialogflow repository. It provides a chat interface and renders rich messages, such as plots and suggestion buttons. This integration also allows a voice conversation using the browser's speech recognition and speech synthesis capabilities.
- Dialogue agent - the chatbot's engine, implemented using the Dialogflow framework and Node.js fulfilment code run on Google Cloud Functions. It implements state and context. The former is used to store the passenger's data, the latter to condition the response on more than the last query. For example, when the user sends a query with a number, it might be classified as an age or a fare specification depending on the current context.
- Natural-language generation (NLG) - the response generation system. To build the chatbot's utterance, the dialogue agent might need to use the explanations or the predictions. For this, the NLG component queries the explainers or the model correspondingly.
  Plots, images and suggestion buttons that are part of the chatbot response are rendered as rich messages on the frontend.
- Black-box model - a random forest model trained to predict the chance of survival on the Titanic. The model was trained in R [24] and converted into a REST API with the plumber package [34]. The random forest was trained with default hyperparameters. Data preprocessing includes the imputation of missing values. The performance of the model on the test dataset was 0.84 AUC and 0.73 F1 score.
- Explainers - a REST API exposing visual and textual model explanations from the iBreakDown [9] and CeterisParibus [12] libraries. They explore the black-box model to create an explanation. See the xai2cloud package [27] for more details. A minimal sketch of this model-and-explainer setup is given at the end of this section.
- Chatbot admin - a human operator and developer of the system. They can manually retrain the system based on misclassified intents and misextracted entities. For instance, this dialogue agent was iteratively retrained based on the initial subset of the collected dialogues.

This architecture works for any predictive model and tabular data. Its components differ in how easily they can be transferred to other tasks and datasets. The user interface is independent of the rest of the system. When the dataset is fixed, the model is interchangeable. However, the dialogue agent is handcrafted and depends on the dataset as well as on the explainers. A change of dataset needs to be reflected at least in an update of the data-specific entities and intents; for instance, a new set of variables needs to be covered. It is also followed by modifying the training sentences for the natural-language understanding (NLU) module, which is designed to guess an intent and extract the relevant parameters/entities from a user query, and perhaps by some changes in the generated utterances. Adding a new explainer might require adding a new intent; usually, we want to capture the user queries that can be addressed with the new explanation method. The source code is available at https://github.com/ModelOriented/xaibot.

An excerpt from an example conversation is presented in Figure 2. The corresponding intent classification flow is highlighted in Figure 3.
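To make the model and explainer components more concrete, below is a minimal sketch of how such a backend can be assembled in R: a random forest with default hyperparameters trained on the imputed Titanic data, wrapped in a DALEX explainer that produces Break Down and Ceteris Paribus explanations, and exposed through a plumber endpoint. The package calls, the file and route names, and the parameter list are illustrative assumptions, not the exact xaibot code.

```r
# backend.R -- illustrative sketch, not the exact xaibot implementation
library("randomForest")
library("DALEX")
library("plumber")

# Titanic data with missing values already imputed ships with DALEX
data("titanic_imputed", package = "DALEX")

# Black-box model: a random forest with default hyperparameters
rf_model <- randomForest(factor(survived) ~ ., data = titanic_imputed)

# Explainer object -- a single entry point for predictions and explanations
explainer <- DALEX::explain(
  rf_model,
  data    = titanic_imputed[, colnames(titanic_imputed) != "survived"],
  y       = titanic_imputed$survived,
  label   = "titanic_rf",
  verbose = FALSE
)

# A passenger assembled from the conversation; a row of the training data is
# reused as a template so that all factor levels stay valid
passenger <- titanic_imputed[1, ]
passenger$age <- 8

# Break Down plot: feature contributions behind this single prediction
plot(predict_parts(explainer, new_observation = passenger, type = "break_down"))

# Ceteris Paribus profiles: "what-if" curves for selected variables
plot(predict_profile(explainer, new_observation = passenger,
                     variables = c("age", "fare")))

#* Survival probability for a passenger described in the dialogue
#* @param age Passenger age in years
#* @param fare Ticket fare
#* @get /predict
function(age = 30, fare = 25) {
  p <- titanic_imputed[1, ]   # template row; the real bot fills in all collected fields
  p$age  <- as.numeric(age)
  p$fare <- as.numeric(fare)
  list(survival_probability = predict(explainer, p))
}
```

A file like this can be served with plumber::plumb("backend.R")$run(port = 8000), so that the Dialogflow fulfilment code only needs to call REST endpoints and render the returned content as rich messages.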
The initial subset of the collected dialogues was used to improve the NLU module of the dialogue agent. As a next step, we conduct an experiment by sharing the chatbot in the Data Science community and analyzing the collected dialogues. For this experiment, we work on data collected over two weeks. This is a subset of all collected dialogues, separate from the data used to train the NLU module. Narrowing the time scope of the experiment allows us to describe the audience and to ensure the coherence of the data. We then filter out conversations with totally irrelevant content and those with fewer than 3 user queries. Finally, we obtain 621 dialogues consisting of 5675 user queries in total. The average length equals 9.14 queries, the maximum 83 and the median 7. A histogram of conversation lengths is shown in Figure 4. Note that by conversation length we mean the number of user queries, which is equal to the number of turns in the dialogue (user query, chatbot response). The audience acquisition comes mostly from the R and Data Science community. Users are instructed to explore the model and its explanations individually. However, they might come across a demonstration of the chatbot's capabilities, potentially introducing a source of bias. We describe the results of the study in Section 4.2 and share the statistical details about the experiment audience in Section 4.3.

We analyze the content of the dialogues. Similar user queries, differing only in formulation, are manually grouped together. For each category, we calculate the number of conversations with at least one query of this type. The numbers of occurrences are presented in Table 1. Note that users were not prompted or hinted to ask any of these, with the exception of the "what do you know about me" question. Moreover, the taxonomy defined here is independent of the intents recognized by the NLU module and is defined based on the collected dialogues. Here is the list of the query types, ordered decreasingly by the number of conversations they occur in.

what do you know about me - this is the only query hinted to the user, using a suggestion button. When users input their data manually, it usually serves to understand what is still missing. However, in the scenario where the explainee impersonates a movie character, it also aids understanding of which information about the user the system possesses.
4. EDA - a general category on Exploratory Data Analysis. All questions related to the data rather than the model fall into this category. For instance: feature distribution, maximum values, plot a histogram for the variable v, describe/summarize the data, is the dataset imbalanced, how many women survived, dataset size, etc.
5. feature importance - here we group all questions about the relevance, influence, importance or effect of a feature on the prediction. We see several subtypes of this query. Examples: which class has the highest survival chance, are men more likely to die than women.
8. who has the best score - here, we ask about the observations that maximize/minimize the prediction. Examples: who survived/died, who is most likely to survive. It is similar to the how to improve question, but rather on a per-example basis.
9. model-related - these are queries related directly to the model, rather than to its predictions. We see questions about the algorithm and the code. We also see users asking about metrics (accuracy, AUC), the confusion matrix and confidence. However, these are observed just a few times.
10. contrastive - questions about why the predictions for two observations are different. We see it very rarely. However, more often we observe an implicit comparison as a follow-up question - for instance, what about other passengers, what about Jack.
11. plot interaction - follow-up queries to interact with the displayed visual content. Not observed.
12. similar observations - queries regarding "neighbouring" observations. For instance, what about people similar to me. Not observed.

We also see users creating alternative scenarios and comparing predictions for different observations manually, i.e. asking for a prediction multiple times with different passenger information. Additionally, we observe explainees asking about other sensitive features that are not included in the model, e.g. nationality, race or income. However, some of these, e.g. income, are strongly correlated with class and fare.

We use Google Analytics to get insights into the audience of the experiment. Users are distributed across 59 countries, with the top five (Poland, United States, United Kingdom, Germany and India, in this order) accounting for 63% of the users. Figure 5 presents demographic data on the subset of the audience (53%) for which this information is available.
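Many of the query categories above map directly onto existing explanation or data-analysis techniques. The sketch below is a rough illustration of such a mapping, reusing the hypothetical explainer and passenger objects from the earlier backend sketch; the choice of methods is ours and is not the mapping implemented in dr_ant.

```r
# Illustrative mapping of frequent query types to candidate answers.
# Assumes the `explainer`, `passenger` and `titanic_imputed` objects from the
# earlier backend sketch are already in the R session.
library("DALEX")

# "feature importance" -- dataset-level permutation importance
plot(model_parts(explainer))

# "who has the best score" -- observations that maximize the predicted survival
preds <- predict(explainer, titanic_imputed)
head(titanic_imputed[order(-preds), ], 3)

# "contrastive" -- compare Break Down attributions for two passengers
jack <- titanic_imputed[2, ]   # a second, hypothetical passenger
plot(predict_parts(explainer, new_observation = passenger, type = "break_down"))
plot(predict_parts(explainer, new_observation = jack, type = "break_down"))

# "EDA" -- many collected queries are answered by plain data summaries
table(titanic_imputed$gender, titanic_imputed$survived)
```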
Depending on the area of application, different needs are linked with the concept of interpretability [14, 33]. And even for a single area of application, different actors may have different needs related to model interpretability [2].

In this paper, we presented a novel application of a dialogue system to conversational explanations of a predictive model. The detailed contributions are as follows: (1) we presented a process based on a dialogue system that allows for the effective collection of user expectations related to model interpretation, (2) we presented an xai-bot implementation for a binary classification model trained on the Titanic data, and (3) we conducted an analysis of the collected dialogues.

We conduct this experiment on the survival model for the Titanic. However, the primary goal of this work is to understand user needs related to model explanations, rather than to improve this specific implementation. The knowledge gained from this experiment will aid in designing explanations for various models trained on tabular data. One example might be survival models for COVID-19, which currently attract a lot of interest.

The conversational agent proved to work as a tool to explore and extract user needs related to the use of Machine Learning models. This method allowed us to validate hypotheses and gather requirements for an XAI system on the example from the experiment. In this analysis, we identified several frequent patterns among user queries.

The conversational agent is also a promising, novel approach to XAI as a model-human interface. Users were given a tool for the interactive explanation of the model's predictions. In the future, such systems might be useful in bridging the gap between automated systems and their end users.

An interesting and natural extension of this work would be to compare user queries across different explainee groups in the system, e.g. model creators, operators, examiners and decision-subjects. In particular, it would be interesting to collect needs from explainees with no domain knowledge of Machine Learning. Similarly, it would be interesting to take advantage of the process introduced in this work to compare user needs across various areas of application, e.g. legal, medical and financial. Additionally, based on the analysis of the collected dialogues, we see two related areas that would benefit from conversational human-model interaction: Exploratory Data Analysis and model fairness, the latter motivated by the queries about sensitive and bias-prone features.

We would like to thank the three anonymous reviewers for their insightful comments and suggestions. Michał Kuźba was financially supported by the NCN Opus grant 2016/21/B/ST6/0217.

References
[1] Titanic dataset
[2] One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques
[3] modelStudio: Interactive Studio with Explanations for ML Predictive Models
[4] DALEX: Explainers for Complex Predictive Models in R
[5] Explanatory Model Analysis. Explore, Explain and Examine Predictive Models
[6] archivist: An R package for managing, recording and restoring data analysis results
[7] Towards XAI: Structuring the processes of explanations
[8] Explaining explanations: An approach to evaluating interpretability of machine learning
[9] Do Not Trust Additive Explanations. arXiv e-prints
[10] exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models
[11] Conversational interfaces for explainable AI: A human-centred approach
[12] pyCeterisParibus: explaining Machine Learning models with Ceteris Paribus Profiles in Python
[13] An evaluation of the human-interpretability of explanation
[14] The mythos of model interpretability
[15] A Unified Approach to Interpreting Model Predictions
[16] A grounded interaction protocol for explainable artificial intelligence
[17] Towards a grounded dialog model for explainable artificial intelligence
[18] Explanation in artificial intelligence: Insights from the social sciences
[19] Explainable AI: beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences
[20] Quantifying Interpretability of Arbitrary Machine Learning Models Through Functional Decomposition
[21] Explanation in Human-AI Systems: A Literature Meta-Review
[22] InterpretML: A Unified Framework for Machine Learning Interpretability
[23] A model of social explanations for a conversational movie recommendation system
[24] R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
[25] Why Should I Trust You?: Explaining the Predictions of Any Classifier
[26] Can we do better explanations? A proposal of user-centered explainable AI
[27] xai2cloud: Deploys An Explainer To The Cloud
[28] Machine decisions and human consequences
[29] Conversational Explanations of Machine Learning Predictions Through Class-contrastive Counterfactual Statements
[30] Glass-box: Explaining AI decisions with counterfactual statements through conversation with a voice-enabled virtual assistant
[31] One explanation does not fit all
[32] Why should you trust my interpretation? Understanding uncertainty in LIME predictions
[33] Interpretable to whom? A role-based model for analyzing interpretable machine learning systems
[34] plumber: An API Generator for R
[35] Explainable AI through rule-based interactive conversation